Similar Articles (20 results)
1.
We propose a novel “tree-averaging” model that uses an ensemble of classification and regression trees (CART). Each constituent tree is estimated on a subset of similar data. We treat this grouping of subsets as Bayesian ensemble trees (BET) and model it as a Dirichlet process. We show that BET determines the optimal number of trees by adapting to the heterogeneity of the data. Compared with other ensemble methods, BET requires far fewer trees and achieves equivalent prediction accuracy using weighted averaging. Moreover, each tree in BET provides a variable selection criterion and an interpretation for its subset. We develop an efficient estimation procedure with improved strategies for both the CART and mixture-model components. We demonstrate these advantages of BET with simulations and illustrate the approach with a real-world example involving regression of lung function measurements obtained from patients with cystic fibrosis. Supplementary materials for this article are available online.
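A minimal sketch of the weighted tree-averaging idea, under strong simplifying assumptions: the data are grouped with plain k-means (the actual BET model infers this grouping via a Dirichlet process), one CART tree is fit per subset, and predictions are averaged with subset-size weights. All names and data here are illustrative.

```python
# Toy sketch of tree-averaging over data subsets (illustrative only;
# BET infers the grouping with a Dirichlet process, not k-means).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=600)

# Group similar observations; BET would learn this grouping instead.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

trees, weights = [], []
for k in np.unique(labels):
    idx = labels == k
    trees.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))
    weights.append(idx.mean())  # weight each tree by its subset size

def predict(X_new):
    preds = np.stack([t.predict(X_new) for t in trees])
    return np.average(preds, axis=0, weights=weights)

print(predict(X[:3]), y[:3])
```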

2.
The logistic regression framework has long been the most widely used statistical method for assessing customer credit risk. Recently, a more pragmatic approach has been adopted, in which the primary concern is credit risk prediction rather than explanation. In this context, several classification techniques, such as support vector machines, have been shown to perform well on credit scoring. While the search for better classifiers is an important research topic, the methodology chosen in real-world applications must cope with the challenges arising from data collected in industry. Such data are often highly unbalanced, part of the information may be missing, and common hypotheses, such as the i.i.d. assumption, may be violated. In this paper we present a case study based on a sample of IBM Italian customers that exhibits all of these challenges. The main objective is to build and validate robust models able to handle missing information, class imbalance, and non-i.i.d. data points. We define a missing-data imputation method and propose the use of an ensemble classification technique, subagging, which is particularly suitable for highly unbalanced data such as credit scoring data. Both the imputation and subagging steps are embedded in a customized cross-validation loop that handles dependencies between different credit requests. The methodology has been applied with several classifiers (kernel support vector machines, nearest neighbors, decision trees, AdaBoost) and their subagged versions. Subagging improves the performance of the base classifier, and we show that subagged decision trees achieve the best performance while keeping the model simple and reasonably interpretable.
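Subagging is subsample aggregating: each base learner is fit on a random subsample drawn without replacement rather than a bootstrap sample. A hedged sketch, assuming scikit-learn (not the paper's code, and with synthetic stand-in data for the unbalanced credit sample):

```python
# Subagging (subsample aggregating): bagging with sampling *without*
# replacement. Sketch on synthetic unbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)  # ~5% "bad" class

subagged = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),
    n_estimators=50,
    max_samples=0.5,   # each tree sees half of the data ...
    bootstrap=False,   # ... drawn without replacement (subagging)
    random_state=0)

print(cross_val_score(subagged, X, y, cv=5, scoring="roc_auc").mean())
```

Note that the paper embeds this step inside a customized cross-validation loop that respects dependencies between credit requests; the plain `cv=5` above ignores that complication.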

3.
While statistical learning methods have proved to be powerful tools for predictive modeling, the black-box nature of the models they produce can severely limit interpretability and the ability to conduct formal inference. However, the natural structure of ensemble learners like bagged trees and random forests has been shown to admit desirable asymptotic properties when base learners are built with proper subsamples. In this work, we demonstrate that by defining an appropriate grid structure on the covariate space, we may carry out formal hypothesis tests for both variable importance and underlying additive model structure. To our knowledge, these tests represent the first statistical tools for investigating the underlying regression structure in a context such as random forests. We develop notions of total and partial additivity and further demonstrate that testing can be carried out at no additional computational cost by estimating the variance within the process of constructing the ensemble. Furthermore, we propose a novel extension of these testing procedures using random projections, allowing computationally efficient tests that retain high power even when the grid size is much larger than that of the training set.

4.
Building a scientific, effective, and accurate air-quality prediction system has important scientific value and practical significance for protecting public health and promoting social harmony and stability. This study focuses on chemical industry parks. Based on real-time enterprise emission data collected in an Internet-of-Things setting, combined with meteorological information, it applies a range of supervised machine-learning methods (decision trees, multiple linear regression, Lasso regression, support vector machines, XGBoost, gradient boosting machines, LightGBM, and MLP (multilayer perceptron neural networks)) together with an improved Stacking ensemble strategy to predict air quality in chemical industry parks and to identify the key factors driving air pollution. The results show that: (1) the Stacking-based prediction framework yields a statistically significant improvement over single-model predictions; (2) within the Stacking strategy, the choice of first- and second-level learners affects prediction accuracy and generalization, and the best configuration uses strong learners at the first level and a linear model at the second level; (3) within the same park, pollutants from different outlets of different enterprises affect air quality differently. These conclusions can support government regulators in the governance and control of chemical industry parks.
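A hedged sketch of the reported best configuration, strong first-level learners combined by a linear second-level model, assuming scikit-learn; the synthetic data here merely stands in for the emissions-plus-weather features:

```python
# Sketch of the best-performing Stacking configuration described above:
# strong first-level learners, linear second-level learner.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=12, noise=5.0,
                       random_state=0)  # stand-in for emissions + weather

stack = StackingRegressor(
    estimators=[("gbm", GradientBoostingRegressor(random_state=0)),
                ("rf", RandomForestRegressor(n_estimators=200,
                                             random_state=0))],
    final_estimator=LinearRegression(),  # linear model at the second level
    cv=5)

print(cross_val_score(stack, X, y, cv=5, scoring="r2").mean())
```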

5.
The purpose of this paper is to develop an early warning system to predict currency crises. A data set covering the period January 1992–December 2011 for the Turkish economy is used, and an early warning system is developed with artificial neural networks (ANN), decision trees, and logistic regression models. The Financial Pressure Index (FPI) is an aggregate value composed of the percentage changes in the dollar exchange rate, the gross foreign exchange reserves of the Central Bank, and the overnight interest rate. In this study, FPI is the dependent variable, and thirty-two macroeconomic indicators are the independent variables. The three models, tested on Turkish crisis cases, gave clear signals that predicted the 1994 and 2001 crises 12 months in advance. Considering all three prediction models, Turkey's economy was not expected to have a currency crisis (ceteris paribus) until the end of 2012. The study is unique in that the decision-support model it develops uses basic macroeconomic indicators to predict crises up to a year before they actually happen, with an accuracy rate of approximately 95%. It also ranks the leading factors of currency crisis by their importance in predicting the crisis.
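A sketch of how an FPI-style index can be assembled and thresholded into crisis labels; the equal component weights and the mean-plus-two-standard-deviations rule below are assumptions for illustration, not the paper's exact specification, and the series are simulated:

```python
# Sketch of an FPI-style index; component weights and the crisis
# threshold (mean + 2 sd) are assumptions, not the paper's exact rule.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "usd_rate": np.cumprod(1 + rng.normal(0.01, 0.03, 240)),
    "fx_reserves": np.cumprod(1 + rng.normal(0.005, 0.02, 240)),
    "overnight_rate": 20 + rng.normal(0, 3, 240),
})

fpi = (df["usd_rate"].pct_change()
       - df["fx_reserves"].pct_change()    # reserve losses add pressure
       + df["overnight_rate"].pct_change())

crisis = fpi > fpi.mean() + 2 * fpi.std()  # 1 = crisis signal
print(crisis.sum(), "crisis months flagged out of", len(fpi.dropna()))
```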

6.
In this paper, we study the performance of various state-of-the-art classification algorithms applied to eight real-life credit scoring data sets. Some of the data sets originate from major Benelux and UK financial institutions. Different types of classifiers are evaluated and compared. Besides the well-known classification algorithms (e.g., logistic regression, discriminant analysis, k-nearest neighbour, neural networks, and decision trees), this study also investigates the suitability and performance of some recently proposed, advanced kernel-based classification algorithms such as support vector machines and least-squares support vector machines (LS-SVMs). Performance is assessed using the classification accuracy and the area under the receiver operating characteristic curve. Statistically significant performance differences are identified using the appropriate test statistics. It is found that both the LS-SVM and neural network classifiers yield very good performance, but simple classifiers such as logistic regression and linear discriminant analysis also perform very well for credit scoring.
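A hedged sketch of this kind of benchmark, cross-validated accuracy and AUC over several standard classifiers, assuming scikit-learn; an RBF-kernel SVM stands in for the LS-SVM, which scikit-learn does not provide, and the data are synthetic:

```python
# Sketch of a classifier benchmark by accuracy and AUC (cross-validated).
# An RBF-kernel SVM stands in for the LS-SVM here.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, random_state=0)
models = {
    "logit": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(max_depth=5),
    "svm": SVC(kernel="rbf"),
}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5,
                        scoring=["accuracy", "roc_auc"])
    print(f"{name}: acc={cv['test_accuracy'].mean():.3f} "
          f"auc={cv['test_roc_auc'].mean():.3f}")
```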

7.
Recent years have seen the development of many credit scoring models for assessing the creditworthiness of loan applicants. Traditional credit scoring methodology has relied on statistical and mathematical programming techniques such as discriminant analysis, linear and logistic regression, linear and quadratic programming, and decision trees. However, the importance of credit-granting decisions for financial institutions has generated growing interest in a variety of computational intelligence techniques. This paper concentrates on evolutionary computing, which is viewed as one of the most promising paradigms of computational intelligence. Taking into account the synergistic relationship between the Economics and Computer Science communities, the aim of this paper is to summarize the most recent developments in the application of evolutionary algorithms to credit scoring by means of a thorough review of scientific articles published during the period 2000–2012.

8.
Supervised classification learning can be considered an important tool for decision support. In this paper, we present a method for supervised classification learning that ensembles decision trees obtained via convex sets of probability distributions (also called credal sets) and uncertainty measures. Our method forces the use of different decision trees and has the following main characteristics: it obtains a good percentage of correct classifications and an improvement in processing time compared with known classification methods; it does not need the number of decision trees to be fixed in advance; and it can be parallelized for application to very large data sets.
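Credal sets over class probabilities are commonly built in this literature with the imprecise Dirichlet model (IDM), which turns the class counts at a node into probability intervals; assuming that construction, a minimal sketch:

```python
# Credal set via the imprecise Dirichlet model (IDM), a common
# construction for credal decision trees; s is the IDM hyperparameter.
import numpy as np

def idm_intervals(counts, s=1.0):
    """Probability intervals [n_c/(N+s), (n_c+s)/(N+s)] per class."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return np.stack([counts / (n + s), (counts + s) / (n + s)], axis=1)

# Class counts reaching a candidate tree node:
print(idm_intervals([30, 10]))    # fewer data -> wider intervals
print(idm_intervals([300, 100]))  # more data -> tighter credal set
```

Split selection then scores candidate splits with an uncertainty measure over these credal sets (e.g., maximum entropy) instead of a point-estimate impurity.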

9.
Several activity-based transportation models are now becoming operational and are entering the stage of application for modelling travel demand. Some of these models use decision rules to support decision-making instead of principles of utility maximization. Decision rules can be derived from different modelling approaches. In a previous study, it was shown that Bayesian networks outperform decision trees and are better suited to capturing the complexity of the underlying decision-making. However, one disadvantage is that Bayesian networks are somewhat limited in terms of interpretation and efficiency when rules are derived from the network, whereas rules derived from decision trees generally have a simple and direct interpretation. Therefore, in this study, the idea of combining decision trees and Bayesian networks was explored in order to retain the potential advantages of both techniques. The paper reports the findings of a methodological study conducted in the context of Albatross, a sequential rule-based model of activity scheduling behaviour. To this end, the paper can be situated within a series of previous publications by the authors aimed at improving decision-making in Albatross. The results suggest that integrated Bayesian networks and decision trees can be used for modelling the different choice facets of Albatross with better predictive power than CHAID decision trees. Another conclusion is that there are initial indications that this new way of integrating decision trees and Bayesian networks produces a decision tree that is structurally more stable.

10.
Variable selection has consistently been a hot topic in linear regression models, especially when dealing with high-dimensional data. Variable ranking, an advanced form of selection, is actually more fundamental, since selection can be realized by thresholding once the variables are suitably ranked. In recent years, ensemble learning has gained significant interest in the context of variable selection due to its great potential to improve selection accuracy and to reduce the risk of falsely including unimportant variables. Motivated by the widespread success of boosting algorithms, a novel ensemble method, PBoostGA, is developed in this paper to implement variable ranking and selection in linear regression models. In PBoostGA, a weight distribution is maintained over the training set, and a genetic algorithm is adopted as the base learner. Initially, equal weight is assigned to each instance. Following a weight-updating and ensemble-member-generating mechanism like that of AdaBoost.RT, a series of slightly different importance measures is sequentially produced for each variable. Finally, the candidate variables are ordered in light of their average importance measure, and the significant variables are then selected by a thresholding rule. Both simulation results and a real-data illustration show the effectiveness of PBoostGA in comparison with some existing counterparts. In particular, PBoostGA has a stronger ability to exclude redundant variables.
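The GA-based generation of each importance vector is too involved for a short sketch, so the stub below fakes those rounds with synthetic importance measures; only the final aggregate-rank-threshold step is genuinely illustrated, and the 50%-of-maximum cutoff is an assumption:

```python
# Sketch of the final ranking-and-thresholding step only: average the
# importance measures from the ensemble rounds, rank, then threshold.
# The GA/AdaBoost.RT rounds are stubbed with synthetic vectors here.
import numpy as np

rng = np.random.default_rng(0)
p, rounds = 10, 25
true_signal = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
importances = true_signal + rng.normal(0, 0.3, size=(rounds, p))

avg = importances.mean(axis=0)
ranking = np.argsort(avg)[::-1]                # most important first
selected = np.where(avg > 0.5 * avg.max())[0]  # illustrative threshold
print("ranking:", ranking, "selected:", selected)
```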

11.
A new logistic regression algorithm based on evolutionary product-unit (PU) neural networks is used in this paper to determine the assets that influence the decision of poor households to cultivate non-traditional crops (NTC) in the Guatemalan Highlands. In order to evaluate high-order covariate interactions, PUs were treated as independent variables in product-unit neural networks (PUNNs), and two different models were analysed, one including the initial covariates (logistic regression by the product-unit and initial covariate model) and one without them (logistic regression by the product-unit model). Our results were compared with those obtained using a standard logistic regression model, and they allow us to interpret the most relevant household assets and their complex interactions in the adoption of NTC, in order to aid the design of rural policies.
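A product unit computes a multiplicative feature, prod_i x_i^w_i = exp(sum_i w_i log x_i), so high-order interactions enter a logistic model as single covariates. A sketch with fixed, illustrative exponents (the paper evolves them with an evolutionary algorithm):

```python
# A product unit computes prod_i x_i**w_i = exp(sum_i w_i * log(x_i)).
# Exponents W are fixed here for illustration; the paper evolves them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(500, 3))        # PUs need positive inputs
y = (X[:, 0] * X[:, 1] ** 2 > 1.0).astype(int)  # interaction drives class

W = np.array([[1.0, 2.0, 0.0],                  # exponents of PU 1
              [0.0, 1.0, -1.0]])                # exponents of PU 2
PU = np.exp(np.log(X) @ W.T)                    # product-unit features

clf = LogisticRegression().fit(PU, y)           # logistic regression on PUs
print(clf.score(PU, y))
```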

12.
Regression trees are a popular alternative to classical regression methods. A number of approaches exist for constructing regression trees. Most of these techniques, including CART, are sequential in nature and locally optimal at each node split, so the final tree solution found may not be the best tree overall. In addition, small changes in the training data often lead to large changes in the final result due to the relative instability of these greedy tree-growing algorithms. Ensemble techniques, such as random forests, attempt to take advantage of this instability by growing a forest of trees from the data and averaging their predictions. The predictive performance is improved, but the simplicity of a single-tree solution is lost.

In earlier work, we introduced the Tree Analysis with Randomly Generated and Evolved Trees (TARGET) method for constructing classification trees via genetic algorithms. In this article, we extend the TARGET approach to regression trees. Simulated and real-world data are used to illustrate the TARGET process and to compare its performance with CART, Bayesian CART, and random forests. The empirical results indicate that TARGET regression trees have better predictive performance than recursive partitioning methods, such as CART, and single-tree stochastic search methods, such as Bayesian CART. The predictive performance of TARGET is slightly worse than that of ensemble methods, such as random forests, but the TARGET solutions are far more interpretable.
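TARGET evolves whole trees, which is beyond a short sketch; the toy below evolves only a single regression stump (one feature, one threshold) to convey the contrast with greedy splitting, namely that the split is found by global stochastic search over a fitness function rather than locally. Population sizes, mutation rates, and the SSE fitness are all illustrative choices:

```python
# Toy genetic search over regression stumps (one split), to convey the
# idea of evolving tree structure globally instead of greedy splitting.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 4))
y = np.where(X[:, 2] > 1.0, 5.0, 0.0) + rng.normal(0, 0.5, 400)

def sse(gene):                     # gene = (feature index, threshold)
    j, t = int(gene[0]), gene[1]
    left, right = y[X[:, j] <= t], y[X[:, j] > t]
    return sum(((g - g.mean()) ** 2).sum() for g in (left, right) if g.size)

pop = [(rng.integers(4), rng.uniform(-3, 3)) for _ in range(30)]
for _ in range(40):                # evolve: keep the fittest, mutate them
    pop.sort(key=sse)
    parents = pop[:10]
    pop = parents + [(p[0] if rng.random() < 0.8 else rng.integers(4),
                      p[1] + rng.normal(0, 0.3))
                     for p in parents for _ in range(2)]
print("best split:", min(pop, key=sse))
```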

13.
Bayesian networks are one of the most widely used tools for modeling multivariate systems. It has been demonstrated that more expressive models, which can capture additional structure in each conditional probability table (CPT), may enjoy improved predictive performance over traditional Bayesian networks despite having fewer parameters. Here we investigate this phenomenon for models of varying degrees of expressiveness on extensive synthetic and real data. To characterize the regularities within CPTs in terms of independence relations, we introduce the notion of partial conditional independence (PCI) as a generalization of the well-known concept of context-specific independence (CSI). To model the structure of the CPTs, we use different graph-based representations that are convenient from a learning perspective. In addition to the previously studied decision trees and graphs, we introduce PCI-trees as a natural extension of CSI-based trees. To identify plausible models we use the Bayesian score in combination with a greedy search algorithm. A comparison against ordinary Bayesian networks shows that models with local structure generally enjoy parametric sparsity and improved out-of-sample predictive performance; however, it is often necessary to regulate the model fit with an appropriate prior on model structure to avoid overfitting in the learning process. The tree structures, in particular, lead to high-quality models and suggest considerable potential for further exploration.

14.
15.
In this paper, we develop a method for localizing the active domains of the brain from electroencephalography signals on the basis of ensembles of random decision trees. We suggest a way of reducing the localization problem to the problem of classifying the quality of dipole sources. We present a localization algorithm, which consists of constructing an ensemble of decision trees for the parameters of the dipole sources (which are responsible for the approximation of the potential detected at various time instants) and finding the most probable source regions via a special voting procedure. It is demonstrated that the decision-tree approach allows stable determination of the parameters of the transfer function between the source and the detected signal, which is essential for constructing a brain-computer interface. The method is shown to converge both on exact solutions and on signals from real experiments through the analysis of evoked potentials. Original Russian Text E.A. Popova, 2008, published in Vestnik Moskovskogo Universiteta. Vychislitel'naya Matematika i Kibernetika, 2008, No. 3, pp. 46–55.

16.
We investigate the application of ensemble transform approaches to Bayesian inference for logistic regression problems. Our approach relies on appropriate extensions of the popular ensemble Kalman filter and the feedback particle filter to the cross-entropy loss function, and is based on a well-established homotopy approach to Bayesian inference. The arising finite-particle evolution equations, as well as their mean-field limits, are affine-invariant. Furthermore, the proposed methods can be implemented in a gradient-free manner in the case of nonlinear logistic regression, and the data can be randomly subsampled, similar to mini-batching in stochastic gradient descent. We also propose a closely related SDE-based sampling method, which again is affine-invariant and can easily be made gradient-free.
Numerical examples demonstrate the appropriateness of the proposed methodologies.
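The full ensemble-transform machinery is beyond a short sketch, but the flavour of particle-based Bayesian logistic regression can be conveyed with a plain unadjusted Langevin ensemble, a deliberately simpler, gradient-based relative of the affine-invariant methods described above. Data, prior, and step size below are illustrative:

```python
# Particle ensemble sampling a Bayesian logistic-regression posterior
# via unadjusted Langevin dynamics (simpler than the paper's methods).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -1.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def grad_log_post(W):              # W: (particles, d); N(0, 10 I) prior
    p = 1 / (1 + np.exp(-W @ X.T))           # (particles, n)
    return (y - p) @ X - W / 10.0            # likelihood + prior gradients

W = rng.normal(size=(50, d))                 # ensemble of 50 particles
h = 1e-2
for _ in range(2000):                        # unadjusted Langevin steps
    W += h * grad_log_post(W) + np.sqrt(2 * h) * rng.normal(size=W.shape)

print("posterior mean:", W.mean(axis=0), "truth:", w_true)
```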

17.
Companies' interest in customer relationship modelling and in key issues such as customer lifetime value and churn has increased substantially over the years. However, the complexity of building, interpreting, and applying these models creates obstacles to their implementation. The main contribution of this paper is to show how domain knowledge can be incorporated in the data mining process for churn prediction: first, through the evaluation of coefficient signs in a logistic regression model, and second, by analysing a decision table (DT) extracted from a decision tree or rule-based classifier. An algorithm to check DTs for violations of monotonicity constraints is presented, which involves the repeated application of condition reordering and table contraction to detect counter-intuitive patterns. Both approaches are applied to two telecom data sets to demonstrate empirically how domain knowledge can be used to ensure the interpretability of the resulting models.
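A minimal sketch of the first approach, comparing fitted logistic-regression coefficient signs against domain expectations; the variable names, expected signs, and data-generating process are made up for illustration:

```python
# Sketch of the coefficient-sign check: compare fitted logistic-regression
# signs against domain knowledge. Variables and expected signs are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
tenure = rng.uniform(0, 10, n)       # longer tenure -> less churn
complaints = rng.poisson(1.0, n)     # more complaints -> more churn
logit = -0.5 * tenure + 0.8 * complaints
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([tenure, complaints])
expected_sign = {"tenure": -1, "complaints": +1}   # domain knowledge

clf = LogisticRegression().fit(X, y)
for name, coef in zip(expected_sign, clf.coef_[0]):
    ok = np.sign(coef) == expected_sign[name]
    print(f"{name}: coef={coef:+.2f} matches domain sign: {ok}")
```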

18.
Automatic construction of decision trees for classification
An algorithm for learning decision trees for classification and prediction is described, which converts real-valued attributes into intervals using statistical considerations. The trees are automatically pruned with the help of a threshold on the estimated class probabilities within an interval. By means of this threshold the user can control the complexity of the tree, i.e. the degree of approximation of class regions in feature space. Costs can be included in the learning phase if a cost matrix is given; in this case, class-dependent thresholds are used. Some applications are described, in particular the task of predicting high water levels in a mountain river.
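A hedged sketch of the interval idea: discretize a real-valued attribute into intervals and keep only those whose estimated class probability clears a user-set threshold. Quantile bins and the 0.8 threshold below stand in for the paper's statistical criterion:

```python
# Sketch of interval-based discretization with a class-probability
# threshold; quantile bins stand in for the paper's statistical criterion.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                             # real-valued attribute
y = (x > 0.5).astype(int) ^ (rng.random(1000) < 0.1)  # noisy labels

edges = np.quantile(x, np.linspace(0, 1, 9))  # 8 candidate intervals
threshold = 0.8                               # user-set complexity control
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x < hi)
    p = y[mask].mean()                        # estimated class-1 probability
    if max(p, 1 - p) >= threshold:            # confident interval -> keep
        print(f"[{lo:+.2f}, {hi:+.2f}) -> class {int(p >= 0.5)} (p={p:.2f})")
```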

19.
Estimation of probability density functions (PDFs) is a fundamental concept in statistics. This paper proposes an ensemble learning approach to density estimation using Gaussian mixture models (GMMs). Ensemble learning is closely related to model averaging: while standard model selection determines the single most suitable GMM, the ensemble approach combines a subset of GMMs in order to improve the precision and stability of the estimated probability density function. The ensemble GMM is investigated theoretically, and numerical experiments are conducted to demonstrate the benefits of the model. These evaluations show promising results for classification and for the approximation of non-Gaussian PDFs.
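A minimal sketch of the ensemble idea, assuming scikit-learn: fit GMMs with several component counts and average their densities. The equal-weight combination rule is an illustrative choice, not necessarily the paper's:

```python
# Sketch of an ensemble GMM: average the densities of several mixtures
# with different component counts (equal weights are illustrative).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300),
                       rng.normal(1, 1.0, 700)]).reshape(-1, 1)

gmms = [GaussianMixture(n_components=k, random_state=0).fit(data)
        for k in (1, 2, 3, 4)]

def ensemble_pdf(x):
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    # score_samples returns log-density; average the densities equally.
    return np.mean([np.exp(g.score_samples(x)) for g in gmms], axis=0)

print(ensemble_pdf([-2.0, 0.0, 1.0]))
```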

20.
This paper combines the use of (binary) logistic regression and stochastic frontier analysis to assess the operational effectiveness of the UK Coastguard (Maritime Rescue) coordination centres over the period 1995–1998. In particular, the rationale for the Government's decision, confirmed in 1999, to close a number of coordination centres is scrutinized. We conclude that the regression models developed in this paper represent a performance measurement framework that is considerably more realistic and complex than the one apparently used by the UK Government. Furthermore, we have found that the coordination centres selected for closure were not necessarily the ones that were least effective in their primary purpose, namely to save lives. In a related paper, we demonstrate how the regression models developed here can be used to inform the application of data envelopment analysis to this case.
