Similar documents
20 similar documents found, search time 31 ms
1.
Credit risk models are commonly based on large internal data sets to produce reliable estimates of the probability of default (PD) that should be validated over time. However, in the real world, a substantial portion of the exposures is included in low-default portfolios (LDPs), in which the number of defaulted loans is usually much lower than the number of non-default observations. Modelling these imbalanced data sets is particularly problematic for small portfolios, in which the absence of information increases the specification error. Sovereigns, banks, and specialised retail exposures are recent examples of post-crisis portfolios with insufficient data for PD estimation, which require specific tools for risk quantification and validation. This paper explores the suitability of cooperative strategies for managing such scarce LDPs: in addition to statistical and machine-learning classifiers, it examines cooperative models and bootstrapping strategies for default prediction and multi-grade PD setting, using two real-world consumer credit data sets. The performance is assessed in terms of out-of-sample and out-of-time discriminatory power, PD calibration, and stability. The results indicate that combinational approaches based on correlation-adjusted strategies are promising techniques for managing sparse LDPs and providing accurate and well-calibrated credit risk estimates.

2.
We derive Bayesian confidence intervals for the probability of default (PD), asset correlation (Rho), and serial dependence (Theta) for low-default portfolios (LDPs). The goal is to reduce the probability of underestimating credit risk in LDPs. We adopt a generalized method of moments with continuous updating to estimate prior distributions for PD and Rho from historical default data. The method is based on a Bayesian approach without expert opinions. A Markov chain Monte Carlo technique, namely the Gibbs sampler, is also applied. The performance of the estimation results for LDPs is validated by Monte Carlo simulations. Empirical studies on Standard & Poor's historical default data are also conducted.
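To make the flavour of this approach concrete, here is a minimal sketch of Bayesian interval estimation for the PD of a low-default portfolio. It uses a toy random-walk Metropolis sampler for PD under a Binomial likelihood with a Beta prior; the paper's actual method is a Gibbs scheme that also treats asset correlation (Rho) and serial dependence (Theta), and the default counts and prior parameters below are invented for illustration.

```python
import numpy as np

# Toy MCMC illustration: posterior for PD in a scarce low-default portfolio.
rng = np.random.default_rng(0)
defaults, obligors = 1, 400          # scarce LDP default history (illustrative)
a_prior, b_prior = 1.0, 99.0         # illustrative Beta prior on PD

def log_post(pd):
    """Log posterior of PD: Binomial likelihood times Beta prior (unnormalised)."""
    if not 0 < pd < 1:
        return -np.inf
    return (defaults * np.log(pd) + (obligors - defaults) * np.log(1 - pd)
            + (a_prior - 1) * np.log(pd) + (b_prior - 1) * np.log(1 - pd))

samples, pd = [], 0.01
for _ in range(20_000):
    prop = pd + rng.normal(scale=0.002)              # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(pd):
        pd = prop                                    # Metropolis accept step
    samples.append(pd)

post = np.array(samples[5_000:])                     # drop burn-in
print(f"posterior mean PD = {post.mean():.4%}")
print(f"95% credible interval = ({np.quantile(post, 0.025):.4%}, "
      f"{np.quantile(post, 0.975):.4%})")
```

Even with a single observed default, the posterior keeps the PD strictly positive and yields an interval, which is the point of Bayesian intervals for LDPs: they guard against underestimating credit risk when defaults are rare.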

3.
The support vector machine (SVM) is known for its good performance in two-class classification, but its extension to multiclass classification is still an ongoing research issue. In this article, we propose a new approach for classification, called the import vector machine (IVM), which is built on kernel logistic regression (KLR). We show that the IVM not only performs as well as the SVM in two-class classification, but also can naturally be generalized to the multiclass case. Furthermore, the IVM provides an estimate of the underlying probability. Similar to the support points of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM. This gives the IVM a potential computational advantage over the SVM.
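A simplified greedy sketch of the import-vector idea follows: grow a small set of kernel basis points by repeatedly adding the candidate that most improves a regularized kernel logistic fit. This approximates, but is not, the exact IVM algorithm; the kernel width, regularization strength, and number of import points are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.metrics.pairwise import rbf_kernel

def greedy_import_points(X, y, n_import=5, gamma=0.5):
    """Greedily pick training points whose kernel columns best explain y."""
    K = rbf_kernel(X, X, gamma=gamma)                 # full kernel matrix
    import_set, remaining = [], list(range(len(X)))
    for _ in range(n_import):
        best_j, best_loss = None, np.inf
        for j in remaining:
            cols = import_set + [j]
            clf = LogisticRegression(C=1.0, max_iter=200).fit(K[:, cols], y)
            p = clf.predict_proba(K[:, cols])[:, 1]   # fitted class probabilities
            loss = log_loss(y, p)                     # negative log-likelihood
            if loss < best_loss:
                best_loss, best_j = loss, j
        import_set.append(best_j)
        remaining.remove(best_j)
    return import_set

# toy usage on a synthetic two-class problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
print("selected import points:", greedy_import_points(X, y))
```

Because only the selected columns of the kernel matrix are kept, predictions need kernel evaluations against a handful of points rather than all training data, which is the computational advantage the abstract refers to.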

4.
Supervised classification learning can be considered an important tool for decision support. In this paper, we present a method for supervised classification learning that ensembles decision trees obtained via convex sets of probability distributions (also called credal sets) and uncertainty measures. Our method forces the use of different decision trees and has mainly the following characteristics: it obtains a good percentage of correct classifications and an improvement in processing time compared with known classification methods; it does not need the number of decision trees to be fixed in advance; and it can be parallelized for application to very large data sets.

5.
Uncertainty is a concept associated with data acquisition and analysis, usually appearing in the form of noise or measurement error, often due to some technological constraint. In supervised learning, uncertainty affects classification accuracy and yields low-quality solutions. For this reason, it is essential to develop machine learning algorithms able to handle imprecise data efficiently. In this paper we study this problem from a robust optimization perspective. We consider a supervised learning algorithm based on generalized eigenvalues and provide a robust counterpart formulation and solution in the case of ellipsoidal uncertainty sets. We demonstrate the performance of the proposed robust scheme on artificial and benchmark datasets from the University of California Irvine (UCI) machine learning repository and compare the results against a robust implementation of Support Vector Machines.

6.
In this paper, we study the performance of various state-of-the-art classification algorithms applied to eight real-life credit scoring data sets. Some of the data sets originate from major Benelux and UK financial institutions. Different types of classifiers are evaluated and compared. Besides the well-known classification algorithms (e.g. logistic regression, discriminant analysis, k-nearest neighbour, neural networks and decision trees), this study also investigates the suitability and performance of some recently proposed, advanced kernel-based classification algorithms such as support vector machines and least-squares support vector machines (LS-SVMs). The performance is assessed using the classification accuracy and the area under the receiver operating characteristic curve. Statistically significant performance differences are identified using the appropriate test statistics. It is found that both the LS-SVM and neural network classifiers yield very good performance, but simple classifiers such as logistic regression and linear discriminant analysis also perform very well for credit scoring.
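A hedged illustration of this benchmarking protocol is sketched below on a synthetic, class-imbalanced "credit scoring" data set (the study's Benelux/UK data are not public), reporting percentage correctly classified and AUC per model. The data generator, class weights, and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced two-class problem standing in for a credit data set
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=500),
    "linear discriminant": LinearDiscriminantAnalysis(),
    "k-nearest neighbour": KNeighborsClassifier(n_neighbors=15),
    "decision tree":       DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM (RBF kernel)":    SVC(probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    score = model.predict_proba(X_te)[:, 1]          # probability of default class
    print(f"{name:22s} acc={accuracy_score(y_te, model.predict(X_te)):.3f} "
          f"AUC={roc_auc_score(y_te, score):.3f}")
```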

7.
Local search methods are widely used to improve the performance of evolutionary computation algorithms in all kinds of domains. Employing advanced and efficient exploration mechanisms becomes crucial in complex and very large (in terms of search space) problems, such as when applying evolutionary algorithms to large-scale data mining tasks. Recently, the GAssist Pittsburgh evolutionary learning system was extended with memetic operators for discrete representations that use information from the supervised learning process to heuristically edit classification rules and rule sets. In this paper we first adapt some of these operators to BioHEL, a different evolutionary learning system applying the iterative learning approach, and then propose versions of these operators designed for continuous attributes and for dealing with noise. The performance of all these operators and their combinations is extensively evaluated on a broad range of synthetic large-scale datasets to identify the settings that present the best balance between efficiency and accuracy. Finally, the identified best configurations are compared with other classes of machine learning methods on both synthetic and real-world large-scale datasets and show very competitive performance.

8.
This paper investigates the performance of evolutionary algorithms in the optimization aspects of oblique decision tree construction and describes their performance with respect to classification accuracy, tree size, and Pareto-optimality of their solution sets. The performance of the evolutionary algorithms is analyzed and compared to the performance of exhaustive (traditional) decision tree classifiers on several benchmark datasets. The results show that the classification accuracy and tree sizes generated by the evolutionary algorithms are comparable with the results generated by traditional methods on all the sample datasets; on the large datasets, the multiobjective evolutionary algorithms generate better Pareto-optimal sets than those generated by the exhaustive methods. The results also show that a classifier, whether exhaustive or evolutionary, that generates the most accurate trees does not necessarily generate the shortest trees or the best Pareto-optimal sets.

9.
In several application domains such as biology, computer vision, social network analysis and information retrieval, multi-class classification problems arise in which data instances do not simply belong to one particular class, but exhibit partial membership in several classes. Existing machine learning or fuzzy set approaches for representing this type of fuzzy information mainly focus on unsupervised methods. In contrast, we present in this article supervised learning algorithms for classification problems with partial class memberships, where class memberships instead of crisp class labels serve as input for fitting a model to the data. Using kernel logistic regression (KLR) as a baseline method, first a basic one-versus-all approach is proposed, replacing the binary-coded label vectors with [0,1]-valued class memberships in the likelihood. Subsequently, we use this KLR extension as base classifier to construct one-versus-one decompositions, in which partial class memberships are transformed and estimated in a pairwise manner. Empirical results on synthetic data and a real-world application in bioinformatics confirm that our approach delivers promising results. The one-versus-all method yields the best computational efficiency, while the one-versus-one methods are preferred in terms of predictive performance, especially when the observed class memberships are heavily unbalanced.
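The core likelihood modification can be sketched as follows: a logistic model fitted by gradient descent on a cross-entropy in which [0,1]-valued class memberships replace the binary labels. The paper's classifiers are kernelised (KLR); this plain linear version, with invented synthetic memberships, only illustrates the soft-label likelihood.

```python
import numpy as np

def fit_soft_logistic(X, memberships, lr=0.5, n_iter=2000):
    """Logistic regression with soft targets in [0,1] instead of crisp labels."""
    X = np.column_stack([np.ones(len(X)), X])        # add intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))             # predicted membership
        w -= lr * X.T @ (p - memberships) / len(X)   # gradient of soft cross-entropy
    return w

# toy usage: class memberships in [0,1] rather than crisp labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
memberships = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))  # synthetic soft labels
print("fitted coefficients:", fit_soft_logistic(X, memberships))
```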

10.
Retail credit models are implemented using discrete survival analysis, enabling macroeconomic conditions to be included as time-varying covariates. As a consequence, these models can be used to estimate changes in the probability of default under downturn economic scenarios. Compared with traditional models, we offer improved methodologies for scenario generation and for using these scenarios to predict default rates. Monte Carlo simulation is used to generate a distribution of estimated default rates, from which Value at Risk and Expected Shortfall are computed as a means of stress testing. Several macroeconomic variables are considered, and in particular factor analysis is employed to model the structure among these variables. Two large UK data sets are used to test this approach, resulting in plausible dynamic models and stress test outcomes.
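The stress-testing step can be illustrated with a minimal sketch: simulate a distribution of portfolio default rates under random macroeconomic scenarios and read off Value at Risk and Expected Shortfall. The one-factor logistic link and its parameters below are placeholders, not the paper's discrete survival model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_scenarios, baseline_logit, macro_sensitivity = 100_000, -3.0, 0.8

# simulated macroeconomic scenarios mapped to default rates (toy link function)
macro_factor = rng.standard_normal(n_scenarios)
default_rate = 1.0 / (1.0 + np.exp(-(baseline_logit + macro_sensitivity * macro_factor)))

level = 0.99
var = np.quantile(default_rate, level)               # Value at Risk of the default rate
es = default_rate[default_rate >= var].mean()        # Expected Shortfall beyond the VaR
print(f"99% VaR of default rate: {var:.3%}, ES: {es:.3%}")
```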

11.
Combining multiple classifiers, known as ensemble methods, can give a substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, built for the classification task in two steps. First, we choose classifiers based upon their individual performance using out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets with their original and some added non-informative features to evaluate our method. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
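A hedged sketch of the two-step idea follows: (1) rank kNN models built on random feature subsets by out-of-sample accuracy, (2) add them to the ensemble, best first, keeping only additions that do not hurt validation accuracy. The subset size, number of candidates, and majority-vote scheme are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1500, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_sel, X_val, y_sel, y_val = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                              random_state=0)

candidates = []
for _ in range(50):                                  # step 1: rank candidate kNN models
    feats = rng.choice(X.shape[1], size=10, replace=False)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, feats], y_tr)
    acc = accuracy_score(y_sel, knn.predict(X_sel[:, feats]))
    candidates.append((acc, feats, knn))
candidates.sort(key=lambda c: c[0], reverse=True)

ensemble, best_val = [], 0.0
for acc, feats, knn in candidates:                   # step 2: sequential selection
    trial = ensemble + [(feats, knn)]
    votes = np.mean([m.predict(X_val[:, f]) for f, m in trial], axis=0) >= 0.5
    val_acc = accuracy_score(y_val, votes.astype(int))
    if val_acc >= best_val:                          # keep only non-harmful additions
        ensemble, best_val = trial, val_acc
print(f"selected {len(ensemble)} kNN models, validation accuracy = {best_val:.3f}")
```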

12.
The 2004 Basel II Accord has pointed out the benefits of credit risk management through internal models using internal data to estimate risk components: probability of default (PD), loss given default, exposure at default and maturity. Internal data are the primary data source for PD estimates; banks are permitted to use statistical default prediction models to estimate the borrowers' PD, subject to some requirements concerning accuracy, completeness and appropriateness of data. However, in practice, internal records are usually incomplete or do not contain adequate history to estimate the PD. Missing data are particularly critical for low-default portfolios, which are characterised by inadequate default records, making it difficult to design statistically significant prediction models. Several methods can be used to deal with missing data, such as list-wise deletion, application-specific list-wise deletion, substitution techniques or imputation models (simple and multiple variants). List-wise deletion is an easy-to-use method widely applied by social scientists, but it loses substantial data and reduces the diversity of information, resulting in biased model parameters, results and inferences. The choice of the best method to solve the missing data problem largely depends on the nature of the missing values (MCAR, MAR and MNAR processes), but there is a lack of empirical analysis of their effect on credit risk, which limits the validity of the resulting models. In this paper, we analyse the nature and effects of missing data in credit risk modelling (MCAR, MAR and MNAR processes) using a scarce data set on consumer borrowers that includes different percentages and distributions of missing data. The findings are used to analyse the performance of several methods for dealing with missing data, such as list-wise deletion, simple imputation methods, maximum-likelihood (MLE) models and advanced multiple imputation (MI) alternatives based on Markov chain Monte Carlo and re-sampling methods. Results are evaluated and compared across models in terms of robustness, accuracy and complexity. In particular, MI models are found to provide very valuable solutions for missing data in credit risk modelling.
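Three of the strategies mentioned above can be contrasted in a short, hedged sketch on artificially MCAR-masked data: list-wise deletion, simple (mean) imputation, and an iterative, model-based imputer standing in for multiple imputation. The data, missingness rate, and imputers are illustrative, not the paper's experimental design.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))
masked = data.mask(rng.random(data.shape) < 0.2)     # ~20% values missing completely at random
missing = masked.isna().values

listwise = masked.dropna()                           # list-wise deletion: rows lost
simple = SimpleImputer(strategy="mean").fit_transform(masked)
iterative = IterativeImputer(random_state=1).fit_transform(masked)

print(f"rows kept by list-wise deletion: {len(listwise)} of {len(masked)}")
print(f"mean abs. error of mean imputation:      "
      f"{np.abs(simple - data.values)[missing].mean():.3f}")
print(f"mean abs. error of iterative imputation: "
      f"{np.abs(iterative - data.values)[missing].mean():.3f}")
```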

13.
The aim of this article is to develop a supervised dimension-reduction framework, called spatially weighted principal component analysis (SWPCA), for high-dimensional imaging classification. Two main challenges in imaging classification are the high dimensionality of the feature space and the complex spatial structure of imaging data. In SWPCA, we introduce two sets of novel weights, including global and local spatial weights, which enable a selective treatment of individual features and incorporation of the spatial structure of imaging data and class label information. We develop an efficient two-stage iterative SWPCA algorithm and its penalized version along with the associated weight determination. We use both simulation studies and real data analysis to evaluate the finite-sample performance of our SWPCA. The results show that SWPCA outperforms several competing principal component analysis (PCA) methods, such as supervised PCA (SPCA), and other competing methods, such as sparse discriminant analysis (SDA).

14.
In the domain of data preparation for supervised classification, filter methods for variable ranking are time efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information of a pair of input variables. It is based on a simultaneous partitioning of the domains of each input variable, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid allows the joint information between the two input variables and the output variable to be quantified. The best joint partitioning is found by maximizing a Bayesian model selection criterion. Intensive experiments demonstrate the benefits of the approach, especially the significant improvement in accuracy for classification tasks.
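The quantity the bivariate grid is meant to capture can be illustrated with a small sketch: discretise two numeric inputs into a joint grid of cells and measure how much information the cell index carries about the class label. The actual method selects the partition by a Bayesian model-selection criterion rather than the fixed quartile bins used here, and the XOR-like data are invented to show an interaction that univariate filters miss.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=2000), rng.normal(size=2000)
y = (x1 * x2 > 0).astype(int)                        # class depends only on the interaction

b1 = np.digitize(x1, np.quantile(x1, [0.25, 0.5, 0.75]))  # univariate quartile bins
b2 = np.digitize(x2, np.quantile(x2, [0.25, 0.5, 0.75]))
grid = 4 * b1 + b2                                   # joint grid cell index

# univariate filters see almost no information; the joint grid sees a lot
print("MI(x1 bins ; y) =", round(mutual_info_score(b1, y), 4))
print("MI(x2 bins ; y) =", round(mutual_info_score(b2, y), 4))
print("MI(grid ; y)    =", round(mutual_info_score(grid, y), 4))
```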

15.
Behavioural scoring models are generally used to estimate the probability that a customer of a financial institution who owns a credit product will default on this product within a fixed time horizon. However, a single customer usually purchases several credit products from an institution, while behavioural scoring models generally treat each of these products independently. In order to make credit risk management easier and more efficient, it is of interest to develop customer default scoring models. These models estimate the probability that a customer of a financial institution will have credit issues with at least one product within a fixed time horizon. In this study, three strategies for developing customer default scoring models are described. One of the strategies is regularly used by financial institutions, and the other two are proposed herein. The performance of these strategies is compared by means of a real data set supplied by a financial institution and a Monte Carlo simulation study.

16.
The core of modern credit risk modelling is the estimation of the probability of default (PD), and the accuracy of PD estimates directly affects the quality of credit risk models. Among the many approaches to PD estimation, statistical methods such as the frequency approach and logistic regression are widely used; these statistical models rest on large samples and objectively require a minimum or optimal amount of default data, whereas a low-default portfolio (LDP) is a portfolio with very few or even no default observations. How to estimate the PD of an LDP and capture the unexpected volatility of the default rate is therefore a practical problem worth attention. Addressing the lack of sufficient historical default data for bank-loan LDPs, this paper uses a Bayesian approach to estimate the PD of LDPs and further discusses how to determine the prior distribution from expert judgement or from the historical default counts of comparable banks' LDPs. In Bayesian estimation, a properly specified prior distribution not only makes the PD estimate scientifically sound and reasonable but also reflects the unexpected volatility of defaults, helping banks implement prudent and robust risk management.
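A minimal sketch of the prior-setting idea described above: fit a Beta prior by moment matching to default rates observed at comparable banks' LDPs, then update it with the bank's own scarce default record. The peer figures and portfolio size below are invented for illustration, and the conjugate Beta-Binomial update is a simplification of the paper's framework.

```python
import numpy as np
from scipy import stats

peer_rates = np.array([0.001, 0.003, 0.002, 0.004, 0.0015])  # peer-bank LDP default rates
m, v = peer_rates.mean(), peer_rates.var()
a_prior = m * (m * (1 - m) / v - 1)                           # Beta prior by moment matching
b_prior = (1 - m) * (m * (1 - m) / v - 1)

own_defaults, own_obligors = 0, 300                           # bank's own scarce LDP history
a_post = a_prior + own_defaults                               # conjugate Beta-Binomial update
b_post = b_prior + own_obligors - own_defaults

print(f"posterior mean PD = {a_post / (a_post + b_post):.4%}")
print("95% credible interval:", stats.beta.interval(0.95, a_post, b_post))
```

Even with zero observed defaults in the bank's own records, the peer-informed prior keeps the PD estimate strictly positive and supplies an interval reflecting unexpected volatility, which is the prudence argument made in the abstract.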

17.
RNA-sample pooling is sometimes inevitable, but should be avoided in classification tasks like biomarker studies. Our simulation framework investigates a two-class classification study based on gene expression profiles to show how strongly the outcomes of single-sample designs differ from those of pooling designs. The results show how the effects of pooling depend on pool size, the discriminating pattern, the number of informative features and the statistical learning method used (support vector machines with linear and radial kernels, random forest (RF), linear discriminant analysis, powered partial least squares discriminant analysis (PPLS-DA) and partial least squares discriminant analysis (PLS-DA)). As measures of the pooling effect, we consider the prediction error (PE) and the coincidence of important feature sets for classification based on PLS-DA, PPLS-DA and RF. In general, PPLS-DA and PLS-DA show constant PE with increasing pool size and low PE for patterns for which the convex hull of one class is not a cover of the other class. The coincidence of important feature sets is larger for PLS-DA and PPLS-DA than it is for RF. RF shows the best results for patterns in which the convex hull of one class is a cover of the other class, but these results depend strongly on the pool size. We complement the PE results with experimental data which we pool artificially. The PEs of PPLS-DA and PLS-DA are again least influenced by pooling and remain low. Additionally, we show under which assumption the PLS-DA loading weights, as a measure of feature importance for classification, are equal for the different designs.

18.
Historical sales records of enterprises are a fundamental data source for supply-chain optimisation research. In everyday research practice, however, almost all sales records obtainable through public channels are highly incomplete, which causes great inconvenience for researchers. To solve this problem, this paper proposes to take the data already present in a sales data set as a basis and to estimate the missing parts with MAFTIS, a matrix factorisation model oriented towards time-series data, thereby completing the incomplete data set. Furthermore, to improve the computational efficiency of MAFTIS, a solution strategy based on alternating least squares, MAFTISALS, is designed for the model. In the evaluation experiments, MAFTISALS is used to estimate the missing records of three real sales data sets; the results show that, compared with other estimation models, MAFTISALS obtains more accurate estimates and converges faster.
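A generic alternating-least-squares matrix-completion sketch illustrates the idea behind MAFTIS/MAFTISALS: factor the incomplete sales matrix as U Vᵀ using only the observed entries, then read estimates of the missing cells from the reconstruction. The temporal component of the actual model is not reproduced here, and the rank, regularisation, and toy data are illustrative assumptions.

```python
import numpy as np

def als_complete(R, mask, rank=3, lam=0.1, n_iter=50, seed=0):
    """Complete matrix R (observed where mask is True) by regularised ALS."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(m, rank))
    eye = lam * np.eye(rank)
    for _ in range(n_iter):
        for i in range(n):                           # update row factors on observed cells
            j = mask[i]
            U[i] = np.linalg.solve(V[j].T @ V[j] + eye, V[j].T @ R[i, j])
        for k in range(m):                           # update column factors on observed cells
            i = mask[:, k]
            V[k] = np.linalg.solve(U[i].T @ U[i] + eye, U[i].T @ R[i, k])
    return U @ V.T                                   # completed matrix estimate

# toy usage: a low-rank "item x week" sales matrix with ~30% of entries missing
rng = np.random.default_rng(1)
truth = rng.poisson(20, size=(40, 3)) @ rng.random(size=(3, 26))
mask = rng.random(truth.shape) > 0.3                 # True where the entry is observed
estimate = als_complete(np.where(mask, truth, 0.0), mask)
print("mean abs. error on missing cells:", np.abs(estimate - truth)[~mask].mean())
```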

19.
A classification method comprising the Fuzzy C-Means method, a modified form of the Huang-index function and Variable Precision Rough Set (VPRS) theory is proposed for classifying labeled/unlabeled data sets in this study. The proposed method, designated the MVPRS-index method, is used to partition the values of each conditional attribute within the data set and to achieve both the optimal number of clusters and the optimal accuracy of VPRS classification. The validity of the proposed approach is confirmed by comparing the classification results obtained from the MVPRS-index method for UCI data sets and a typical stock market data set with those obtained from the supervised neural network classification method. Overall, the results show that the MVPRS-index method can be applied to data sets with labeled as well as unlabeled information, and therefore provides a more reliable basis for the extraction of decision-making rules from labeled/unlabeled datasets.

20.
High-dimensional data are increasingly used in many real-world applications. Such data are obtained from different feature extractors, which represent distinct perspectives of the data. How to classify such data efficiently is a challenge. Although millions of unlabeled data samples may exist, labeling a handful of them, as in the semi-supervised scheme, is believed to improve performance remarkably. However, the performance of semi-supervised data classification depends heavily on the proposed models and the related numerical methods. Following the extension of the Mumford–Shah–Potts-type model in the spatially continuous setting, we propose efficient data classification algorithms based on the alternating direction method of multipliers and the primal-dual method to deal efficiently with the nonsmooth terms in the proposed model. The convergence of the proposed data classification algorithms is established within the framework of variational inequalities. Some balanced and unbalanced classification problems are tested, demonstrating the efficiency of the proposed algorithms.
