首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 419 毫秒
1.
Combining multiple classifiers, known as ensemble methods, can give substantial improvement in prediction performance of learning algorithms especially in the presence of non-informative features in the data sets. We propose an ensemble of subset of kNN classifiers, ESkNN, for classification task in two steps. Firstly, we choose classifiers based upon their individual performance using the out-of-sample accuracy. The selected classifiers are then combined sequentially starting from the best model and assessed for collective performance on a validation data set. We use bench mark data sets with their original and some added non-informative features for the evaluation of our method. The results are compared with usual kNN, bagged kNN, random kNN, multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparable to random forest and support vector machines.  相似文献   

2.
Multi-objective evolutionary algorithms for the construction of neural ensembles is a relatively new area of research. We recently proposed an ensemble learning algorithm called DIVACE (DIVerse and ACcurate Ensemble learning algorithm). It was shown that DIVACE tries to find an optimal trade-off between diversity and accuracy as it searches for an ensemble for some particular pattern recognition task by treating these two objectives explicitly separately. A detailed discussion of DIVACE together with further experimental studies form the essence of this paper. A new diversity measure which we call Pairwise Failure Crediting (PFC) is proposed. This measure forms one of the two evolutionary pressures being exerted explicitly in DIVACE. Experiments with this diversity measure as well as comparisons with previously studied approaches are hence considered. Detailed analysis of the results show that DIVACE, as a concept, has promise.  相似文献   

3.
在统计学与机器学习中,交叉验证被广泛应用于评估模型的好坏.但交叉验证法的表现一般不稳定,因此评估时通常需要进行多次交叉验证并通过求均值以提高交叉验证算法的稳定性.文章提出了一种基于空间填充准则改进的k折交叉验证方法,它的思想是每一次划分的训练集和测试集均具有较好的均匀性.模拟结果表明,文章所提方法在五种分类模型(k近邻,决策树,随机森林,支持向量机和Adaboost)上对预测精度的估计均比普通k折交叉验证的高.将所提方法应用于骨质疏松实际数据分析中,根据对预测精度的估计选择了最优的模型进行骨质疏松患者的分类预测.  相似文献   

4.
本文给出了集成学习模型可以收敛的集成学习算法,拟自适应分类随机森林算法。拟自适应分类随机森林算法综合了Adaboost算法和随机森林算法的优势,实验数据分析表明,训练集较大时,拟自适应随机森林算法的效果会好于随机森林算法。另外,拟自适应分类随机森林算法的收敛性确保它的推广误差可以通过训练集估计,所以,对于实际数据,拟自适应分类随机森林算法不需要把数据划分为训练集和测试集,从而,可以有效的利用数据信息。  相似文献   

5.
Canonical Forest     
We propose a new classification ensemble method named Canonical Forest. The new method uses canonical linear discriminant analysis (CLDA) and bootstrapping to obtain accurate and diverse classifiers that constitute an ensemble. We note CLDA serves as a linear transformation tool rather than a dimension reduction tool. Since CLDA will find the transformed space that separates the classes farther in distribution, classifiers built on this space will be more accurate than those on the original space. To further facilitate the diversity of the classifiers in an ensemble, CLDA is applied only on a partial feature space for each bootstrapped data. To compare the performance of Canonical Forest and other widely used ensemble methods, we tested them on 29 real or artificial data sets. Canonical Forest performed significantly better in accuracy than other ensemble methods in most data sets. According to the investigation on the bias and variance decomposition, the success of Canonical Forest can be attributed to the variance reduction.  相似文献   

6.
In this paper, we propose two anomaly detection algorithms PAV and MPAV on time series. The first basic idea of this paper defines that the anomaly pattern is the most infrequent time series pattern, which is the lowest support pattern. The second basic idea of this paper is that PAV detects directly anomalies in the original time series, and MPAV algorithm extraction anomaly in the wavelet approximation coefficient of the time series. For complexity analyses, as the wavelet transform have the functions to compress data, filter noise, and maintain the basic form of time series, the MPAV algorithm, while maintaining the accuracy of the algorithm improves the efficiency. As PAV and MPAV algorithms are simple and easy to realize without training, this proposed multi-scale anomaly detection algorithm based on infrequent pattern of time series can therefore be proved to be very useful for computer science applications.  相似文献   

7.
蒋翠清  梁坤  丁勇  段锐 《运筹与管理》2017,26(2):135-139
网络借贷环境下基于Adaboost的信用评价方法具有较高的基分类器分歧度和样本误分代价。现有研究没有考虑分歧度和误分代价对基分类器样本权重的影响,从而降低了网络借贷信用评价结果的有效性。为此,提出一种基于改进Adaboost的信用评价方法。该方法根据基分类器的误分率,样本在不同基分类器上分类结果的分歧程度,以及样本的误分代价等因素,调整Adaboost模型的样本赋权策略,使得改进后的Adaboost模型能够对分类困难样本和误分代价高的样本实施有针对性的学习,从而提高网络借贷信用评价结果的有效性。基于拍拍贷平台数据的实验结果表明,提出的方法在分类精度和误分代价等方面显著优于传统的基于Adaboost的信用评价方法。  相似文献   

8.
The performance of kernel-based method, such as support vector machine (SVM), is greatly affected by the choice of kernel function. Multiple kernel learning (MKL) is a promising family of machine learning algorithms and has attracted many attentions in recent years. MKL combines multiple sub-kernels to seek better results compared to single kernel learning. In order to improve the efficiency of SVM and MKL, in this paper, the Kullback–Leibler kernel function is derived to develop SVM. The proposed method employs an improved ensemble learning framework, named KLMKB, which applies Adaboost to learning multiple kernel-based classifier. In the experiment for hyperspectral remote sensing image classification, we employ feature selected through Optional Index Factor (OIF) to classify the satellite image. We extensively examine the performance of our approach in comparison to some relevant and state-of-the-art algorithms on a number of benchmark classification data sets and hyperspectral remote sensing image data set. Experimental results show that our method has a stable behavior and a noticeable accuracy for different data set.  相似文献   

9.
Ensemble classification techniques such as bagging, (Breiman, 1996a), boosting (Freund & Schapire, 1997) and arcing algorithms (Breiman, 1997) have received much attention in recent literature. Such techniques have been shown to lead to reduced classification error on unseen cases. Even when the ensemble is trained well beyond zero training set error, the ensemble continues to exhibit improved classification error on unseen cases. Despite many studies and conjectures, the reasons behind this improved performance and understanding of the underlying probabilistic structures remain open and challenging problems. More recently, diagnostics such as edge and margin (Breiman, 1997; Freund & Schapire, 1997; Schapire et al., 1998) have been used to explain the improvements made when ensemble classifiers are built. This paper presents some interesting results from an empirical study performed on a set of representative datasets using the decision tree learner C4.5 (Quinlan, 1993). An exponential-like decay in the variance of the edge is observed as the number of boosting trials is increased. i.e. boosting appears to “homogenise ”the edge. Some initial theory is presented which indicates that a lack of correlation between the errors of individual classifiers is a key factor in this variance reduction.  相似文献   

10.
This paper develops an efficient particle tracking algorithm to be used in fluid simulations approximated by a high-order multidomain discretization of the Navier–Stokes equations. We discuss how to locate a particle's host subdomain, how to interpolate the flow field to its location, and how to integrate its motion in time. A search algorithm for the nearest subdomain and quadrature point, tuned to a typical quadrilateral isoparametric spectral subdomain, takes advantage of the inverse of the linear blending equation. We show that to compute particle-laden flows, a sixth-order Lagrangian polynomial that uses points solely within a subdomain is sufficiently accurate to interpolate the carrier phase variables to the particle position. Time integration of particles with a lower-order Adams–Bashforth scheme, rather than the fourth-order Runge–Kutta scheme often used for the integration of the carrier phase, increases computational efficiency while maintaining engineering accuracy. We verify the tracking algorithm with numerical tests on a steady channel flow and an unsteady backward-facing step flow.  相似文献   

11.
Cancer classification using genomic data is one of the major research areas in the medical field. Therefore, a number of binary classification methods have been proposed in recent years. Top Scoring Pair (TSP) method is one of the most promising techniques that classify genomic data in a lower dimensional subspace using a simple decision rule. In the present paper, we propose a supervised classification technique that utilizes incremental generalized eigenvalue and top scoring pair classifiers to obtain higher classification accuracy with a small training set. We validate our method by applying it to well known microarray data sets.  相似文献   

12.
受推荐系统在电子商务领域重大经济利益的驱动,恶意用户以非法牟利为目的实施托攻击,操纵改变推荐结果,使推荐系统面临严峻的信息安全威胁,如何识别和检测托攻击成为保障推荐系统信息安全的关键。传统支持向量机(SVM)方法同时受到小样本和数据不均衡两个问题的制约。为此,提出一种半监督SVM和非对称集成策略相结合的托攻击检测方法。首先训练初始SVM,然后引入K最近邻法优化分类面附近样本的标记质量,利用标记数据和未标记数据的混合样本集减少对标记数据的需求。最后,设计一种非对称加权集成策略,重点关注攻击样本的分类准确率,降低集成分类器对数据不均衡的敏感性。实验结果表明,本文方法有效地解决了小样本问题和数据不均衡分布问题,获得了较好的检测效果。  相似文献   

13.
Many simple and complex methods have been developed to solve the classification problem. Boosting is one of the best known techniques for improving the accuracy of classifiers. However, boosting is prone to overfitting with noisy data and the final model is difficult to interpret. Some boosting methods, including AdaBoost, are also very sensitive to outliers. In this article we propose a new method, GA-Ensemble, which directly solves for the set of weak classifiers and their associated weights using a genetic algorithm. The genetic algorithm utilizes a new penalized fitness function that limits the number of weak classifiers and controls the effects of outliers by maximizing an appropriately chosen $p$ th percentile of margins. We compare the test set error rates of GA-Ensemble, AdaBoost, and GentleBoost (an outlier-resistant version of AdaBoost) using several artificial data sets and real-world data sets from the UC-Irvine Machine Learning Repository. GA-Ensemble is found to be more resistant to outliers and results in simpler predictive models than AdaBoost and GentleBoost.  相似文献   

14.
Bayesian networks (BNs) provide a powerful graphical model for encoding the probabilistic relationships among a set of variables, and hence can naturally be used for classification. However, Bayesian network classifiers (BNCs) learned in the common way using likelihood scores usually tend to achieve only mediocre classification accuracy because these scores are less specific to classification, but rather suit a general inference problem. We propose risk minimization by cross validation (RMCV) using the 0/1 loss function, which is a classification-oriented score for unrestricted BNCs. RMCV is an extension of classification-oriented scores commonly used in learning restricted BNCs and non-BN classifiers. Using small real and synthetic problems, allowing for learning all possible graphs, we empirically demonstrate RMCV superiority to marginal and class-conditional likelihood-based scores with respect to classification accuracy. Experiments using twenty-two real-world datasets show that BNCs learned using an RMCV-based algorithm significantly outperform the naive Bayesian classifier (NBC), tree augmented NBC (TAN), and other BNCs learned using marginal or conditional likelihood scores and are on par with non-BN state of the art classifiers, such as support vector machine, neural network, and classification tree. These experiments also show that an optimized version of RMCV is faster than all unrestricted BNCs and comparable with the neural network with respect to run-time. The main conclusion from our experiments is that unrestricted BNCs, when learned properly, can be a good alternative to restricted BNCs and traditional machine-learning classifiers with respect to both accuracy and efficiency.  相似文献   

15.
In this paper, a stochastic gradient descent algorithm is proposed for the binary classification problems based on general convex loss functions. It has computational superiority over the existing algorithms when the sample size is large. Under some reasonable assumptions on the hypothesis space and the underlying distribution, the learning rate of the algorithm has been established, which is faster than that of closely related algorithms.  相似文献   

16.
One issue in data classification problems is to find an optimal subset of instances to train a classifier. Training sets that represent well the characteristics of each class have better chances to build a successful predictor. There are cases where data are redundant or take large amounts of computing time in the learning process. To overcome this issue, instance selection techniques have been proposed. These techniques remove examples from the data set so that classifiers are built faster and, in some cases, with better accuracy. Some of these techniques are based on nearest neighbors, ordered removal, random sampling and evolutionary methods. The weaknesses of these methods generally involve lack of accuracy, overfitting, lack of robustness when the data set size increases and high complexity. This work proposes a simple and fast immune-inspired suppressive algorithm for instance selection, called SeleSup. According to self-regulation mechanisms, those cells unable to neutralize danger tend to disappear from the organism. Therefore, by analogy, data not relevant to the learning of a classifier are eliminated from the training process. The proposed method was compared with three important instance selection algorithms on a number of data sets. The experiments showed that our mechanism substantially reduces the data set size and is accurate and robust, specially on larger data sets.  相似文献   

17.
In this study, we present Incremental Learning and Decremented Characterization of Regularized Generalized Eigenvalue Classification (ILDC-ReGEC), a novel algorithm to train a generalized eigenvalue classifier with a substantially smaller subset of points and features of the original data. The proposed method provides a constructive way to understand the influence of new training data on an existing classification model and the grouping of features that determine the class of samples. We show through numerical experiments that this technique has comparable accuracy with respect to other methods. Furthermore, experiments show that it is possible to obtain a classification model with about 30% of the training samples and less then 5% of the initial features. Matlab implementation of the ILDC-ReGEC algorithm is freely available from the authors. Research partially supported by NSF and Air Force grants.  相似文献   

18.
神经网络集成技术能有效地提高神经网络的预测精度和泛化能力,已经成为机器学习和神经计算领域的一个研究热点.利用Bagging技术和不同的神经网络算法生成集成个体,并用偏最小二乘回归方法从中提取集成因子,再利用贝叶斯正则化神经网络对其集成,以此建立上证指数预测模型.通过上证指数开、收盘价进行实例分析,计算结果表明该方法预测精度高、稳定性好.  相似文献   

19.
Multi-label classification problems require each instance to be assigned a subset of a defined set of labels. This problem is equivalent to finding a multi-valued decision function that predicts a vector of binary classes. In this paper we study the decision boundaries of two widely used approaches for building multi-label classifiers, when Bayesian network-augmented naive Bayes classifiers are used as base models: Binary relevance method and chain classifiers. In particular extending previous single-label results to multi-label chain classifiers, we find polynomial expressions for the multi-valued decision functions associated with these methods. We prove upper boundings on the expressive power of both methods and we prove that chain classifiers provide a more expressive model than the binary relevance method.  相似文献   

20.
The logistic regression framework has been for long time the most used statistical method when assessing customer credit risk. Recently, a more pragmatic approach has been adopted, where the first issue is credit risk prediction, instead of explanation. In this context, several classification techniques have been shown to perform well on credit scoring, such as support vector machines among others. While the investigation of better classifiers is an important research topic, the specific methodology chosen in real world applications has to deal with the challenges arising from the real world data collected in the industry. Such data are often highly unbalanced, part of the information can be missing and some common hypotheses, such as the i.i.d. one, can be violated. In this paper we present a case study based on a sample of IBM Italian customers, which presents all the challenges mentioned above. The main objective is to build and validate robust models, able to handle missing information, class unbalancedness and non-iid data points. We define a missing data imputation method and propose the use of an ensemble classification technique, subagging, particularly suitable for highly unbalanced data, such as credit scoring data. Both the imputation and subagging steps are embedded in a customized cross-validation loop, which handles dependencies between different credit requests. The methodology has been applied using several classifiers (kernel support vector machines, nearest neighbors, decision trees, Adaboost) and their subagged versions. The use of subagging improves the performance of the base classifier and we will show that subagging decision trees achieve better performance, still keeping the model simple and reasonably interpretable.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号