Similar literature (20 results)
1.
Combining multiple classifiers, known as ensemble methods, can substantially improve the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based on their individual performance, measured by out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We evaluate the method on benchmark data sets, both with their original features and with some added non-informative features. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
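
The two-step selection described in this abstract can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the majority-vote combiner, and the rule of keeping a model whenever validation accuracy does not drop are all hypothetical, and each candidate model is represented only by its predictions on the validation set.

```python
from collections import Counter


def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def majority_vote(prediction_lists):
    """Combine per-model predictions by majority vote.

    Ties go to the label seen first, i.e. the label voted for by the
    better-ranked model, because models are passed in ranked order.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]


def select_ensemble(candidate_preds, individual_scores, y_val):
    """Greedy sequential selection (acceptance rule is an assumption).

    candidate_preds[i]   : predictions of model i on the validation set
    individual_scores[i] : out-of-sample accuracy of model i on its own
    """
    # Step 1: rank models by individual out-of-sample accuracy, best first.
    order = sorted(range(len(candidate_preds)),
                   key=lambda i: individual_scores[i], reverse=True)
    chosen = [order[0]]
    best = accuracy(candidate_preds[order[0]], y_val)
    # Step 2: add models one at a time; keep a model only if the
    # collective validation accuracy does not get worse.
    for i in order[1:]:
        trial = chosen + [i]
        score = accuracy(majority_vote([candidate_preds[j] for j in trial]), y_val)
        if score >= best:
            chosen, best = trial, score
    return chosen, best
```

In this sketch a weak model can still be kept if it breaks ties usefully, while a model that lowers the collective validation accuracy is discarded.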

2.
Based on the AdaBoost algorithm, a modified boosting method is proposed in this paper for solving classification problems. This method predicts the class label of an example as the weighted majority vote of an ensemble of classifiers. Each classifier is obtained by applying a given weak learner to a subsample (smaller than the original training set) which is drawn from the original training set according to the probability distribution maintained over the training set. A parameter is introduced into the reweighting scheme of AdaBoost to update the probabilities assigned to training examples so that the algorithm can be more accurate than AdaBoost. The experimental results on synthetic data and on several real-world data sets available from the UCI repository show that the proposed method improves on AdaBoost in prediction accuracy, execution speed and robustness to classification noise. Furthermore, the diversity–accuracy patterns of the ensemble classifiers are investigated by kappa–error diagrams.

3.
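Stock time series forecasting has important application prospects in economics and management, and it underpins the success of many commercial and financial institutions. First, singular spectrum analysis is used to reconstruct the stock market time series, reducing noise and extracting the trend series. The C-C algorithm is then used to determine the embedding dimension and delay of the series, the phase space of the series is reconstructed, and the learning matrix for the neural networks is generated. Next, Boosting and different neural network models are used to generate the individual members of the neural network ensemble. Finally, a semiparametric regression model with a penalty term is used to combine them, with a genetic algorithm selecting the optimal smoothing parameter, which yields a neural network ensemble model for stock market forecasting based on a genetic algorithm and semiparametric regression. A case study on the opening price of the Shanghai Composite Index, compared with traditional time series analysis and other ensemble methods, shows that the method gives more accurate forecasts. The computational results indicate that the method fully captures the trend of stock price time series and provides an effective approach to financial time series forecasting.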

4.
Classification is a main data mining task, which aims at predicting the class label of new input data on the basis of a set of pre-classified samples. Multiple criteria linear programming (MCLP) is used as a classification method in the data mining area; it can separate two or more classes by finding a discriminant hyperplane. Although MCLP performs well on linearly separable data, it is no longer applicable when faced with nonlinearly separable problems. A kernel-based multiple criteria linear programming (KMCLP) model has been developed to solve such problems. In this method, a kernel function is introduced to project the data into a higher-dimensional space in which the data are more likely to be linearly separable. KMCLP performs well in some real applications. However, like other prevalent data mining classifiers, MCLP and KMCLP learn only from training examples. In the traditional machine learning area, there are also classification tasks in which data sets are classified only by prior knowledge, i.e. expert systems. Some works combine these two classification principles to overcome the faults of each approach. In this paper, we present our recent work on combining prior knowledge with the MCLP or KMCLP model to solve problems in which the input consists not only of training examples but also of prior knowledge. Specifically, how to deal with linear and nonlinear knowledge in MCLP and KMCLP models is the main concern of this paper. Numerical tests indicate that these models are effective in classifying data with prior knowledge.

5.
Dealing with missing values is an important topic in data mining. Moreover, because of the properties of compositional data, traditional imputation methods may give undesirable results when applied directly to this type of data. The management of missing values in compositional data is therefore of great significance. To address this problem, this paper uses the relationship between compositional data and Euclidean data and proposes a new Random Forest-based method for imputing missing values in compositional data. The method has been implemented and evaluated on both simulated and real-world databases; the experimental results indicate that the new imputation method can be used on various types of data sets and performs better than other methods.

6.
In this paper, an algorithm for finding piecewise linear boundaries between pattern classes is developed. This algorithm consists of two main stages. In the first stage, a polyhedral conic set is used to identify data points which lie inside their classes, and in the second stage we exclude those points and compute a piecewise linear boundary using the remaining data points. Piecewise linear boundaries are computed incrementally, starting with one hyperplane. Such an approach significantly reduces the computational effort on many large data sets. Results of numerical experiments are reported. These results demonstrate that the new algorithm consistently achieves good test set accuracy on most data sets compared with a number of other mainstream classifiers.

7.
New challenges in knowledge extraction include interpreting and classifying data sets while simultaneously considering related information to confirm results or identify false positives. We discuss a data fusion algorithmic framework targeted at this problem. It includes separate base classifiers for each data type and a fusion method for combining the individual classifiers. The fusion method is an extension of current ensemble classification techniques and has the advantage of allowing data to remain in heterogeneous databases. In this paper, we focus on the applicability of such a framework to the protein phosphorylation prediction problem.

8.
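A nonlinear ensemble forecasting model based on ARIMA and LSSVM   Total citations: 1, self-citations: 0, citations by others: 1
To address the difficulty of forecasting complex time series, and taking into account both linear and nonlinear characteristics, a nonlinear ensemble forecasting method based on ARIMA and the least squares support vector machine (LSSVM) is proposed. First, an ARIMA model is used to model the linear trend of the time series and to determine the input order for LSSVM modelling. Next, the time series samples are reconstructed according to this input order, and an LSSVM model is used to model the nonlinear characteristics of the series. Finally, an LSSVM-based nonlinear ensemble technique is used to form a combined forecast. When the method is applied to forecasting China's GDP, its forecasting accuracy is considerably higher than that of the individual forecasting methods and of other popular ensemble forecasting methods, which verifies the effectiveness and feasibility of the method.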

9.
In recent years, kernel-based methods have proved to be very successful for many real-world learning problems. One of the main reasons for this success is their efficiency on large data sets, which results from the fact that kernel methods like support vector machines (SVM) are based on a convex optimization problem. Solving a new learning problem can now often be reduced to the choice of an appropriate kernel function and kernel parameters. However, it can be shown that even the most powerful kernel methods can still fail on quite simple data sets in cases where the inherent feature space induced by the chosen kernel function is not sufficient. In these cases, an explicit feature space transformation or detection of latent variables has proved to be more successful. Since such an explicit feature construction is often not feasible for large data sets, the ultimate goal for efficient kernel learning would be the adaptive creation of new and appropriate kernel functions. It cannot, however, be guaranteed that such a kernel function still leads to a convex optimization problem for support vector machines. Therefore, we have to enhance the optimization core of the learning method itself before we can use it with arbitrary, i.e., non-positive semidefinite, kernel functions. This article motivates the usage of appropriate feature spaces and discusses the possible consequences leading to non-convex optimization problems. We will show that these new non-convex optimization SVMs are at least as accurate as their quadratic programming counterparts on eight real-world benchmark data sets in terms of generalization performance. They always outperform traditional approaches in terms of the original optimization problem. Additionally, the proposed algorithm is more generic than existing traditional solutions since it also works for non-positive semidefinite or indefinite kernel functions.

10.
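Driven by the substantial economic benefits of recommender systems in e-commerce, malicious users mount shilling attacks for illegal profit, manipulating recommendation results and exposing recommender systems to serious information security threats. Identifying and detecting shilling attacks has therefore become key to securing recommender systems. The traditional support vector machine (SVM) approach is constrained by two problems at once: small sample sizes and imbalanced data. To address this, a shilling attack detection method is proposed that combines a semi-supervised SVM with an asymmetric ensemble strategy. An initial SVM is trained first; the K-nearest-neighbour method is then introduced to improve the labelling quality of samples near the decision boundary, and a mixed sample set of labelled and unlabelled data is used to reduce the need for labelled data. Finally, an asymmetric weighted ensemble strategy is designed that focuses on the classification accuracy of attack samples and reduces the sensitivity of the ensemble classifier to data imbalance. Experimental results show that the proposed method effectively addresses the small-sample problem and the imbalanced data distribution problem and achieves good detection performance.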

11.
Many simple and complex methods have been developed to solve the classification problem. Boosting is one of the best-known techniques for improving the accuracy of classifiers. However, boosting is prone to overfitting with noisy data and the final model is difficult to interpret. Some boosting methods, including AdaBoost, are also very sensitive to outliers. In this article we propose a new method, GA-Ensemble, which directly solves for the set of weak classifiers and their associated weights using a genetic algorithm. The genetic algorithm uses a new penalized fitness function that limits the number of weak classifiers and controls the effects of outliers by maximizing an appropriately chosen $p$th percentile of margins. We compare the test set error rates of GA-Ensemble, AdaBoost, and GentleBoost (an outlier-resistant version of AdaBoost) using several artificial data sets and real-world data sets from the UC-Irvine Machine Learning Repository. GA-Ensemble is found to be more resistant to outliers and to produce simpler predictive models than AdaBoost and GentleBoost.
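
To make the "percentile of margins" idea concrete, the following Python sketch computes normalized voting margins for a weighted ensemble and takes a nearest-rank percentile of them. It is only an illustration under assumptions: labels and weak-classifier votes are taken to be ±1, the margin is normalized by the sum of absolute weights, and the abstract does not give the actual penalized fitness function, so none is reproduced here. The function names are hypothetical.

```python
import math


def margins(weights, votes, labels):
    """Normalized margin of each example under a weighted vote.

    weights : one weight per weak classifier
    votes   : votes[j][i] is classifier j's prediction (+1 or -1) for example i
    labels  : true labels (+1 or -1)
    A positive margin means the weighted vote is correct; a value near 1
    means almost all of the weight agrees with the true label.
    """
    total = sum(abs(w) for w in weights)
    return [
        y * sum(w * h[i] for w, h in zip(weights, votes)) / total
        for i, y in enumerate(labels)
    ]


def percentile(values, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Maximizing a low percentile rather than the minimum margin means a few badly misclassified points, such as outliers, do not dominate the objective.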

12.
Previous studies on financial distress prediction (FDP) mostly construct FDP models based on a balanced data set, or use only traditional classification methods for FDP modelling based on an imbalanced data set, which often results in an overestimation of an FDP model’s recognition ability for distressed companies. Our study focuses on support vector machine (SVM) methods for FDP based on imbalanced data sets. We propose a new imbalance-oriented SVM method that combines the synthetic minority over-sampling technique (SMOTE) with the Bagging ensemble learning algorithm and uses SVM as the base classifier. It is named SMOTE-Bagging-based SVM-ensemble (SB-SVM-ensemble), and is theoretically more effective for FDP modelling based on imbalanced data sets with a limited number of samples. For comparison, the traditional SVM method and three classical imbalance-oriented SVM methods, namely cost-sensitive SVM, SMOTE-SVM, and data-set-partition-based SVM-ensemble, are also introduced. We collect an imbalanced data set for FDP from Chinese publicly traded companies, and carry out 100 experiments to empirically test the effectiveness of the method. The experimental results indicate that the new SB-SVM-ensemble method outperforms the traditional methods and is a useful tool for imbalanced FDP modelling.
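
The SMOTE step mentioned in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the SB-SVM-ensemble itself: it shows only how synthetic minority samples are interpolated between a minority point and one of its nearest minority neighbours, and the function name, parameters and defaults are assumptions. Bagging and the SVM base classifier are omitted.

```python
import math
import random


def smote(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic samples from a list of minority-class points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

In a SMOTE-plus-Bagging scheme, one plausible arrangement is to oversample the minority class in this way inside each bootstrap sample before training the base classifier; the abstract does not specify the exact ordering.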

13.
14.
This paper gives an overview of the eigenvalue problems encountered in areas of data mining that are related to dimension reduction. Given some input high‐dimensional data, the goal of dimension reduction is to map them to a low‐dimensional space such that certain properties of the original data are preserved. Optimizing these properties among the reduced data can typically be posed as a trace optimization problem that leads to an eigenvalue problem. There is a rich variety of such problems and the goal of this paper is to unravel relationships between them as well as to discuss effective solution techniques. First, we make a distinction between projective methods that determine an explicit linear mapping from the high‐dimensional space to the low‐dimensional space, and nonlinear methods where the mapping between the two is nonlinear and implicit. Then, we show that all the eigenvalue problems solved in the context of explicit linear projections can be viewed as the projected analogues of the nonlinear or implicit projections. We also discuss kernels as a means of unifying linear and nonlinear methods and revisit some of the equivalences between methods established in this way. Finally, we provide some illustrative examples to showcase the behavior and the particular characteristics of the various dimension reduction techniques on real‐world data sets. Copyright © 2010 John Wiley & Sons, Ltd.

15.
Mathematical programming (MP) discriminant analysis models can be used to develop classification models for assigning observations of unknown class membership to one of a number of specified classes using values of a set of features associated with each observation. Since most MP discriminant analysis models generate linear discriminant functions, these MP models are generally used to develop linear classification models. Nonlinear classifiers may, however, have better classification performance than linear classifiers. In this paper, a mixed integer programming model is developed to generate nonlinear discriminant functions composed of monotone piecewise-linear marginal utility functions for each feature and the cut-off value for class membership. It is also shown that this model can be extended for feature selection. The performance of this new MP model for two-group discriminant analysis is compared with statistical discriminant analysis and other MP discriminant analysis models using a real problem and a number of simulated problem sets.

16.
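This paper examines the application of a local linear mapping algorithm based on phase-space reconstruction to noise reduction for nonlinear time series, and gives methods for selecting the main parameters of the algorithm. Experimental results show that the noise reduction achieved by the algorithm is clearly better than that of traditional linear signal filtering. In addition, because the underlying dynamical model of most measured data is unknown, the paper proposes using the correlation dimension of the time series signal, computed before and after noise reduction, as a tool for assessing noise reduction performance, which overcomes the inability of existing methods to quantify the noise reduction level for such signals.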
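
Phase-space reconstruction, on which this method rests, is commonly done by time-delay embedding. The Python sketch below shows that standard construction only; the function name is hypothetical, the embedding dimension and delay are assumed to be given (the abstract does not state how they are chosen), and the local linear mapping and correlation dimension steps are not reproduced.

```python
def delay_embed(series, dim, tau):
    """Time-delay embedding of a scalar series.

    Returns the delay vectors
        [x[i], x[i + tau], ..., x[i + (dim - 1) * tau]]
    for every index i at which the whole vector fits inside the series.
    dim is the embedding dimension and tau is the delay in samples.
    """
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for the requested embedding")
    return [[series[i + k * tau] for k in range(dim)] for i in range(n_vectors)]
```

For example, `delay_embed([1, 2, 3, 4, 5, 6], dim=3, tau=2)` gives the two delay vectors `[1, 3, 5]` and `[2, 4, 6]`.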

17.
One issue in data classification problems is to find an optimal subset of instances to train a classifier. Training sets that represent the characteristics of each class well have a better chance of yielding a successful predictor. There are cases where data are redundant or take large amounts of computing time in the learning process. To overcome this issue, instance selection techniques have been proposed. These techniques remove examples from the data set so that classifiers are built faster and, in some cases, with better accuracy. Some of these techniques are based on nearest neighbors, ordered removal, random sampling and evolutionary methods. The weaknesses of these methods generally involve lack of accuracy, overfitting, lack of robustness when the data set size increases and high complexity. This work proposes a simple and fast immune-inspired suppressive algorithm for instance selection, called SeleSup. According to self-regulation mechanisms, those cells unable to neutralize danger tend to disappear from the organism. Therefore, by analogy, data not relevant to the learning of a classifier are eliminated from the training process. The proposed method was compared with three important instance selection algorithms on a number of data sets. The experiments showed that our mechanism substantially reduces the data set size and is accurate and robust, especially on larger data sets.

18.
This paper presents a novel knowledge-based linear classification model for multi-category discrimination of sets or objects with prior knowledge. The prior knowledge is in the form of multiple polyhedral sets belonging to one or more categories or classes, and it is introduced as additional constraints into the formulation of the Tikhonov linear least squares multi-class support vector machine model. The resulting formulation leads to a least squares problem that can be solved using matrix methods or iterative methods. Investigations include the development of a linear knowledge-based classification model extended to the case of multi-categorical discrimination and expressed as a single unconstrained optimization problem. Advantages of this formulation include explicit expressions for the classification weights of the classifier(s) and the ability to incorporate and handle prior knowledge directly in the classifiers. In addition, it can provide fast solutions for the optimal classification weights for multi-categorical separation without the use of specialized solver software. To evaluate the model, data and prior knowledge from the Wisconsin breast cancer prognosis problem and from two-phase flow regimes in pipes were used to train and test the proposed formulation.

19.
This paper investigates the performance of evolutionary algorithms in the optimization aspects of oblique decision tree construction and describes their performance with respect to classification accuracy, tree size, and Pareto-optimality of their solution sets. The performance of the evolutionary algorithms is analyzed and compared to the performance of exhaustive (traditional) decision tree classifiers on several benchmark datasets. The results show that the classification accuracy and tree sizes generated by the evolutionary algorithms are comparable with the results generated by traditional methods in all the sample datasets, and that in the large datasets the multiobjective evolutionary algorithms generate better Pareto-optimal sets than the exhaustive methods. The results also show that a classifier, whether exhaustive or evolutionary, that generates the most accurate trees does not necessarily generate the shortest trees or the best Pareto-optimal sets.

20.
In machine learning problems, the availability of several classifiers trained on different data or features makes the combination of pattern classifiers of great interest. To combine distinct sources of information, it is necessary to represent the outputs of classifiers in a common space via a transformation called calibration. The most classical way is to use class membership probabilities. However, using a single probability measure may be insufficient to model the uncertainty induced by the calibration step, especially when few training data are available. In this paper, we extend classical probabilistic calibration methods to the evidential framework. Experimental results from the calibration of SVM classifiers show the interest of using belief functions in classification problems.
