Similar Documents
20 similar documents found.
1.
Random forest (RF) methodology is a nonparametric approach to prediction problems. A standard way to use RFs is to grow a single global RF and use it to predict all test cases of interest. In this article, we propose growing different RFs specific to different test cases, namely case-specific random forests (CSRFs). In contrast to the bagging procedure used in building standard RFs, the CSRF algorithm takes weighted bootstrap resamples to create individual trees, assigning large weights a priori to the training cases in close proximity to the test case of interest. Tuning methods are discussed to avoid overfitting. Both simulation and real-data examples show that the weighted bootstrap resampling used in CSRF construction can improve predictions for specific cases. We also propose a new case-specific variable importance (CSVI) measure for comparing the relative importance of predictor variables in predicting a particular case. The idea of building a predictor case-specifically may generalize to other areas.
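A minimal sketch of the CSRF idea: each tree is grown on a weighted bootstrap resample whose weights concentrate on training cases near the test case. The Gaussian distance kernel, the function names, and the regression setting are illustrative assumptions, not the authors' exact scheme; the `bandwidth` parameter stands in for the tuning the paper discusses.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def csrf_predict(X_train, y_train, x_test, n_trees=100, bandwidth=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    # Proximity-based resampling weights: training cases near the test case
    # get larger weights (Gaussian kernel on distance, an assumed choice).
    d2 = ((X_train - x_test) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    w /= w.sum()
    preds = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=n, replace=True, p=w)   # weighted bootstrap
        tree = DecisionTreeRegressor(max_features="sqrt",
                                     random_state=int(rng.integers(2**31 - 1)))
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(x_test.reshape(1, -1))[0])
    return float(np.mean(preds))
```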

2.
RNA-sample pooling is sometimes inevitable, but it should be avoided in classification tasks such as biomarker studies. Our simulation framework investigates a two-class classification study based on gene expression profiles to show how strongly the outcomes of single-sample designs differ from those of pooling designs. The results show how the effects of pooling depend on pool size, the discriminating pattern, the number of informative features, and the statistical learning method used (support vector machines with linear and radial kernels, random forest (RF), linear discriminant analysis, powered partial least squares discriminant analysis (PPLS-DA), and partial least squares discriminant analysis (PLS-DA)). As measures of the pooling effect, we consider the prediction error (PE) and the coincidence of the important feature sets for classification based on PLS-DA, PPLS-DA, and RF. In general, PPLS-DA and PLS-DA show constant PE with increasing pool size and low PE for patterns in which the convex hull of one class is not a cover of the other class. The coincidence of important feature sets is larger for PLS-DA and PPLS-DA than for RF. RF shows the best results for patterns in which the convex hull of one class is a cover of the other class, but these results depend strongly on the pool size. We complete the PE results with experimental data that we pool artificially. The PE of PPLS-DA and PLS-DA is again low and least influenced by pooling. Additionally, we show under which assumption the PLS-DA loading weights, as a measure of feature importance for classification, are equal across the different designs.
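An illustrative sketch (not the authors' framework) of simulating the pooling effect: profiles within a class are averaged in groups of `pool_size` before training, and prediction error (PE) is compared across pool sizes. The data pattern, the linear-kernel classifier, and the sizes are placeholder assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def make_class(n, shift, p=200, informative=10):
    X = rng.normal(size=(n, p))
    X[:, :informative] += shift         # informative features carry the shift
    return X

def pool(X, pool_size):
    n = (len(X) // pool_size) * pool_size
    return X[:n].reshape(-1, pool_size, X.shape[1]).mean(axis=1)

for pool_size in (1, 2, 4, 8):
    X0, X1 = make_class(64, 0.0), make_class(64, 1.0)
    Xp = np.vstack([pool(X0, pool_size), pool(X1, pool_size)])
    yp = np.repeat([0, 1], len(Xp) // 2)
    pe = 1 - cross_val_score(SVC(kernel="linear"), Xp, yp, cv=5).mean()
    print(f"pool size {pool_size}: PE ~ {pe:.2f}")
```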

3.
We consider linear programming approaches to support vector machines (SVM). The linear programming problems are introduced as approximations of the quadratic programming problems commonly used in SVM. When we consider kernel-based nonlinear discriminators, the approximation can be viewed as kernel principal component analysis, which extracts an important subspace from the feature space characterized by the kernel function. We show that any data points projected nonlinearly and implicitly into the feature space by kernel functions can be approximately expressed as points lying explicitly in a low-dimensional Euclidean space, which enables us to develop linear programming formulations for nonlinear discriminators. We also introduce linear programming formulations for multicategory classification problems and show that the same maximal-margin principle exploited in SVM can be incorporated into them. Moreover, by extracting the low-dimensional feature subspace, we can generate nonlinear multicategory discriminators by solving linear programming problems. Numerical experiments on real-world datasets are presented. We show that a fairly low-dimensional feature subspace can achieve reasonable accuracy, and that the linear programming formulations compute discriminators efficiently. We also discuss a sampling strategy that may be crucial for huge datasets.
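A minimal linear-programming SVM in the spirit of the paper: the quadratic margin term is replaced by an L1 norm so the whole soft-margin problem becomes an LP, solvable with `scipy.optimize.linprog`. Variable splitting w = u − v keeps everything nonnegative. This is a generic L1-SVM sketch, not the authors' exact formulation; for the kernel case, the columns of X would be the low-dimensional KPCA features the abstract describes.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, C=1.0):
    # Labels y must be +/-1. Decision variables: [u (p), v (p), b (1), xi (n)].
    n, p = X.shape
    c = np.concatenate([np.ones(2 * p), [0.0], C * np.ones(n)])
    # Margin constraints rewritten for A_ub x <= b_ub:
    #   -y_i((u - v)  x_i + b) - xi_i <= -1
    A = np.hstack([-y[:, None] * X, y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (2 * p) + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A, b_ub=b_ub, bounds=bounds, method="highs")
    u, v, b = res.x[:p], res.x[p:2 * p], res.x[2 * p]
    return u - v, b   # predict with sign(X_new @ w + b)
```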

4.
The curse of dimensionality reflects the fact that high-dimensional data are often difficult to work with: a large number of features can increase the noise in the data and thus the error of a learning algorithm. Feature selection is a solution for such problems when the dimensionality of the data must be reduced. Different feature selection algorithms, however, may yield feature subsets that are only local optima in the space of feature subsets. Ensemble feature selection combines independent feature subsets and may give a better approximation to the optimal subset of features. We propose an ensemble feature selection approach based on assessing the reliability of the feature selectors. It aims at providing a unique and stable feature selection without ignoring predictive accuracy. A classification algorithm is used as an evaluator to assign a confidence to the features selected by ensemble members, based on the associated classification performance. We compare our approach to several existing techniques and to individual feature selection algorithms. Results show that our approach often improves both classification performance and feature selection stability on high-dimensional data sets.
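A sketch of reliability-weighted ensemble feature selection: each base selector ranks the features, an evaluator classifier scores the subset it proposes, and the rankings are aggregated with those accuracies as confidences. The two base selectors, the evaluator, and the accuracy-weighted Borda aggregation are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ensemble_select(X, y, k=20):
    scores = {
        "mutual_info": mutual_info_classif(X, y),
        "anova_f": f_classif(X, y)[0],
    }
    agg = np.zeros(X.shape[1])
    for s in scores.values():
        top = np.argsort(s)[::-1][:k]
        # Reliability: CV accuracy of an evaluator on this selector's subset.
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, top], y, cv=5).mean()
        rank = np.argsort(np.argsort(-s))   # 0 = best feature
        agg += acc * (X.shape[1] - rank)    # accuracy-weighted Borda count
    return np.argsort(agg)[::-1][:k]        # indices of the k agreed features
```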

5.
Combining multiple classifiers, known as ensemble learning, can substantially improve the prediction performance of learning algorithms, especially in the presence of non-informative features in the data. We propose an ensemble of subsets of kNN classifiers, ESkNN, built in two steps. First, we choose classifiers based on their individual performance using out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets, with their original features and with added non-informative features, to evaluate the method. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest, and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
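A sketch of the two-step ESkNN construction: many kNN learners are built on random feature subsets, ranked by individual accuracy, then added greedily while the ensemble improves on a validation set. The subset size, the single held-out set used for both steps, and the binary 0/1 labels are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def esknn(X, y, n_models=50, subset=0.5, seed=0):
    rng = np.random.default_rng(seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=0)
    p = X.shape[1]
    models = []
    for _ in range(n_models):
        feats = rng.choice(p, size=max(1, int(subset * p)), replace=False)
        knn = KNeighborsClassifier().fit(X_tr[:, feats], y_tr)
        models.append((knn.score(X_val[:, feats], y_val), feats, knn))
    models.sort(key=lambda m: -m[0])        # step 1: rank by accuracy
    chosen, best = [], -1.0
    for acc, feats, knn in models:          # step 2: greedy forward selection
        cand = chosen + [(feats, knn)]
        # Majority vote, assuming binary 0/1 labels.
        votes = np.mean([m.predict(X_val[:, f]) for f, m in cand], axis=0)
        ens_acc = np.mean((votes > 0.5).astype(int) == y_val)
        if ens_acc > best:
            chosen, best = cand, ens_acc
    return chosen
```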

6.
In this paper we use counting arguments to prove that the expected percentage coverage of a d-dimensional parameter space of size n, when performing k trials with either Latin hypercube sampling or orthogonal-array-based Latin hypercube sampling, is the same. We then extend these results to an experimental design setting by projecting onto a t < d dimensional subspace. These results are confirmed by simulations. The theory presented has both theoretical and practical significance in modelling and simulation science when sampling over high-dimensional spaces.
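A small simulation in the spirit of this coverage result: draw a Latin hypercube sample, project onto t of the d dimensions, and estimate the fraction of the nᵗ grid cells covered, averaged over replicates. The sizes are arbitrary placeholder choices.

```python
import numpy as np
from scipy.stats import qmc

d, t, n, replicates = 5, 2, 10, 2000
rng = np.random.default_rng(1)
cover = []
for _ in range(replicates):
    sample = qmc.LatinHypercube(d=d, seed=int(rng.integers(2**31 - 1))).random(n)
    cells = np.floor(sample[:, :t] * n).astype(int)   # project, then bin
    cover.append(len({tuple(c) for c in cells}) / n ** t)
print(f"mean coverage of the projected {n}x{n} grid: {np.mean(cover):.4f}")
```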

7.
张向荣 《运筹与管理》2021,30(1):184-191
Heterogeneity among financial indicators is a major factor limiting the accuracy of corporate financial distress prediction, and existing multiple kernel learning methods can address such heterogeneous-data learning problems. This paper first introduces a theoretical framework for financial distress prediction based on subspace multiple kernel learning. Building on the maximum-variance criterion, the class-separability-maximization criterion, and the principle of nonlinear subspace mapping, it then proposes three subspace multiple kernel learning methods: maximum-variance-projection subspace multiple kernel learning, class-separability-maximization subspace multiple kernel learning, and nonlinear subspace multiple kernel learning. Experiments on data collected from Chinese listed companies compare the proposed methods with representative existing financial distress prediction methods, and the results are analyzed. The experiments show that the proposed framework is effective: the subspace multiple kernel learning predictors constructed within it significantly improve the accuracy of financial distress prediction.
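An illustrative sketch of the "subspace + multiple kernels" pattern the paper builds on: project the indicators onto a maximum-variance subspace (here plain PCA) and combine several candidate kernels there for an SVM. The uniform kernel weights and kernel choices are placeholder assumptions, not the paper's learned combination.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel
from sklearn.svm import SVC

def combined_kernel(A, B):
    # Unweighted average of three candidate kernels (an assumption).
    return (rbf_kernel(A, B, gamma=0.1) + rbf_kernel(A, B, gamma=1.0)
            + linear_kernel(A, B)) / 3.0

def subspace_mkl_fit(X, y, n_components=10):
    pca = PCA(n_components=n_components).fit(X)   # max-variance subspace
    Z = pca.transform(X)
    clf = SVC(kernel="precomputed").fit(combined_kernel(Z, Z), y)
    return pca, clf, Z

def subspace_mkl_predict(pca, clf, Z_train, X_new):
    Zn = pca.transform(X_new)
    return clf.predict(combined_kernel(Zn, Z_train))
```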

8.
Feature data obtained for the same object from different sources or at different levels are called multi-view data. Multi-view learning is a machine learning paradigm that builds models from such multi-view data. Extensive research has shown that learning jointly from multiple views can significantly improve model performance, and many models and algorithms have been proposed accordingly. Multi-view learning generally follows two principles: consensus and complementarity. Based on the consensus principle, Farquhar et al. successfully integrated the support vector machine (SVM) and kernel canonical correlation analysis (KCCA) into a single optimization problem, yielding the SVM-2K model. However, SVM-2K does not fully exploit the complementary information across views. Building on SVM-2K, this paper proposes a margin-transfer-based multi-view support vector machine (M^2SVM) that satisfies both the consensus and the complementarity principles of multi-view learning. A theoretical analysis from the consensus perspective and a comparison with SVM-2K show that M^2SVM is more flexible than SVM-2K. Finally, experiments on a large number of multi-view data sets verify the effectiveness of M^2SVM.

9.
Principal component analysis (PCA) is a canonical tool that reduces data dimensionality by finding linear transformations that project the data into a lower-dimensional subspace while preserving the variability of the data. Selecting the number of principal components (PCs) is essential but challenging for PCA, since it is an unsupervised learning problem without a clear target label at the sample level. In this article, we propose a new method to determine the optimal number of PCs based on the stability of the space spanned by the PCs. A series of analyses with both synthetic and real data demonstrates the superior performance of the proposed method.
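A sketch of choosing the number of PCs by subspace stability: for each candidate k, compare the span of the top-k PCs on the full data against bootstrap resamples via principal angles. The mean-cosine stability statistic is an assumption for illustration, not the paper's exact criterion.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.linalg import subspace_angles

def pc_stability(X, k_max=10, n_boot=30, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    stab = {}
    for k in range(1, k_max + 1):
        V_full = PCA(n_components=k).fit(X).components_.T
        sims = []
        for _ in range(n_boot):
            Xb = X[rng.choice(n, size=n, replace=True)]
            Vb = PCA(n_components=k).fit(Xb).components_.T
            # Mean cosine of principal angles: 1 = identical subspaces.
            sims.append(np.cos(subspace_angles(V_full, Vb)).mean())
        stab[k] = float(np.mean(sims))
    return stab   # pick the largest k whose subspace stays stable
```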

10.
Educational data mining tasks such as personalized exercise recommendation, question difficulty prediction, and learner modeling rely on student response data and on questions annotated with the knowledge points they assess; at present these annotations are produced manually. Automatically labeling the knowledge points of questions with machine learning is therefore an urgent need. For automatic annotation over massive question banks, this paper proposes an ensemble learning method for multi-knowledge-point labeling. First, the labeling problem is formally defined, and a knowledge graph of knowledge points, built from textbook tables of contents and domain knowledge, serves as the label space. Second, multiple support vector machines are trained as base classifiers using an ensemble learning approach; the best-performing base classifiers are selected and combined into the multi-knowledge-point labeling model. Finally, using high-school mathematics questions from the database of an online education platform as the experimental data set, the proposed method predicts the knowledge points each question assesses and achieves good results.
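A sketch of the ensemble tagging step described above: knowledge points become multi-hot labels, several SVM base classifiers are trained on bootstrap samples, and the best performers on a validation split vote on the tags. The feature representation, the micro-F1 filter, and the 0.5 voting threshold are placeholder assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def train_tagger_ensemble(X, Y, n_base=10, keep=0.5, seed=0):
    # X: question feature vectors; Y: multi-hot knowledge-point matrix.
    rng = np.random.default_rng(seed)
    X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.25,
                                                random_state=0)
    scored = []
    for _ in range(n_base):
        idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
        clf = OneVsRestClassifier(LinearSVC()).fit(X_tr[idx], Y_tr[idx])
        scored.append((f1_score(Y_val, clf.predict(X_val), average="micro"),
                       clf))
    scored.sort(key=lambda s: -s[0])        # keep the strongest base models
    return [clf for _, clf in scored[: max(1, int(keep * n_base))]]

def predict_tags(ensemble, X_new):
    votes = np.mean([clf.predict(X_new) for clf in ensemble], axis=0)
    return (votes >= 0.5).astype(int)       # majority vote per knowledge point
```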

11.
Techniques for machine learning have been extensively studied in recent years as effective tools in data mining. Although there have been several approaches to machine learning, in this paper we focus on mathematical programming approaches, in particular multi-objective and goal programming (MOP/GP). Among machine learning methods, the support vector machine (SVM) has recently gained much popularity. In pattern classification problems with two class sets, its idea is to find a maximal-margin separating hyperplane, which gives the greatest separation between the classes in a high-dimensional feature space. This task is performed by solving a quadratic programming problem in the traditional formulation, and can be reduced to solving a linear programming problem in another formulation. However, the idea of maximal-margin separation is not new: in the 1960s the multi-surface method (MSM) was suggested by Mangasarian, and in the 1980s linear classifiers using goal programming were developed extensively. This paper presents an overview of how effectively MOP/GP techniques can be applied to machine learning methods such as SVM, and discusses their problems.

12.
Supervised classification learning can be considered an important tool for decision support. In this paper, we present a method for supervised classification learning that ensembles decision trees obtained via convex sets of probability distributions (also called credal sets) and uncertainty measures. Our method forces the use of different decision trees and has the following main characteristics: it obtains a good percentage of correct classifications and an improvement in processing time compared with known classification methods; it does not need the number of decision trees to be fixed in advance; and it can be parallelized for application to very large data sets.

13.
Digital soil mapping (DSM) increasingly makes use of machine learning algorithms to identify relationships between soil properties and multiple covariates that can be detected across landscapes. Selecting the appropriate algorithm for model building is critical for optimizing results in the context of the available data. Over the past decade, many studies have tested different machine learning (ML) approaches on a variety of soil data sets. Here, we review the application of some of the most popular ML algorithms for digital soil mapping. Specifically, we compare the strengths and weaknesses of multiple linear regression (MLR), k-nearest neighbors (KNN), support vector regression (SVR), Cubist, random forest (RF), and artificial neural networks (ANN) for DSM. These algorithms were compared on the basis of five factors: (1) number of hyperparameters, (2) sample size, (3) covariate selection, (4) learning time, and (5) interpretability of the resulting model. If training time is a limitation, then algorithms with fewer model parameters and hyperparameters should be considered, e.g., MLR, KNN, SVR, and Cubist. If the data set is large (thousands of samples) and computation time is not an issue, ANN would likely produce the best results. If the data set is small (<100), then Cubist, KNN, RF, and SVR are likely to perform better than ANN and MLR. The uncertainty in predictions produced by Cubist, KNN, RF, and SVR may not decrease with larger data sets. When interpretability of the resulting model is important to the user, Cubist, MLR, and RF are more appropriate algorithms, as they do not function as "black boxes." There is no single correct approach for producing models that predict the spatial distribution of soil properties. Nonetheless, some algorithms are more appropriate than others given the nature of the data and the purpose of the mapping activity.
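A sketch of the kind of comparison the review describes: the same soil-property regression task evaluated across several of the discussed learners with cross-validated RMSE. The synthetic covariates, target, and model settings are placeholders; Cubist has no standard scikit-learn implementation and is omitted here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

models = {
    "MLR": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "SVR": SVR(),
    "RF":  RandomForestRegressor(n_estimators=200, random_state=0),
    "ANN": MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
}

# Stand-in covariates and soil property; replace with real DSM data.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 12))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=150)

for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: CV RMSE = {rmse:.3f}")
```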

14.
During the last years, kernel-based methods have proved very successful for many real-world learning problems. One of the main reasons for this success is their efficiency on large data sets, which results from the fact that kernel methods like support vector machines (SVM) are based on a convex optimization problem. Solving a new learning problem can now often be reduced to choosing an appropriate kernel function and kernel parameters. However, it can be shown that even the most powerful kernel methods can still fail on quite simple data sets when the feature space induced by the kernel function is not sufficient. In these cases, an explicit feature space transformation or detection of latent variables has proved more successful. Since such explicit feature construction is often not feasible for large data sets, the ultimate goal for efficient kernel learning would be the adaptive creation of new and appropriate kernel functions. It cannot, however, be guaranteed that such a kernel function still leads to a convex optimization problem for support vector machines. Therefore, we have to enhance the optimization core of the learning method itself before we can use it with arbitrary, i.e., non-positive semidefinite, kernel functions. This article motivates the use of appropriate feature spaces and discusses the possible consequences leading to non-convex optimization problems. We show that these non-convex SVM variants are at least as accurate as their quadratic programming counterparts on eight real-world benchmark data sets in terms of generalization performance, and that they always outperform traditional approaches in terms of the original optimization problem. Additionally, the proposed algorithm is more generic than existing traditional solutions, since it also works for non-positive semidefinite or indefinite kernel functions.
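The paper trains SVMs on indefinite kernels directly with a non-convex solver, which is not reproduced here. As a point of contrast, this sketch shows the common *spectrum clipping* workaround: eigendecompose the indefinite Gram matrix, zero out the negative eigenvalues, and feed the repaired PSD matrix to a standard SVM. The toy kernel and data are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def clip_spectrum(K):
    # Symmetrize, then remove the negative part of the spectrum.
    K = (K + K.T) / 2.0
    w, V = np.linalg.eigh(K)
    return (V * np.clip(w, 0, None)) @ V.T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
K = np.sin(X @ X.T)            # an indefinite "kernel" for demonstration
clf = SVC(kernel="precomputed").fit(clip_spectrum(K), y)
```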

15.
Automatic construction of decision trees for classification
An algorithm is described for learning decision trees for classification and prediction that converts real-valued attributes into intervals using statistical considerations. The trees are automatically pruned with the help of a threshold on the estimated class probabilities in an interval. By means of this threshold the user can control the complexity of the tree, i.e. the degree of approximation of class regions in feature space. Costs can be included in the learning phase if a cost matrix is given; in this case class-dependent thresholds are used. Some applications are described, in particular the task of predicting the high-water level in a mountain river.
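A rough sketch of the interval-building step the abstract describes, under assumed details: a real-valued attribute is cut into equal-frequency candidate intervals, and interior cut points are dropped while either adjacent interval's estimated class probability stays below the user threshold. The merging rule is an illustration, not the paper's algorithm.

```python
import numpy as np

def discretize(x, y, n_bins=10, threshold=0.8):
    # Equal-frequency candidate cut points (duplicates removed).
    edges = list(np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1))))

    def purity(lo, hi):
        # Estimated probability of the majority class inside [lo, hi].
        m = (x >= lo) & (x <= hi)
        if m.sum() == 0:
            return 1.0
        _, counts = np.unique(y[m], return_counts=True)
        return counts.max() / counts.sum()

    i = 1
    while i < len(edges) - 1:
        # Merge impure neighbours by deleting the interior cut point.
        if (purity(edges[i - 1], edges[i]) < threshold
                or purity(edges[i], edges[i + 1]) < threshold):
            del edges[i]
        else:
            i += 1
    return edges   # interval boundaries for the attribute
```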

16.
孙永明  杨进 《经济数学》2020,37(4):148-158
Psychological stress is an increasingly serious problem, yet collecting physiological data to assess it is costly and subjective. We propose a new smartphone-data-based stress assessment method, BSTL+XGDT (Borderline1 SMOTE Tomeklinks + eXtreme Gradient Boosting), which classifies stress into five levels. Features are first extracted from the smartphone data to build samples, which are resampled with BSTL; XGDT then filters the features and RFE (Recursive Feature Elimination) screens them. XGDT, support vector machines (SVC), random forest (RF), k-nearest neighbors (KNN), decision tree (DT), multilayer perceptron (MLP), and label spreading (LS) are trained on the data before and after resampling and before and after feature screening; the results show that BSTL+XGDT outperforms the other methods.
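A sketch of the BSTL+XGDT pipeline described above, using the imbalanced-learn and xgboost packages: Borderline-SMOTE oversampling followed by Tomek-link cleaning, then RFE-based feature screening with XGBoost as the estimator. The parameter values and number of selected features are placeholder assumptions.

```python
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

def bstl_xgdt(X, y, n_features=30):
    # BSTL: Borderline-1 SMOTE oversampling, then Tomek-link removal.
    X_res, y_res = BorderlineSMOTE(kind="borderline-1",
                                   random_state=0).fit_resample(X, y)
    X_res, y_res = TomekLinks().fit_resample(X_res, y_res)
    # RFE screening with an XGBoost estimator, then the final classifier.
    base = XGBClassifier(n_estimators=200, eval_metric="mlogloss")
    rfe = RFE(base, n_features_to_select=n_features).fit(X_res, y_res)
    clf = XGBClassifier(n_estimators=200, eval_metric="mlogloss")
    clf.fit(rfe.transform(X_res), y_res)
    return rfe, clf   # predict with clf.predict(rfe.transform(X_new))
```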

17.
Subspace clustering of finite samples and its large-scale computation are the main challenges facing subspace learning, and most existing models are unsuitable for large-scale computation. This paper proposes a new optimization model that combines spectral-projection feedback with side-information optimization. While improving the model's learning ability, it adopts an efficient piecewise sign-update algorithm suitable for large-scale computation. Using larger-scale simulated and real examples, we analyze and verify that the new optimization model and its fast algorithm outperform existing models and algorithms.

18.
Dimensionality reduction is used to preserve significant properties of data in a low-dimensional space. In particular, data representation in a lower dimension is needed in applications where information comes from multiple high-dimensional sources. Data integration, however, is a challenge in itself. In this contribution, we consider a general framework for performing dimensionality reduction on heterogeneous data. We propose a novel approach, called Deep Kernel Dimensionality Reduction, designed to learn layers of new compact data representations simultaneously. The method can also be used to learn shared representations between modalities. We show by experiments on standard and on real large-scale biomedical data sets that the proposed method embeds data in a new compact and meaningful representation, and leads to a lower classification error compared with state-of-the-art methods.
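As a rough structural analogue of layered kernel representations (not the authors' Deep Kernel Dimensionality Reduction method), this sketch stacks KernelPCA layers so each layer re-embeds the previous compact representation. Layer sizes and kernels are assumptions.

```python
from sklearn.decomposition import KernelPCA

def layered_kdr(X, dims=(50, 10), gamma=0.1):
    layers, Z = [], X
    for d in dims:
        # Each layer learns a compact kernel embedding of the previous one.
        kpca = KernelPCA(n_components=d, kernel="rbf", gamma=gamma).fit(Z)
        Z = kpca.transform(Z)
        layers.append(kpca)
    return layers, Z   # apply layers in order to transform new data
```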

19.
Feature selection consists of obtaining a subset of features that realizes the task optimally, without the irrelevant ones. Since it can provide faster and more cost-effective learning machines and improve the prediction performance of predictors, it is a crucial step in machine learning. Feature selection methods using support vector machines have obtained satisfactory results, but noise and outliers often reduce their performance. In this paper, we propose a feature selection approach using fuzzy support vector machines and compare it with previous work; the results of experiments on the UCI data sets show that feature selection using the fuzzy SVM obtains better results than using the standard SVM.
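A sketch of the fuzzy-SVM idea for feature selection: each training point gets a fuzzy membership (here based on distance to its class centroid, an assumed rule) so that noise and outliers carry less weight, and features are then ranked by the magnitude of the linear SVM weights. Binary labels are assumed.

```python
import numpy as np
from sklearn.svm import SVC

def fuzzy_svm_rank(X, y):
    # Membership: points far from their class centroid are down-weighted.
    m = np.empty(len(X))
    for c in np.unique(y):
        d = np.linalg.norm(X[y == c] - X[y == c].mean(axis=0), axis=1)
        m[y == c] = 1.0 - d / (d.max() + 1e-12)
    clf = SVC(kernel="linear").fit(X, y, sample_weight=m + 1e-3)
    w = np.abs(clf.coef_).ravel()       # binary case: one weight vector
    return np.argsort(w)[::-1]          # feature indices by importance
```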

20.
Given a set of vectors (the data) in a Hilbert space ℋ, we prove the existence of an optimal collection of subspaces minimizing the sum of the squared distances between each vector and its closest subspace in the collection. This collection of subspaces gives the best sparse representation for the given data, in a sense defined in the paper, and provides an optimal model for sampling in a union of subspaces. The results are proved in a general setting and then applied to the case of low-dimensional subspaces of ℝ^N and to infinite-dimensional shift-invariant spaces in L²(ℝ^d). We also present an iterative search algorithm for finding the solution subspaces. These results are tightly connected to the emerging theories of compressed sensing and dictionary design, signal models for signals with finite rate of innovation, and the subspace segmentation problem.
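A sketch of an iterative search of the kind the paper describes, in the finite-dimensional ℝ^N case (a K-subspaces style alternation): assign each vector to its nearest subspace, then refit each subspace as the top singular subspace of its assigned points. Dimensions, initialization, and the linear (through-the-origin) subspaces are illustrative assumptions.

```python
import numpy as np

def k_subspaces(X, n_subspaces=3, dim=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Random orthonormal bases to start; columns span each subspace.
    bases = [np.linalg.qr(rng.normal(size=(p, dim)))[0]
             for _ in range(n_subspaces)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each vector to the subspace with the smallest residual.
        resid = np.stack([np.linalg.norm(X - (X @ B) @ B.T, axis=1)
                          for B in bases], axis=1)
        new = resid.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
        # Refit: the best dim-dimensional subspace for a group is spanned by
        # its top right singular vectors.
        for j in range(n_subspaces):
            P = X[labels == j]
            if len(P) >= dim:
                _, _, Vt = np.linalg.svd(P, full_matrices=False)
                bases[j] = Vt[:dim].T
    return labels, bases
```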

