Similar Literature (20 results)
1.
Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features, or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study investigating the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases where overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed features can achieve promising classification performance.
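As a rough illustration of the idea only (not the authors' system), the sketch below grows random expression trees over the original features, scores each constructed feature by the cross-validated accuracy of a simple classifier, and keeps the best few. The operator set, tree depth, classifier, and all parameters are assumptions.

```python
import random
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

OPS = {'+': np.add, '-': np.subtract, '*': np.multiply}

def random_tree(n_features, depth=3):
    """Grow a random expression tree; leaves are indices of original features."""
    if depth == 0 or random.random() < 0.3:
        return random.randrange(n_features)
    op = random.choice(list(OPS))
    return (op, random_tree(n_features, depth - 1),
                random_tree(n_features, depth - 1))

def evaluate(tree, X):
    """Evaluate a tree on data matrix X, producing one constructed feature."""
    if isinstance(tree, int):
        return X[:, tree]
    op, left, right = tree
    return OPS[op](evaluate(left, X), evaluate(right, X))

def construct_features(X, y, n_trees=200, keep=5, seed=0):
    """Score random trees by CV accuracy of a simple classifier; keep the best."""
    random.seed(seed)
    scored = []
    for _ in range(n_trees):
        t = random_tree(X.shape[1])
        f = evaluate(t, X).reshape(-1, 1)
        scored.append((cross_val_score(KNeighborsClassifier(), f, y, cv=3).mean(), t))
    scored.sort(key=lambda s: s[0], reverse=True)
    return np.column_stack([evaluate(t, X) for _, t in scored[:keep]])
```

A real GP system would evolve this population with crossover and mutation rather than sampling trees independently; the random-sampling version is kept only to show how tree evaluation yields constructed features.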

2.
In the Knowledge Discovery Process, classification algorithms are often used to help create models from training data that can be used to predict the classes of untested data instances. While several factors involved in classification algorithms can influence classification results, such as the node splitting measures used in building decision trees, feature selection is often used as a pre-classification step on large data sets to eliminate irrelevant or redundant attributes, increasing computational efficiency and possibly classification accuracy. One important factor common to both feature selection and classification with decision trees is attribute discretization, the process of dividing attribute values into a smaller number of discrete values. In this paper, we present and explore a new hybrid approach, ChiBlur, which combines concepts from the blurring and χ²-based approaches to feature selection with concepts from multi-objective optimization. We compare this new algorithm with algorithms based on the blurring and χ²-based approaches.
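For orientation, the χ²-based side of such hybrids can be sketched with a standard chi-square filter in scikit-learn; this is plain χ² ranking on discretized, non-negative attributes, not ChiBlur itself:

```python
import numpy as np
from sklearn.feature_selection import chi2

def chi2_rank(X_discrete, y):
    """Rank attributes by chi-square statistic against the class labels."""
    scores, _ = chi2(X_discrete, y)   # requires non-negative (e.g. discretized) features
    return np.argsort(scores)[::-1]   # best-scoring attributes first
```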

3.
Obtaining reliable estimates of the parameters of a probabilistic classification model is often challenging because the amount of available training data is limited. In this paper, we present a classification approach based on belief functions that makes the uncertainty resulting from limited amounts of training data explicit and thereby improves classification performance. In addition, we model classification as an active information acquisition problem in which features are sequentially selected by maximizing the expected information gain with respect to the current belief distribution, thus reducing uncertainty as quickly as possible. To this end, we consider different measures of uncertainty for belief functions and provide efficient algorithms for computing them. As a result, only a small subset of features needs to be extracted without negatively impacting the recognition rate. We evaluate our approach on an object recognition task, comparing different evidential and Bayesian methods for obtaining likelihoods from training data and investigating the influence of different uncertainty measures on the feature selection process.
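As a simplified Bayesian stand-in for the evidential criterion (the paper's belief-function uncertainty measures are not reproduced here), the next feature can be chosen by maximizing the expected reduction in Shannon entropy of a discrete class posterior. The discretization and all names below are assumptions.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_info_gain(prior, likelihood):
    """prior: P(class), shape (C,); likelihood[c, v] = P(feature value v | class c)."""
    p_v = prior @ likelihood                         # predictive distribution over values
    gain = entropy(prior)
    for v, pv in enumerate(p_v):
        if pv > 0:
            posterior = prior * likelihood[:, v] / pv    # Bayes update for value v
            gain -= pv * entropy(posterior)
    return gain

def next_feature(prior, likelihoods, used):
    """Greedily pick the unused feature with maximal expected information gain."""
    return max((expected_info_gain(prior, L), j)
               for j, L in enumerate(likelihoods) if j not in used)[1]
```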

4.
The presence of less relevant or highly correlated features often decreases classification accuracy. Feature selection, in which the most informative variables are selected for model generation, is an important step in data-driven modeling. In feature selection one often tries to satisfy multiple criteria, such as feature discriminating power, model performance, or subset cardinality, so a multi-objective formulation of the feature selection problem is more appropriate. In this paper, we propose to use fuzzy criteria in feature selection within a fuzzy decision-making framework. This formulation allows for a more flexible definition of the goals of feature selection and avoids the problem of weighting different goals found in classical multi-objective optimization. The optimization problem is solved using an ant colony optimization algorithm proposed in our previous work. We illustrate the added value of the approach by applying our proposed fuzzy feature selection algorithm to eight benchmark problems.
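A minimal sketch of fuzzy multi-criteria scoring in the Bellman–Zadeh style: each goal (here accuracy and subset size) is mapped to a [0, 1] membership degree and the subset's score is their minimum, so no explicit weights are needed. The membership shapes and thresholds are illustrative assumptions, not the paper's definitions.

```python
def mu_accuracy(acc, lo=0.6, hi=0.95):
    """Satisfaction grows linearly from 0 at acc=lo to 1 at acc=hi."""
    return min(1.0, max(0.0, (acc - lo) / (hi - lo)))

def mu_cardinality(k, n_total):
    """Smaller subsets preferred: full satisfaction at 1 feature, none at all features."""
    return max(0.0, 1.0 - (k - 1) / max(1, n_total - 1))

def fuzzy_score(acc, k, n_total):
    # Min-aggregation: a subset is only as good as its worst-satisfied criterion.
    return min(mu_accuracy(acc), mu_cardinality(k, n_total))
```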

5.
This paper investigates the feature subset selection problem for binary classification using a logistic regression model. We develop a modified discrete particle swarm optimization (PSO) algorithm for the feature subset selection problem. This approach embodies an adaptive feature selection procedure that dynamically accounts for the relevance and dependence of the features included in the feature subset. We compare the proposed methodology with tabu search and scatter search algorithms using publicly available datasets. The results show that the proposed discrete PSO algorithm is competitive in terms of both classification accuracy and computational performance.
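A generic binary PSO with a sigmoid transfer function and a logistic-regression fitness gives the flavor of this family of methods; the paper's adaptive relevance/dependence mechanism is not reproduced, and all parameters below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def subset_fitness(bits, X, y):
    """Cross-validated accuracy of logistic regression on the selected features."""
    mask = bits.astype(bool)
    if not mask.any():
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, mask], y, cv=3).mean()

def binary_pso(X, y, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = (rng.random((n_particles, d)) < 0.5).astype(float)   # 0/1 bit vectors
    vel = rng.normal(0.0, 1.0, (n_particles, d))
    pbest = pos.copy()
    pbest_f = np.array([subset_fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_f.argmax()].copy()
    for _ in range(n_iter):
        r1 = rng.random((n_particles, d))
        r2 = rng.random((n_particles, d))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        prob = 1.0 / (1.0 + np.exp(-vel))                      # sigmoid transfer
        pos = (rng.random((n_particles, d)) < prob).astype(float)
        f = np.array([subset_fitness(p, X, y) for p in pos])
        improved = f > pbest_f
        pbest[improved] = pos[improved]
        pbest_f[improved] = f[improved]
        gbest = pbest[pbest_f.argmax()].copy()
    return gbest.astype(bool)                                  # final feature mask
```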

6.
Feature reduction based on rough set theory is an effective feature selection method in pattern recognition applications. Finding a minimal subset of the original features is inherent in the rough set approach to feature selection. As feature reduction is an NP-hard problem, it is necessary to develop fast optimal or near-optimal feature selection algorithms. This article proposes an exact feature selection algorithm in rough set theory that is efficient in terms of computation time. The proposed algorithm examines a solution tree with a breadth-first strategy, holding pruned nodes in a version of the trie data structure. By the monotonic property of the dependency degree, no subset of a pruned node can be an optimal solution, so detecting these subsets in the trie makes it unnecessary to calculate their dependency degrees. The search on the tree continues until the optimal solution is found. The algorithm is further improved by selecting an initial search level determined by the hill-climbing method instead of searching the tree from the level below the root; the length of the minimal reduct and the size of the data set influence which starting level is more efficient. Experimental results on several standard UCI data sets demonstrate that the proposed algorithm is effective and efficient for data sets with more than 30 features. © 2014 Wiley Periodicals, Inc. Complexity 20: 50–62, 2015
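The quantity driving the pruning can be computed directly: the dependency degree γ_B(D) = |POS_B(D)| / |U| is the fraction of objects whose B-indiscernibility class is pure in the decision attribute, and it never decreases as attributes are added to B. A minimal illustration (the trie-based tree search itself is not shown):

```python
from collections import defaultdict

def dependency_degree(rows, decisions, B):
    """rows: attribute tuples; decisions: class labels; B: attribute indices."""
    blocks = defaultdict(set)                 # decision values seen per B-block
    for r, d in zip(rows, decisions):
        blocks[tuple(r[a] for a in B)].add(d)
    # an object is in the positive region iff its B-block is decision-pure
    pos = sum(1 for r in rows if len(blocks[tuple(r[a] for a in B)]) == 1)
    return pos / len(rows)
```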

7.
In this paper, four methods are proposed for unsupervised feature selection using genetic algorithms. The proposed methods do not use class label information but select a set of features using a task-independent criterion that can preserve the geometric structure (topology) of the original data in the reduced feature space. One component of the fitness function is Sammon's stress function, which tries to preserve the topology of the high-dimensional data when it is reduced to the lower-dimensional space. In this context, in addition to a fitness criterion, we also explore the utility of an unfitness criterion for selecting chromosomes for genetic operations; this ensures higher diversity in the population and helps unfit chromosomes become more fit. We use four different ways to evaluate the quality of the selected features: the Sammon error, the correlation between inter-point distances in the two spaces, a measure of the preservation of the cluster structure found in the original and reduced spaces, and classifier performance. The proposed methods are tested on six real data sets with dimensionality varying between 9 and 60. The selected features are found to be excellent in terms of preserving topology (inter-point geometry), cluster structure, and classifier performance. We do not compare our methods with others because, unlike other methods, we check the quality of the selected features in four different ways, by finding how well they preserve the "structure" of the original data.
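Sammon's stress compares pairwise distances d*_ij in the original space with distances d_ij computed from the selected features: E = (1/Σ_{i<j} d*_ij) Σ_{i<j} (d*_ij − d_ij)² / d*_ij. A direct computation (a sketch of the criterion only, not the genetic algorithm around it):

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_full, X_reduced):
    """Sammon stress between the original space and the selected-feature space."""
    d_star = pdist(X_full)       # condensed pairwise distances, original space
    d = pdist(X_reduced)         # distances using only the selected features
    nz = d_star > 0              # guard against coincident points
    return np.sum((d_star[nz] - d[nz]) ** 2 / d_star[nz]) / d_star[nz].sum()
```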

8.
Combining multiple classifiers, known as ensemble learning, can give substantial improvements in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data. We propose an ensemble of subsets of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based on their individual performance using out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets, with their original features and with added non-informative features, for the evaluation of our method. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forests, and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forests and support vector machines.
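A hedged sketch of the two-step idea: (1) rank kNN models built on random feature subsets by out-of-sample accuracy, (2) grow the ensemble from the best model, keeping each addition only if majority-vote accuracy on a validation set does not drop. Subset size, k, and the acceptance rule are assumptions; integer class labels are assumed for the vote.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def es_knn(X_tr, y_tr, X_oos, y_oos, X_val, y_val, n_models=100, m=None, seed=0):
    rng = np.random.default_rng(seed)
    p = X_tr.shape[1]
    m = m or max(1, int(np.sqrt(p)))
    pool = []
    for _ in range(n_models):                        # step 1: score candidate models
        feats = rng.choice(p, size=m, replace=False)
        clf = KNeighborsClassifier().fit(X_tr[:, feats], y_tr)
        pool.append((clf.score(X_oos[:, feats], y_oos), feats, clf))
    pool.sort(key=lambda t: t[0], reverse=True)

    def vote_acc(members):                           # majority vote (integer labels)
        preds = np.array([c.predict(X_val[:, f]) for _, f, c in members]).astype(int)
        maj = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
        return (maj == y_val).mean()

    ensemble = [pool[0]]                             # step 2: sequential growth
    best = vote_acc(ensemble)
    for cand in pool[1:]:
        if vote_acc(ensemble + [cand]) >= best:
            ensemble.append(cand)
            best = vote_acc(ensemble)
    return ensemble
```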

9.
Each clustering algorithm usually optimizes a qualification metric during its progress. The qualification metric in conventional clustering algorithms considers all features equally important; in other words, each feature participates in the clustering process equally. Some features clearly carry more information than others in a dataset, so some features should arguably receive lower importance degrees during clustering or classification, due to their lower information content or higher variance. It has therefore long been a goal in the artificial intelligence community to introduce a weighting mechanism into any task that uses a number of features identically to make a decision. The central problem is how features can participate in the clustering process in a weighted manner. Recently, this problem has been addressed by locally adaptive clustering (LAC). However, like its traditional competitors, LAC suffers from inefficiency on data with imbalanced clusters. This paper solves that problem by proposing a weighted locally adaptive clustering (WLAC) algorithm based on LAC. The WLAC algorithm, however, is sensitive to its two parameters, which must be tuned manually, and its performance depends on tuning them well. The paper proposes two solutions: the first is based on a simple clustering ensemble framework to address the sensitivity of WLAC to manual tuning, and the second is based on a cluster selection method.
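A sketch of LAC's core mechanism: each cluster l carries its own feature-weight vector, updated from the within-cluster dispersion along each feature via exponential weighting (w ∝ exp(−dispersion/h)). This shows the generic LAC flavor only; the paper's WLAC re-weighting for imbalanced clusters is not reproduced, and h and the initialization are assumptions.

```python
import numpy as np

def lac(X, k, h=1.0, n_iter=50, seed=0):
    """k clusters, each with its own feature-weight vector."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    n, d = X.shape
    centers = X[rng.choice(n, k, replace=False)].copy()
    weights = np.full((k, d), 1.0 / d)
    for _ in range(n_iter):
        # assign each point to the cluster with minimal weighted squared distance
        dist = np.stack([((X - c) ** 2 * w).sum(axis=1)
                         for c, w in zip(centers, weights)], axis=1)
        labels = dist.argmin(axis=1)
        for l in range(k):
            pts = X[labels == l]
            if len(pts) == 0:
                continue
            centers[l] = pts.mean(axis=0)
            disp = ((pts - centers[l]) ** 2).mean(axis=0)   # per-feature dispersion
            w = np.exp(-disp / h)
            weights[l] = w / w.sum()                        # normalize per cluster
    return labels, centers, weights
```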

10.
Feature selection problems characterized by a relatively small sample size and an extremely high-dimensional feature space are common in many areas of contemporary statistics. The high dimensionality of the feature space causes serious difficulties: (i) the sample correlations between features become high even if the features are stochastically independent; (ii) the computation becomes intractable. These difficulties make conventional approaches either inapplicable or inefficient. Reducing the dimensionality of the feature space and then applying low-dimensional approaches appears to be the only feasible way to tackle the problem. Along this line, we develop in this article a tournament screening cum EBIC approach for feature selection with a high-dimensional feature space. The procedure of tournament screening mimics that of a tournament. It is shown theoretically that tournament screening has the sure screening property, a necessary property that should be satisfied by any valid screening procedure. Numerical studies demonstrate that the tournament screening cum EBIC approach enjoys desirable properties, such as a higher positive selection rate and a lower false discovery rate than other approaches. Zehua Chen was supported by the Singapore Ministry of Education's ACRF Tier 1 (Grant No. R-155-000-065-112). Jiahua Chen was supported by the Natural Sciences and Engineering Research Council of Canada and MITACS, Canada.
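An illustrative sketch of the screening stage: features compete in groups, the within-group winners (scored here by absolute marginal correlation with the response) advance, and rounds repeat until at most `target` features remain; the survivors would then be searched with EBIC. Group size, the scoring statistic, and the winners-per-group rule are all assumptions.

```python
import numpy as np

def tournament_screen(X, y, target=50, g=10, seed=0):
    """Repeated group-wise competitions until at most `target` features survive."""
    rng = np.random.default_rng(seed)
    survivors = np.arange(X.shape[1])
    while len(survivors) > target:
        rng.shuffle(survivors)
        next_round = []
        for start in range(0, len(survivors), g):
            group = survivors[start:start + g]
            scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in group])
            # winners per group, proportional to the remaining reduction needed
            keep = max(1, len(group) * target // len(survivors))
            next_round.extend(group[np.argsort(scores)[-keep:]])
        survivors = np.array(next_round)
    return survivors
```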

11.
The performance of kernel-based methods, such as the support vector machine (SVM), is greatly affected by the choice of kernel function. Multiple kernel learning (MKL) is a promising family of machine learning algorithms that has attracted much attention in recent years. MKL combines multiple sub-kernels to seek better results than single kernel learning. To improve the efficiency of SVM and MKL, in this paper the Kullback–Leibler kernel function is derived to develop the SVM. The proposed method employs an improved ensemble learning framework, named KLMKB, which applies AdaBoost to learn multiple kernel-based classifiers. In the experiment on hyperspectral remote sensing image classification, we classify the satellite image using features selected through the Optimum Index Factor (OIF). We extensively examine the performance of our approach in comparison with relevant state-of-the-art algorithms on a number of benchmark classification data sets and a hyperspectral remote sensing image data set. Experimental results show that our method exhibits stable behavior and noticeable accuracy across different data sets.
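The Kullback–Leibler kernel between (discrete) distributions p and q is built from the symmetrized divergence, K(p, q) = exp(−a(KL(p‖q) + KL(q‖p))), a known construction for kernels on probability distributions; the paper's AdaBoost-over-kernels ensemble (KLMKB) is not reproduced here.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with clipping for numerical safety."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return np.sum(p * np.log(p / q))

def kl_kernel(p, q, a=1.0):
    """Symmetrized-KL kernel; a > 0 controls the decay rate."""
    return np.exp(-a * (kl(p, q) + kl(q, p)))
```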

12.
In this paper, an improved Feature Extraction Method (FEM), which selects discriminative feature sets able to yield high classification rates in pattern recognition tasks, is presented. The resulting features are the wavelet coefficients of an improved compressed signal consisting of the Zernike moment amplitudes. By applying a straightforward methodology, we aim to construct optimal feature vectors in the sense of vector dimensionality and information content for classification purposes. The resulting surrogate feature vector is of lower dimensionality than the original Zernike moment feature vector and thus more appropriate for pattern recognition tasks. Appropriate validation tests have been arranged to investigate the performance of the proposed algorithm by measuring the discriminative power of the new feature vectors despite the information loss.
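A sketch of the compression step under stated assumptions: the vector of Zernike moment amplitudes is wavelet-transformed (here with PyWavelets) and only the largest-magnitude coefficients are kept, giving a shorter surrogate feature vector. The wavelet choice and keep-ratio are illustrative, not the paper's settings.

```python
import numpy as np
import pywt

def compress_features(amplitudes, wavelet='db2', keep_ratio=0.25):
    """Wavelet-compress a 1-D feature vector by keeping the largest coefficients."""
    coeffs = pywt.wavedec(np.asarray(amplitudes, float), wavelet)
    flat = np.concatenate(coeffs)
    k = max(1, int(keep_ratio * flat.size))
    thresh = np.sort(np.abs(flat))[-k]        # magnitude threshold for the top-k
    return flat[np.abs(flat) >= thresh]       # shorter surrogate feature vector
```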

13.
High-dimensional data are increasingly used in many real-world applications. Such data are obtained from different feature extractors that represent distinct perspectives of the data, and classifying them efficiently is a challenge. Despite the existence of millions of unlabeled data samples, it is believed that labeling a handful of data, as in the semisupervised scheme, will remarkably improve performance. However, the performance of semisupervised data classification relies heavily on the proposed models and the related numerical methods. Following the extension of the Mumford–Shah–Potts-type model in the spatially continuous setting, we propose efficient data classification algorithms based on the alternating direction method of multipliers and the primal-dual method to deal with the nonsmooth terms in the proposed model. The convergence of the proposed data classification algorithms is established under the framework of variational inequalities. Balanced and unbalanced classification problems are tested, demonstrating the efficiency of the proposed algorithms.
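For orientation, a generic convex relaxation of a Potts-type labeling energy illustrates the kind of nonsmooth problem such ADMM and primal-dual solvers target; this is a sketch, not the paper's exact semisupervised model.

```latex
% u_l is the relaxed indicator function of class l, f_l a pointwise
% data-fidelity term, and the simplex constraint couples the classes;
% the total-variation term ||\nabla u_l||_1 is the nonsmooth part.
\min_{u}\; \sum_{l=1}^{L} \Big( \langle u_l,\, f_l \rangle
          + \alpha \,\|\nabla u_l\|_{1} \Big)
\quad\text{s.t.}\quad u_l \ge 0,\quad \sum_{l=1}^{L} u_l = 1.
```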

14.
In developing a classification model for assigning observations of unknown class to one of a number of specified classes, using the values of a set of features associated with each observation, it is often desirable to base the classifier on a limited number of features. Mathematical programming discriminant analysis methods for developing classification models can be extended for feature selection. Classification accuracy can be used as the feature selection criterion via a mixed integer programming (MIP) model in which a binary variable is associated with each training sample observation, but these binary variable requirements limit the size of problems to which the approach can be applied. This paper develops heuristic feature selection methods for problems with large numbers of observations. These heuristic procedures, based on the MIP model for maximizing classification accuracy, are then applied to three credit scoring data sets.
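A standard accuracy-maximizing MIP of the type these heuristics build on can be written with one binary misclassification indicator per training observation; this is the generic big-M formulation, and ε and M are user-chosen constants that may differ from the paper's model.

```latex
% b_i = 1 flags a misclassified observation; w, c define a linear discriminant.
\begin{aligned}
\min_{w,\,c,\,b}\quad & \sum_{i=1}^{n} b_i \\
\text{s.t.}\quad & w^\top x_i - c \ge \varepsilon - M b_i,  && i \in \text{class 1},\\
                 & w^\top x_i - c \le -\varepsilon + M b_i, && i \in \text{class 2},\\
                 & b_i \in \{0, 1\}.
\end{aligned}
```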

15.
A Gaussian kernel approximation algorithm for a feedforward neural network is presented. The approach, based on constructive learning, is to create the hidden units directly so that the architecture of the network can be designed automatically. The algorithm is defined using the linear summation of input patterns and their randomized input weights. Hidden-layer nodes are defined so as to partition the input space into homogeneous regions, where each region contains patterns belonging to the same class; the largest region is used to define the center of the corresponding Gaussian hidden node. The algorithm is tested on three benchmark data sets of differing dimensionality and sample size to compare the presented approach with other algorithms. Real medical diagnoses and a biological classification of mushrooms are used to illustrate the performance of the algorithm, and the results confirm its effectiveness.

16.
17.
The aim of this article is to develop a supervised dimension-reduction framework, called spatially weighted principal component analysis (SWPCA), for high-dimensional imaging classification. Two main challenges in imaging classification are the high dimensionality of the feature space and the complex spatial structure of imaging data. In SWPCA, we introduce two sets of novel weights, including global and local spatial weights, which enable a selective treatment of individual features and incorporation of the spatial structure of imaging data and class label information. We develop an efficient two-stage iterative SWPCA algorithm and its penalized version along with the associated weight determination. We use both simulation studies and real data analysis to evaluate the finite-sample performance of our SWPCA. The results show that SWPCA outperforms several competing principal component analysis (PCA) methods, such as supervised PCA (SPCA), and other competing methods, such as sparse discriminant analysis (SDA).
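A minimal sketch of the underlying idea of supervision-weighted PCA: each feature (e.g. voxel) is rescaled by a class-driven weight before ordinary PCA, so discriminative features dominate the leading components. The specific weight used here (a between/within-class variance ratio) and the omission of the local spatial-smoothing weights are simplifying assumptions, not the SWPCA algorithm itself.

```python
import numpy as np

def weighted_pca(X, y, n_components=5, eps=1e-12):
    """PCA after per-feature rescaling by a class-driven weight (global weights only)."""
    X = np.asarray(X, float)
    Xc = X - X.mean(axis=0)
    classes = np.unique(y)
    within = sum((y == c).mean() * X[y == c].var(axis=0) for c in classes)
    between = sum((y == c).mean() * (X[y == c].mean(axis=0) - X.mean(axis=0)) ** 2
                  for c in classes)
    w = between / (within + eps)           # discriminative features get large weights
    Xw = Xc * w
    _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
    components = Vt[:n_components]
    return Xw @ components.T, components, w
```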

18.
In this paper, we propose a new random forest (RF) algorithm for classifying high-dimensional data, using a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building the decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time of building the RF model. Extensive experiments have been conducted on high-dimensional real data sets, including standard machine learning data sets and image data sets. The results demonstrate that the proposed approach to learning RFs significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data.
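A generic sketch of the subspace-sampling mechanism: each tree is grown on a bootstrap sample restricted to its own random feature subspace. This shows the mechanism only; the paper's specific sampling scheme and greedy categorical-split search are not reproduced, and integer class labels are assumed for the vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def subspace_forest(X, y, n_trees=100, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = m or max(1, int(np.sqrt(p)))
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(p, size=m, replace=False)   # per-tree random subspace
        boot = rng.integers(0, n, size=n)              # bootstrap sample indices
        tree = DecisionTreeClassifier().fit(X[boot][:, feats], y[boot])
        forest.append((feats, tree))
    return forest

def forest_predict(forest, X):
    """Majority vote over the trees, each seeing only its own subspace."""
    votes = np.array([t.predict(X[:, f]) for f, t in forest]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```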

19.
Building on the Bagging algorithm, the Mahalanobis–Taguchi System (MTS) is applied for feature selection, creating a double perturbation that improves the classification performance of neural network ensembles. Experiments show that the double perturbation increases both the accuracy and the diversity of the individual networks in the ensemble, and that the classification performance of the MTS-Bagging algorithm is clearly better than that of Bagging.

20.
Nonlinear manifold learning algorithms, such as diffusion maps, have been fruitfully applied in recent years to the analysis of large and complex data sets. However, such algorithms still encounter challenges when faced with real data. One such challenge is the existence of "repeated eigendirections," which obscures the detection of the true dimensionality of the underlying manifold and arises when several embedding coordinates parametrize the same direction in the intrinsic geometry of the data set. We propose an algorithm, based on local linear regression, to automatically detect coordinates corresponding to repeated eigendirections. We construct a more parsimonious embedding using only the eigenvectors corresponding to unique eigendirections, and we show that this reduced diffusion maps embedding induces a metric which is equivalent to the standard diffusion distance. We first demonstrate the utility and flexibility of our approach on synthetic data sets. We then apply our algorithm to data collected from a stochastic model of cellular chemotaxis, where our approach for factoring out repeated eigendirections allows us to detect changes in dynamical behavior and the underlying intrinsic system dimensionality directly from data.
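A sketch in the spirit of the local-linear-regression test described: each candidate embedding coordinate φ_k is predicted from the preceding coordinates with kernel-weighted, leave-one-out local linear fits; a small normalized residual suggests φ_k parametrizes an already-seen eigendirection and can be dropped. The bandwidth and the residual threshold a user would apply are assumptions.

```python
import numpy as np

def local_linear_residual(Phi_prev, phi_k, eps):
    """Normalized leave-one-out residual of locally regressing phi_k on Phi_prev.

    Phi_prev: (n, k-1) earlier embedding coordinates; phi_k: (n,) candidate
    coordinate; eps: Gaussian kernel bandwidth.
    """
    n = len(phi_k)
    preds = np.empty(n)
    for i in range(n):
        d2 = ((Phi_prev - Phi_prev[i]) ** 2).sum(axis=1)
        w = np.exp(-d2 / eps)
        w[i] = 0.0                                    # leave the point itself out
        A = np.hstack([np.ones((n, 1)), Phi_prev - Phi_prev[i]])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * phi_k, rcond=None)
        preds[i] = beta[0]                            # local fit evaluated at point i
    return np.linalg.norm(phi_k - preds) / np.linalg.norm(phi_k)
```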
