Similar Articles
20 similar articles found (search time: 31 ms)
1.
The main challenge in working with gene expression microarrays is that the sample size is small compared to the large number of variables (genes). In many studies, the main focus is on finding a small subset of genes that are most important for differentiating between types of cancer, enabling simpler and cheaper diagnostic arrays. In this paper, a sparse Bayesian variable selection method in the probit model is proposed for gene selection and classification. We assign a sparse prior to the regression parameters and perform variable selection by indexing the covariates of the model with a binary vector. The correlation prior assigned to the binary vector in this paper is able to distinguish models of the same size. The performance of the proposed method is demonstrated on one simulated data set and two well-known real data sets, and the results show that our method is comparable with existing methods in variable selection and classification.

2.
Classification of Two-Population Gene Expression Data Based on Bayesian Statistical Methods
In disease diagnosis, accurate classification of the disease is a crucial step for improving diagnostic accuracy and cure rates. DNA microarray technology gives us access, at the microscopic level, to gene-function information closely related to disease classification and diagnosis. However, the gene expression data produced by DNA microarrays have many variables and few samples, which makes the classification process highly unstable. We therefore first screen out the genes whose expression patterns change significantly, using them as a feature gene set to reduce the number of variables, and then build a classifier on this feature set to classify the samples. In this paper, we use a likelihood ratio test to screen feature genes, build a statistical classification model based on Bayesian methods, and compute the posterior probability of sample class membership by Markov chain Monte Carlo (MCMC) sampling. Finally, we apply the model to two real DNA microarray data sets and successfully classify the samples.

3.
Advances in computational biology have made simultaneous monitoring of thousands of features possible. High throughput technologies not only bring a much richer information context in which to study various aspects of gene function, but also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. We address classification in this setting by extending partial least squares (PLS), a popular dimension reduction tool in chemometrics, to generalized linear regression, building on an earlier approach, iteratively reweighted partial least squares (IRWPLS). We compare our results with two-stage PLS and with other classifiers, and show that by phrasing the problem in a generalized linear model setting and applying Firth's procedure to avoid (quasi)separation, we often obtain lower classification error rates.

4.
Classification of Gene Expression Data Using GLRT and LS_SVM
To classify microarray gene expression data quickly and accurately, a new classification model for microarray expression data is proposed. The model first uses a generalized likelihood ratio test (GLRT) to identify genes with significantly differential expression. These genes are then used to train a least squares support vector machine (LS_SVM), yielding a GLRT + LS_SVM classification model for microarray expression data. The model has clear advantages when handling microarray data, which are typically large in volume, high-dimensional, small-sample, and nonlinear, and it can be widely applied to such data.
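The LS-SVM stage can be sketched as follows: with targets in {-1, +1}, training reduces to solving one linear system instead of a QP. This is a generic illustration on synthetic 2-D data (the GLRT screening step is omitted), with hypothetical hyperparameters gamma and sigma:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    """LS-SVM in its function-estimation form with +/-1 targets:
    solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]  # bias b, dual coefficients alpha

# Two synthetic Gaussian classes standing in for screened gene profiles.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (40, 2)), rng.normal(1, 0.5, (40, 2))])
y = np.array([-1.0] * 40 + [1.0] * 40)
b, alpha = lssvm_train(X, y)
pred = np.sign(rbf_kernel(X, X) @ alpha + b)
acc = np.mean(pred == y)
print(round(acc, 2))
```

The I/gamma ridge term plays the role of the SVM slack penalty; larger gamma fits the training data more tightly.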

5.
Feature selection (FS) is an important pre-processing step in data mining and classification tasks. The aim of FS is to select a small subset of the most important and discriminative features. Traditional feature selection methods assume that the entire input feature set is available from the beginning; however, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows over time as new features stream in, so a critical challenge for online streaming feature selection (OSFS) is the unavailability of the entire feature set before learning starts. Several efforts have been made to address the OSFS problem; however, they all need some prior knowledge about the entire feature space to select informative features. In this paper, the OSFS problem is considered from the rough sets (RS) perspective and a new OSFS algorithm, called OS-NRRSAR-SA, is proposed. The main motivation for this choice is that RS-based data mining requires no domain knowledge other than the given dataset. The proposed algorithm uses classical significance analysis concepts from RS theory to control the unknown feature space in OSFS problems. The algorithm is evaluated extensively on several high-dimensional datasets in terms of compactness, classification accuracy, run-time, and robustness against noise. Experimental results demonstrate that it outperforms existing OSFS algorithms in all these respects.
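The "significance analysis" that the algorithm borrows from RS theory rests on the dependency degree gamma(B) = |POS_B(D)| / |U|; a feature's significance is the drop in gamma when it is removed. A minimal sketch on a small hypothetical decision table (not the OS-NRRSAR-SA algorithm itself):

```python
from collections import defaultdict

def dependency(table, features, decision):
    """Rough-set dependency degree gamma(B) = |POS_B(D)| / |U|.

    An object lies in the positive region if every object sharing its
    values on `features` has the same decision value."""
    blocks = defaultdict(list)
    for row in table:
        blocks[tuple(row[f] for f in features)].append(row[decision])
    pos = sum(len(v) for v in blocks.values() if len(set(v)) == 1)
    return pos / len(table)

def significance(table, features, decision, f):
    """Drop in dependency when feature f is removed from the set."""
    rest = [g for g in features if g != f]
    return dependency(table, features, decision) - dependency(table, rest, decision)

# Toy decision table: columns 0-2 are condition features, 3 is the decision.
table = [
    (0, 0, 1, 0),
    (0, 1, 1, 1),
    (1, 0, 0, 1),
    (1, 1, 0, 1),
    (0, 0, 0, 0),
]
print(dependency(table, [0, 1, 2], 3))       # full feature set: 1.0
print(significance(table, [0, 1, 2], 3, 1))  # removing feature 1 costs 0.4
```

In a streaming setting, a newly arrived feature can be kept or discarded by testing whether it raises the dependency of the currently selected set, which is why no knowledge of the full feature space is needed.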

6.
High-dimensional data are increasingly common in many real-world applications. Such data are obtained from different feature extractors that represent distinct perspectives of the data, and classifying them efficiently is a challenge. Even when millions of unlabeled samples are available, labeling a handful of them, as in the semisupervised scheme, is believed to improve classification performance remarkably. However, the performance of semisupervised data classification depends heavily on the proposed models and the associated numerical methods. Extending the Mumford–Shah–Potts-type model in the spatially continuous setting, we propose efficient data classification algorithms based on the alternating direction method of multipliers and the primal-dual method to handle the nonsmooth term in the proposed model. The convergence of the proposed data classification algorithms is established within the framework of variational inequalities. Tests on balanced and unbalanced classification problems demonstrate the efficiency of the proposed algorithms.

7.
The study of the genetic properties of a disease requires collecting information on the subjects in a set of pedigrees, with the main focus on detecting susceptibility genes. However, even with large pedigrees, the heterogeneity of phenotypes in complex diseases such as schizophrenia, bipolar disorder, and autism makes susceptibility genes difficult to detect. This is mainly due to genetic heterogeneity: many genes are involved in the disease. To reduce this heterogeneity, our idea is to sub-type the disease and partition the population into more homogeneous sub-groups. We developed a probabilistic model based on latent class analysis (LCA) that accounts for familial dependence inside a pedigree, even for large pedigrees, and handles individuals with missing and partially missing measurements. Model parameters are estimated by an EM algorithm, and the E-step computations inside a pedigree are carried out with a pedigree peeling algorithm. When more than one model is fitted, we use model selection strategies such as cross-validation and/or BIC to choose the most suitable model among a set of candidates. Moreover, in a simulation based on a genetic disease class model, we show that our model leads to better individual classification than a model that assumes independence among subjects. We also apply the model to a schizophrenia–bipolar pedigree data set from Eastern Quebec.
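To illustrate the latent class idea only (the core EM on binary items, ignoring the pedigree dependence, peeling, and missing-data machinery the full model handles), here is a two-class Bernoulli LCA fitted by EM on simulated sub-typed subjects; all data and settings are hypothetical:

```python
import numpy as np

def lca_em(X, n_classes=2, n_iter=200, seed=0):
    """EM for a latent class model with binary items, assuming
    independence between subjects (no familial dependence here)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)       # class weights
    theta = rng.uniform(0.3, 0.7, (n_classes, d))  # item probabilities
    for _ in range(n_iter):
        # E step: posterior class membership for each subject.
        log_lik = (X[:, None, :] * np.log(theta) +
                   (1 - X[:, None, :]) * np.log(1 - theta)).sum(-1)
        log_post = np.log(pi) + log_lik
        log_post -= log_post.max(1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(1, keepdims=True)
        # M step: update weights and (smoothed) item probabilities.
        pi = post.mean(0)
        theta = (post.T @ X + 1e-3) / (post.sum(0)[:, None] + 2e-3)
    return pi, theta, post

# Two simulated sub-types with different symptom profiles.
rng = np.random.default_rng(1)
z = rng.random(300) < 0.5
p_true = np.where(z[:, None], 0.9, 0.1)
X = (rng.random((300, 6)) < p_true).astype(float)
pi, theta, post = lca_em(X)
labels = post.argmax(1)
acc = max(np.mean(labels == z), np.mean(labels != z))  # label switching
print(round(acc, 2))
```

The max over the two label assignments handles the usual label-switching ambiguity of mixture models; in the paper's setting the E step would additionally sum over pedigree structure via peeling.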

8.
Data classification is an important area of data mining, and several well-known techniques, such as decision trees and neural networks, are available for the task. In this paper we propose a Kalman particle swarm optimized (KPSO) polynomial equation for classifying several well-known data sets. The method builds on findings from our earlier work on data classification with polynomial neural networks, such as the number of terms, the number and combination of features in each term, and the degree of the polynomial equation. KPSO optimizes these polynomial equations with faster convergence than standard PSO, and the polynomial equation with the best performance is taken as the classification model. Our simulation results show that the proposed approach achieves classification accuracy competitive with PNN on many data sets.

9.
This paper proposes fuzzy symbolic modeling as a framework for intelligent data analysis and model interpretation in classification and regression problems. The fuzzy symbolic modeling approach is based on the eigenstructure analysis of the data similarity matrix to define the number of fuzzy rules in the model. Each fuzzy rule is associated with a symbol and is defined by a Gaussian membership function. The prototypes for the rules are computed by a clustering algorithm, and the model output parameters are computed as the solutions of a bounded quadratic optimization problem. In classification problems, the rules’ parameters are interpreted as the rules’ confidence. In regression problems, the rules’ parameters are used to derive rules’ confidences for classes that represent ranges of output variable values. The resulting model is evaluated based on a set of benchmark datasets for classification and regression problems. Nonparametric statistical tests were performed on the benchmark results, showing that the proposed approach produces compact fuzzy models with accuracy comparable to models produced by the standard modeling approaches. The resulting model is also exploited from the interpretability point of view, showing how the rule weights provide additional information to help in data and model understanding, such that it can be used as a decision support tool for the prediction of new data.

10.
The mine ventilation system is the most important technical measure for ensuring safe production in mines. The structural complexity of a mine ventilation network directly affects the safety and reliability of the underground mining system, and quantifying that complexity contributes to a deeper understanding of the network's essential characteristics. So far, however, no model exists that can simply, practically, reasonably, and quantitatively determine or compare the structural complexity of different ventilation networks. In this article, by analyzing typical parameters of a mine ventilation network, we conclude that there is a linear functional relationship among five key parameters: the number of branches, the number of nodes, the number of independent circuits, the number of independent paths, and the number of diagonal branches. Correlation analyses of the main parameters of ventilation networks are conducted in SPSS. Based on these findings, a new evaluation model for the structural complexity of a ventilation network (denoted C) is proposed. By combining the SPSS classification results with the characteristics of mine ventilation networks, standards for classifying the complexity of mine ventilation systems are put forward. Using the developed model, we analyze and compare the structural complexity of ventilation networks for typical mines; the case studies show that the classification results correspond to the actual situations. © 2014 Wiley Periodicals, Inc. Complexity 21: 21–34, 2015
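Three of the five counts are tied together by Euler's formula for a connected graph: the number of independent circuits equals branches − nodes + 1 (independent paths and diagonal branches require further network analysis). A small sketch on a hypothetical diagonal network, not any network from the article:

```python
def network_parameters(edges):
    """Basic structural counts for a connected ventilation network
    viewed as a graph: branches b, nodes n, and independent circuits
    m = b - n + 1 (Euler's formula)."""
    nodes = {u for e in edges for u in e}
    b, n = len(edges), len(nodes)
    return b, n, b - n + 1

# A small diagonal network: 4 junctions, 5 branches (the classic H-shape).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
b, n, m = network_parameters(edges)
print(b, n, m)  # 5 branches, 4 nodes, 2 independent circuits
```

Branch (2, 3) here is the diagonal branch: it belongs to both independent circuits, which is what makes diagonal networks harder to analyze than simple series-parallel ones.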

11.
For the problem of classifying continuous data streams, we propose an online logistic regression algorithm based on online learning theory. We study online logistic regression with a regularization term, propose an online logistic-l2 regression model, and give theoretical bound estimates. Experimental results show that as the number of online iterations increases, the proposed model and algorithm attain the classification performance of offline prediction. This work provides a new and effective method for classifying massive streaming data.
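A minimal sketch of the online logistic-l2 idea, assuming a standard stochastic gradient update with a decaying step size (the paper's exact algorithm and theoretical bounds are not reproduced here); the synthetic stream is hypothetical:

```python
import numpy as np

def online_logistic_l2(stream, dim, lam=0.01, eta0=1.0):
    """Online logistic regression with an l2 penalty.

    Processes one (x, y) pair at a time (y in {0, 1}); the step size
    decays as eta0 / sqrt(t), a common choice in online learning."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / np.sqrt(t)
        p = 1.0 / (1.0 + np.exp(-w @ x))  # predicted probability
        grad = (p - y) * x + lam * w      # logistic loss + l2 penalty
        w -= eta * grad
    return w

# Two well-separated Gaussian clouds, shuffled into a synthetic "stream".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
order = rng.permutation(1000)
w = online_logistic_l2(zip(X[order], y[order]), dim=2)
acc = np.mean((X @ w > 0).astype(int) == y)
print(round(acc, 2))
```

Each sample is touched exactly once, so memory stays constant no matter how long the stream runs, which is the point of treating stream classification as online learning.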

12.
In developing a classification model that assigns observations of unknown class to one of a number of specified classes based on a set of features associated with each observation, it is often desirable to base the classifier on a limited number of features. Mathematical programming discriminant analysis methods for developing classification models can be extended for feature selection. Classification accuracy can be used as the feature selection criterion via a mixed integer programming (MIP) model in which a binary variable is associated with each training sample observation, but the binary variable requirements limit the size of problems to which this approach can be applied. This paper develops heuristic feature selection methods for problems with large numbers of observations. These heuristic procedures, which are based on the MIP model for maximizing classification accuracy, are then applied to three credit scoring data sets.
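In the spirit of the heuristics described (though not the paper's actual procedures), accuracy-driven feature selection can be approximated greedily; the nearest-class-mean classifier below is a cheap hypothetical stand-in for solving the MIP at each step:

```python
import numpy as np

def accuracy(X, y, cols):
    """Training accuracy of a nearest-class-mean classifier on the
    chosen columns (a cheap proxy for the exact MIP objective)."""
    Z = X[:, cols]
    m0, m1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
    pred = np.linalg.norm(Z - m1, axis=1) < np.linalg.norm(Z - m0, axis=1)
    return np.mean(pred.astype(int) == y)

def forward_select(X, y, max_feats):
    """Heuristic forward selection: greedily add the feature that most
    improves training classification accuracy; stop when none helps."""
    chosen, best = [], 0.0
    while len(chosen) < max_feats:
        cand = [(accuracy(X, y, chosen + [j]), j)
                for j in range(X.shape[1]) if j not in chosen]
        score, j = max(cand)
        if score <= best:
            break
        chosen.append(j)
        best = score
    return chosen, best

# Synthetic "credit scoring" data: only feature 0 is informative.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
y = (rng.random(100) < 0.5).astype(int)
X[y == 1, 0] += 3.0
chosen, best = forward_select(X, y, max_feats=3)
print(chosen[0], round(best, 2))
```

The greedy loop evaluates O(p) candidate subsets per step instead of the exponentially many subsets the exact binary-variable formulation implicitly searches.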

13.
The curse of high dimensionality arises ever more frequently in statistics, and many techniques have been developed to address it in classification problems. We propose a novel feature screening procedure for dichotomous response data. The new method can be implemented as easily as the t-test marginal screening approach, yet it is free of subexponential tail probability conditions and moment requirements and is not restricted to a specific model structure. We prove that our method possesses the sure screening property, illustrate the effect of screening by Monte Carlo simulation, and apply it to a real data example.
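For reference, the baseline t-test marginal screening that the proposed method matches in ease of implementation looks like this (synthetic data; the paper's own screening statistic is not shown):

```python
import numpy as np
from scipy import stats

def t_screen(X, y, top_k):
    """Marginal two-sample t-test screening for a binary response.

    Ranks each feature by the absolute t statistic between the two
    classes and keeps the indices of the top_k features."""
    t, _ = stats.ttest_ind(X[y == 1], X[y == 0], axis=0)
    return np.argsort(-np.abs(t))[:top_k]

# 200 noise features plus 2 truly informative ones (indices 0 and 1).
rng = np.random.default_rng(1)
n, p = 100, 202
X = rng.normal(size=(n, p))
y = (rng.random(n) < 0.5).astype(int)
X[y == 1, 0] += 2.0  # mean shifts on the informative features
X[y == 1, 1] -= 2.0
selected = t_screen(X, y, top_k=2)
print(sorted(selected.tolist()))
```

Screening of this kind only ranks features marginally; the paper's contribution is achieving the same sure screening guarantee without the moment and tail conditions this t-based version implicitly relies on.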

14.
Satisfying the global throughput targets of scientific applications is an important challenge in high performance computing (HPC) systems. The main difficulty lies in the large number of parameters with a significant impact on overall system performance, including the number of storage servers, the features of the communication links, and the number of CPU cores per node, among many others. In this paper we present a model that computes a performance/cost ratio for different hardware configurations, focusing on scientific computing. The main goal of this approach is to balance the trade-off between cost and performance across different combinations of components for building the entire system. Its main advantage is that the different configurations are evaluated in a simulation platform, so no investment is necessary until the alternatives have been computed and the best solutions suggested. To achieve this, both the system architecture and Map-Reduce applications are modeled. The proposed model has been evaluated by building complex systems in a simulated environment using the SIMCAN simulation platform.

15.
Rough set theory is a new mathematical tool for intelligent data analysis and data mining; its main advantage is that it requires no prior or additional knowledge about the data being processed. We propose an intelligent data analysis model based on rough set theory: starting from a target data set, it performs data preprocessing, data classification, and rule acquisition to analyze the original data intelligently. The model's effectiveness is verified on test examples.

16.
17.
Artificial neural networks have been shown to perform well on two-group classification problems. However, research has yet to establish a method for identifying the relevant input variables of a neural network model in real-world classification problems. The common practice in neural network research is to include all available input variables that could possibly contribute to the model, without determining whether they help in estimating the unknown function. One drawback of this practice is the inability to extract knowledge useful to researchers by identifying the input variables that actually contribute to estimating the true underlying function of the data. The Neural Network Simultaneous Optimization Algorithm (NNSOA), proposed in past research, was shown to be successful on a limited number of continuous problems. This research applies the NNSOA to a real-world classification problem and shows that it not only finds good solutions for estimating the unknown function but also correctly identifies the variables that contribute to the model.

18.
Feature selection plays an important role in the successful application of machine learning techniques to large real-world datasets. Avoiding model overfitting, especially when the number of features far exceeds the number of observations, requires selecting informative features and/or eliminating irrelevant ones, and searching for an optimal subset of features can be computationally expensive. Functional magnetic resonance imaging (fMRI) produces datasets with exactly these characteristics, creating challenges for applying machine learning techniques to classify cognitive states from fMRI data. In this study, we present an embedded feature selection framework that integrates sparse regularization and classification. The optimization attempts to maximize training accuracy while simultaneously enforcing sparsity by penalizing the objective function for the coefficients of the features. This allows many coefficients to become zero, which effectively eliminates their corresponding features from the classification model. To demonstrate the utility of the approach, we apply our framework to three different real-world fMRI datasets. The results show that regularized classifiers yield better classification accuracy, especially when the number of initial features is large, and that sparse regularization is key to achieving scientifically relevant generalizability and functional localization of classifier features. The approach is thus highly suited to the analysis of fMRI data.
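A minimal sketch of sparse regularization as embedded feature selection, using l1-penalized logistic regression as a stand-in for the paper's optimization framework (the data and the penalty strength C are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "many features, few samples" data: only the first 5 of 300
# features carry signal, mimicking the fMRI setting at a small scale.
rng = np.random.default_rng(0)
n, p = 80, 300
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 3.0
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# The l1 penalty drives most coefficients exactly to zero, removing
# the corresponding features from the fitted classifier.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])
train_acc = clf.score(X, y)
print(kept.size, round(train_acc, 2))
```

The surviving nonzero coefficients are what give the "functional localization" the abstract refers to: selection and classification happen in one optimization rather than in separate stages.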

19.
Classification is a core data mining task that predicts the class label of new input data from a set of pre-classified samples. Multiple criteria linear programming (MCLP) is a classification method in data mining that separates two or more classes by finding a discriminating hyperplane. Although MCLP performs well on linearly separable data, it is no longer applicable to nonlinearly separable problems. A kernel-based multiple criteria linear programming (KMCLP) model was developed to solve such problems: a kernel function projects the data into a higher-dimensional space in which they are more likely to be linearly separable, and KMCLP performs well in some real applications. However, like other prevalent data mining classifiers, MCLP and KMCLP learn only from training examples. In traditional machine learning there are also classification tasks in which data are classified by prior knowledge alone, i.e. expert systems, and some works combine the two classification principles to overcome the shortcomings of each. In this paper, we present our recent work combining prior knowledge with the MCLP or KMCLP model to handle input that consists of both training examples and prior knowledge. Specifically, how to incorporate linear and nonlinear knowledge in the MCLP and KMCLP models is the main concern of this paper. Numerical tests on these models indicate that they are effective in classifying data with prior knowledge.

20.
In this paper, we use neural networks to classify schizophrenia patients and healthy control subjects. Starting from a 4005-dimensional feature space of functional connectivities for 63 schizophrenia patients and 57 healthy controls, we try different dimensionality reduction methods and different neural network models to find the optimal classifier. The results show that the best performance is obtained by using the Mann-Whitney U test to select the most discriminative features as input and an Elman neural network for classification, reaching a highest accuracy of 94.17%, with a sensitivity of 92.06% and a specificity of 96.49%. For the best neural network model, we identified 34 consensus functional connectivities with high discriminative power, involving 26 brain regions; the thalamus accounts for the largest number of functional connectivity edges, followed by the cingulate gyrus and the frontal region.
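The feature-ranking step can be sketched with SciPy's Mann-Whitney U test; the data below are synthetic stand-ins for connectivity features, not the study's data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def mwu_select(X, y, top_k):
    """Rank features by the Mann-Whitney U test p-value between the
    two groups and keep the top_k most discriminative ones."""
    pvals = np.array([mannwhitneyu(X[y == 1, j], X[y == 0, j]).pvalue
                      for j in range(X.shape[1])])
    return np.argsort(pvals)[:top_k]

# 50 synthetic "connectivity" features; only feature 7 differs by group.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 50))
y = np.array([0] * 60 + [1] * 60)
X[y == 1, 7] += 1.5
selected = mwu_select(X, y, top_k=5)
print(7 in selected)
```

Being rank-based, the U test makes no normality assumption on the connectivity values, which is a common reason to prefer it over a t-test for this kind of screening.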
