首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Gene expression data are characterized by thousands even tens of thousands of measured genes on only a few tissue samples. This can lead either to possible overfitting and dimensional curse or even to a complete failure in analysis of microarray data. Gene selection is an important component for gene expression-based tumor classification systems. In this paper, we develop a hybrid particle swarm optimization (PSO) and tabu search (HPSOTS) approach for gene selection for tumor classification. The incorporation of tabu search (TS) as a local improvement procedure enables the algorithm HPSOTS to overleap local optima and show satisfactory performance. The proposed approach is applied to three different microarray data sets. Moreover, we compare the performance of HPSOTS on these datasets to that of stepwise selection, the pure TS and PSO algorithm. It has been demonstrated that the HPSOTS is a useful tool for gene selection and mining high dimension data.  相似文献   

2.
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, nave Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.  相似文献   

3.
Du W  Gu T  Tang LJ  Jiang JH  Wu HL  Shen GL  Yu RQ 《Talanta》2011,85(3):1689-1694
As a greedy search algorithm, classification and regression tree (CART) is easily relapsing into overfitting while modeling microarray gene expression data. A straightforward solution is to filter irrelevant genes via identifying significant ones. Considering some significant genes with multi-modal expression patterns exhibiting systematic difference in within-class samples are difficult to be identified by existing methods, a strategy that unimodal transform of variables selected by interval segmentation purity (UTISP) for CART modeling is proposed. First, significant genes exhibiting varied expression patterns can be properly identified by a variable selection method based on interval segmentation purity. Then, unimodal transform is implemented to offer unimodal featured variables for CART modeling via feature extraction. Because significant genes with complex expression patterns can be properly identified and unimodal feature extracted in advance, this developed strategy potentially improves the performance of CART in combating overfitting or underfitting while modeling microarray data. The developed strategy is demonstrated using two microarray data sets. The results reveal that UTISP-based CART provides superior performance to k-nearest neighbors or CARTs coupled with other gene identifying strategies, indicating UTISP-based CART holds great promise for microarray data analysis.  相似文献   

4.
Li-Juan Tang  Hai-Long Wu 《Talanta》2009,79(2):260-1694
One problem with discriminant analysis of microarray data is representation of each sample by a large number of genes that are possibly irrelevant, insignificant or redundant. Methods of variable selection are, therefore, of great significance in microarray data analysis. To circumvent the problem, a new gene mining approach is proposed based on the similarity between probability density functions on each gene for the class of interest with respect to the others. This method allows the ascertainment of significant genes that are informative for discriminating each individual class rather than maximizing the separability of all classes. Then one can select genes containing important information about the particular subtypes of diseases. Based on the mined significant genes for individual classes, a support vector machine with local kernel transform is constructed for the classification of different diseases. The combination of the gene mining approach with support vector machine is demonstrated for cancer classification using two public data sets. The results reveal that significant genes are identified for each cancer, and the classification model shows satisfactory performance in training and prediction for both data sets.  相似文献   

5.
Naturally inspired evolutionary algorithms prove effectiveness when used for solving feature selection and classification problems. Artificial Bee Colony (ABC) is a relatively new swarm intelligence method. In this paper, we propose a new hybrid gene selection method, namely Genetic Bee Colony (GBC) algorithm. The proposed algorithm combines the used of a Genetic Algorithm (GA) along with Artificial Bee Colony (ABC) algorithm. The goal is to integrate the advantages of both algorithms. The proposed algorithm is applied to a microarray gene expression profile in order to select the most predictive and informative genes for cancer classification. In order to test the accuracy performance of the proposed algorithm, extensive experiments were conducted. Three binary microarray datasets are use, which include: colon, leukemia, and lung. In addition, another three multi-class microarray datasets are used, which are: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm are compared with our recently proposed technique: mRMR when combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared the combination of mRMR with GA (mRMR-GA) and Particle Swarm Optimization (mRMR-PSO) algorithms. In addition, we compared the GBC algorithm with other related algorithms that have been recently published in the literature, using all benchmark datasets. The GBC algorithm shows superior performance as it achieved the highest classification accuracy along with the lowest average number of selected genes. This proves that the GBC algorithm is a promising approach for solving the gene selection problem in both binary and multi-class cancer classification.  相似文献   

6.
High-throughput DNA microarray provides an effective approach to the monitoring of expression levels of thousands of genes in a sample simultaneously. One promising application of this technology is the molecular diagnostics of cancer, e.g. to distinguish normal tissue from tumor or to classify tumors into different types or subtypes. One problem arising from the use of microarray data is how to analyze the high-dimensional gene expression data, typically with thousands of variables (genes) and much fewer observations (samples). There is a need to develop reliable classification methods to make full use of microarray data and to evaluate accurately the predictive ability and reliability of such derived models. In this paper, discriminant partial least squares was used to classify the different types of human tumors using four microarray datasets and showed good prediction performance. Four different cross-validation procedures (leave-one-out versus leave-half-out; incomplete versus full) were used to evaluate the classification model. Our results indicate that discriminant partial least squares using leave-half-out cross-validation provides a more realistic estimate of the predictive ability of a classification model, which may be overestimated by some of the cross-validation procedures, and the information obtained from different cross-validation procedures can be used to evaluate the reliability of the classification model.  相似文献   

7.
Grouped gene selection is the most important task for analyzing the microarray data of rat liver regeneration. Many existing gene selection methods cannot outstand the interactions among the selected genes. In the process of rat liver regeneration, one of the most important events involved in many biological processes is the proliferation of rat hepatocytes, so it can be used as a measure of the effectiveness of the method. Here we proposed an adaptive sparse group lasso to select genes in groups for rat hepatocyte proliferation. The weighted gene co-expression networks analysis was used to identify modules corresponding to gene pathways, based on which a strategy of dividing genes into groups was proposed. A strategy of adaptive gene selection was also presented by assessing the gene significance and introducing the adaptive lasso penalty. Moreover, an improved blockwise descent algorithm was proposed. Experimental results demonstrated that the proposed method can improve the classification accuracy, and select less number of significant genes which act jointly in groups and have direct or indirect effects on rat hepatocyte proliferation. The effectiveness of the method was verified by the method of rat hepatocyte proliferation.  相似文献   

8.
Improved binary PSO for feature selection using gene expression data   总被引:2,自引:0,他引:2  
Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnosis tool. Compared to the number of genes involved, available training data sets generally have a fairly small sample size in cancer type classification. These training data limitations constitute a challenge to certain classification methodologies. A reliable selection method for genes relevant for sample classification is needed in order to speed up the processing rate, decrease the predictive error rate, and to avoid incomprehensibility due to the large number of genes investigated. Improved binary particle swarm optimization (IBPSO) is used in this study to implement feature selection, and the K-nearest neighbor (K-NN) method serves as an evaluator of the IBPSO for gene expression data classification problems. Experimental results show that this method effectively simplifies feature selection and reduces the total number of features needed. The classification accuracy obtained by the proposed method has the highest classification accuracy in nine of the 11 gene expression data test problems, and is comparative to the classification accuracy of the two other test problems, as compared to the best results previously published.  相似文献   

9.
We use a newly developed feature extraction and classification method to analyze previously published gene expression data sets in Oral Squamous Cell Carcinoma and in healthy oral mucosa in order to find a gene set sufficient for diagnoses. The feature selection technology is based on the relative dichotomy power concept published by us earlier. The resulting biomarker panel has 100% sensitivity and 95% specificity, is enriched in genes associated with oncogenesis and invasive tumor growth, and, unlike marker panels devised in earlier studies, shows concordance with previously published marker genes.  相似文献   

10.
Cell-based biosensors utilize functional changes in cellular response to identify the biological threats in a physiological relevant manner. Cell-based sensors have been used for a wide array of applications including toxicological assessment and drug-screening. In this paper, we utilize DNA arrays to identify differential gene expression events induced by toxin exposure for the purpose of developing a reporter gene assay system compatible with insertion into a cell-based sensor platform. HT29, an intestine epithelial cell line, was used as a cell model to study the cholera toxin (CT)-induced host cell modulation using DNA array analysis. A false positive model was generated from analysis of housekeeping genes in untreated control experiments to characterize our system and to minimize the number of false positives in the data. Threshold probability scores (−3.72), which gives <0.02% false positives for up/down regulation from the false positive model, were used to identify 73 and 25 known genes/expression tag sequences (ESTs) that were up- and down-regulated, respectively, in cells exposed 23 nM of CT. Using quantitative multiplex PCR assay, the gene expression levels for several genes shown to be modulated according to the microarray experiments, such as apolipoprotein D (Apol D), E-cadherin, and cyclin A2, were confirmed. The differential expression of genes encoding cytochrome P450, glutathione transferase (GST), and MGAT2 were noteworthy and consistent with previous studies. Our study provides an approach to analyze cDNA microarray data with defined false positive rates. The utility of cDNA microarray information for the design of cell-based sensor using a reporter gene approach is discussed.  相似文献   

11.
Gene regulatory networks inference is currently a topic under heavy research in the systems biology field. In this paper, gene regulatory networks are inferred via evolutionary model based on time-series microarray data. A non-linear differential equation model is adopted. Gene expression programming (GEP) is applied to identify the structure of the model and least mean square (LMS) is used to optimize the parameters in ordinary differential equations (ODEs). The proposed work has been first verified by synthetic data with noise-free and noisy time-series data, respectively, and then its effectiveness is confirmed by three real time-series expression datasets. Finally, a gene regulatory network was constructed with 12 Yeast genes. Experimental results demonstrate that our model can improve the prediction accuracy of microarray time-series data effectively.  相似文献   

12.
Gene expression data sets hold the promise to provide cancer diagnosis on the molecular level. However, using all the gene profiles for diagnosis may be suboptimal. Detection of the molecular signatures not only reduces the number of genes needed for discrimination purposes, but may elucidate the roles they play in the biological processes. Therefore, a central part of diagnosis is to detect a small set of tumor biomarkers which can be used for accurate multiclass cancer classification. This task calls for effective multiclass classifiers with built-in biomarker selection mechanism. We propose the sparse optimal scoring (SOS) method for multiclass cancer characterization. SOS is a simple prototype classifier based on linear discriminant analysis, in which predictive biomarkers can be automatically determined together with accurate classification. Thus, SOS differentiates itself from many other commonly used classifiers, where gene preselection must be applied before classification. We obtain satisfactory performance while applying SOS to several public data sets.  相似文献   

13.
建立了一种基于不相交主成分分析(Disjoint PCA)和遗传算法(GA)的特征变量选择方法, 并用于从基因表达谱(Gene expression profiles)数据中识别差异表达的基因. 在该方法中, 用不相交主成分分析评估基因组在区分两类不同样品时的区分能力; 用GA寻找区分能力最强的基因组; 所识别基因的偶然相关性用统计方法评估. 由于该方法考虑了基因间的协同作用更接近于基因的生物过程, 从而使所识别的基因具有更好的差异表达能力. 将该方法应用于肝细胞癌(HCC)样品的基因芯片数据分析, 结果表明, 所识别的基因具有较强的区分能力, 优于常用的基因芯片显著性分析(Significance analysis of microarrays, SAM)方法.  相似文献   

14.
Wang Q  Li HD  Xu QS  Liang YZ 《The Analyst》2011,136(7):1456-1463
Selecting a small subset of informative genes plays an important role in accurate prediction of clinical tumor samples. Based on model population analysis, a novel variable selection method, called noise incorporated subwindow permutation analysis (NISPA), is proposed in this study to work with support vector machines (SVMs). The essence of NISPA lies in the point that one noise variable is added into each sampled sub-dataset and then the distribution of variable importance of the added noise could be computed and serves as the common reference to evaluate the experimental variables. Further, by using the non-parametric Mann-Whitney U test, a P value can be assigned to each variable which describes to what extent the distributions of the gene variable and the noise variable are different. According to the computed P values, all the variables could be ranked and then a small subset of informative variables could be determined to build the model. Moreover, by NISPA, we are the first to distinguish the variables into a more detailed classification as informative, uninformative (noise) and interfering variables in comparison with other methods. In this study, two microarray datasets are employed to evaluate the performance of NISPA. The results show that the prediction errors of SVM classifiers could be significantly reduced by variable selection using NISPA. It is concluded that NISPA is a good alternative of variable selection algorithm.  相似文献   

15.
Dimension reduction is a crucial technique in machine learning and data mining, which is widely used in areas of medicine, bioinformatics and genetics. In this paper, we propose a two-stage local dimension reduction approach for classification on microarray data. In first stage, a new L1-regularized feature selection method is defined to remove irrelevant and redundant features and to select the important features (biomarkers). In the next stage, PLS-based feature extraction is implemented on the selected features to extract synthesis features that best reflect discriminating characteristics for classification. The suitability of the proposal is demonstrated in an empirical study done with ten widely used microarray datasets, and the results show its effectiveness and competitiveness compared with four state-of-the-art methods. The experimental results on St Jude dataset shows that our method can be effectively applied to microarray data analysis for subtype prediction and the discovery of gene coexpression.  相似文献   

16.
PK-means: A new algorithm for gene clustering   总被引:3,自引:0,他引:3  
Microarray technology has been widely applied in study of measuring gene expression levels for thousands of genes simultaneously. Gene cluster analysis is found useful for discovering the function of gene because co-expressed genes are likely to share the same biological function. K-means is one of well-known clustering methods. However, it is sensitive to the selection of an initial clustering and easily becoming trapped in a local minimum. Particle-pair optimizer (PPO) is a variation on the traditional particle swarm optimization (PSO) algorithm, which is stochastic particle-pair based optimization technique that can be applied to a wide range of problems. In this paper we bridges PPO and K-means within the algorithm PK-means for the first time. Our results indicate that PK-means clustering is generally more accurate than K-means and Fuzzy K-means (FKM). PK-means also has better robustness for it is less sensitive to the initial randomly selected cluster centroids. Finally, our algorithm outperforms these methods with fast convergence rate and low computation load.  相似文献   

17.
18.
A Bayesian network (BN) is a knowledge representation formalism that has proven to be a promising tool for analyzing gene expression data. Several problems still restrict its successful applications. Typical gene expression databases contain measurements for thousands of genes and no more than several hundred samples, but most existing BNs learning algorithms do not scale more than a few hundred variables. Current methods result in poor quality BNs when applied in such high-dimensional datasets. We propose a hybrid constraint-based scored-searching method that is effective for learning gene networks from DNA microarray data. In the first phase of this method, a novel algorithm is used to generate a skeleton BN based on dependency analysis. Then the resulting BN structure is searched by a scoring metric combined with the knowledge learned from the first phase. Computational tests have shown that the proposed method achieves more accurate results than state-of-the-art methods. This method can also be scaled beyond datasets with several hundreds of variables.  相似文献   

19.
《Analytical letters》2012,45(18):2833-2842
Traditional gene expression programming for classification is designed for binary decisions. Herein, projection discriminant analysis for direct multiclass categorization using gene expression programming is described. Gene expression programming was first employed to examine new synthetic variables that were built as nonlinear combinations of the original features. The data were projected on planes spanned by these new synthetic variables and the nearest centroid was employed to classify new samples. A new objective function was formulated to determine optimum synthetic variables. Direct multiclass categorization using a gene expression programming algorithm was used to classify six tea varieties analyzed by near infrared spectroscopy. Compared with traditional gene expression programming, principal component analysis, and linear discriminant analysis, direct multiclass categorization with gene expression programming algorithm was more efficient. Visual inspection of high dimensional data by this approach also facilitated classification and comprehension of data.  相似文献   

20.
Biomarker discovery is a typical application from functional genomics. Due to the large number of genes studied simultaneously in microarray data, feature selection is a key step. Swarm intelligence has emerged as a solution for the feature selection problem. However, swarm intelligence settings for feature selection fail to select small features subsets. We have proposed a swarm intelligence feature selection algorithm based on the initialization and update of only a subset of particles in the swarm. In this study, we tested our algorithm in 11 microarray datasets for brain, leukemia, lung, prostate, and others. We show that the proposed swarm intelligence algorithm successfully increase the classification accuracy and decrease the number of selected features compared to other swarm intelligence methods.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号