Similar literature
20 similar articles found
1.
Classification problems have received considerable attention in biological and medical applications. In particular, classification methods combined with microarray technology play an important role in diagnosing and predicting diseases, such as cancer, in medical research. The primary objective in classification is to build an optimal classifier based on the training sample in order to predict the unknown class in the test sample. In this paper, we propose a unified approach for optimal gene classification in conjunction with functional principal component analysis (FPCA) in a functional data analysis (FNDA) framework to classify time-course gene expression profiles based on information from the patterns. To derive an optimal classifier in FNDA, we also propose to find the optimal number of bases in the smoothing step and of functional principal components in FPCA using a cross-validation technique, and compare the performance of some popular classification techniques in the proposed setting. We illustrate the proposed method with a simulation study and a real-world data analysis.

2.
The classification of cancer is a major research topic in bioinformatics. The high dimensionality and small sample size associated with gene expression data, however, make the classification quite challenging. Although principal component analysis (PCA) is of particular interest for high-dimensional data, it may overemphasize some aspects and ignore other important information contained in the richly complex data, because it displays only the difference in the first two- or three-dimensional PC subsp...

3.
DNA microarray data has been widely used in cancer research because it helps to distinguish successfully between tumor classes. However, typical gene expression data is usually high-dimensional and imbalanced, which poses a severe challenge for traditional machine learning methods trying to construct a robust classifier that performs well on both the minority and majority classes. As one of the most successful feature weighting techniques, Relief is considered particularly well suited to handling high-dimensional problems. Unfortunately, almost all Relief-based methods fail to take the imbalanced class distribution into account. This study finds that existing Relief-based algorithms may underestimate the features that discriminate the minority classes, and ignore the distribution characteristics of minority class samples. As a result, an additional bias towards classification into the majority classes can be introduced. To this end, a new method, named imRelief, is proposed for efficiently handling high-dimensional imbalanced gene expression data. imRelief can correct the bias towards the majority classes, and considers the scattered distribution of minority class samples when estimating feature weights. In this way, imRelief is able to reward the features that perform well at separating the minority classes from other classes. Experiments on four microarray gene expression data sets demonstrate the effectiveness of imRelief in both feature weighting and feature subset selection applications.
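
The abstract above does not give imRelief's update rule, so it is not reproduced here. As a point of reference, the following is a minimal sketch of the classic Relief weighting that imRelief sets out to correct, assuming numeric features already scaled to [0, 1], Manhattan distance, and a pass over every sample. The function name `relief_weights` and the toy data are hypothetical.

```python
def relief_weights(X, y):
    """Classic Relief (not imRelief): a feature gains weight when it differs
    at the nearest miss and loses weight when it differs at the nearest hit."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def dist(a, b):
        # Manhattan distance; assumes features are already scaled to [0, 1]
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # a sample without a hit or a miss cannot be scored
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            diff_miss = abs(X[i][f] - X[miss][f])
            diff_hit = abs(X[i][f] - X[hit][f])
            weights[f] += (diff_miss - diff_hit) / n
    return weights


# Hypothetical imbalanced toy data: four majority samples (class 0) and
# two minority samples (class 1). Feature 0 separates the classes,
# feature 1 is noise.
X = [[0.1, 0.2], [0.2, 0.9], [0.15, 0.5], [0.05, 0.7], [0.9, 0.3], [0.95, 0.8]]
y = [0, 0, 0, 0, 1, 1]
w = relief_weights(X, y)
```

Because every sample contributes equally to the sum, the majority class dominates the weights in this plain version, which is the kind of bias the abstract describes.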

4.
Class prediction based on DNA microarray data has emerged as one of the most important applications of bioinformatics for diagnostics and prognostics. Robust classifiers are needed that use the most biologically relevant genes embedded in the data. A consensus approach that combines multiple classifiers has attributes that mitigate this difficulty compared to a single classifier. A new classification method named consensus analysis of multiple classifiers using non-repetitive variables (CAMCUN) was proposed for the analysis of hyper-dimensional gene expression data. The CAMCUN method combined multiple classifiers, each of which was built from distinct, non-repeated genes that were selected for effectiveness in class differentiation. Thus, CAMCUN utilized the most biologically relevant genes in the final classifier. The CAMCUN algorithm was demonstrated to give consistently more accurate predictions for two well-known datasets for prostate cancer and leukemia. Importantly, the CAMCUN algorithm employed an integrated 10-fold cross-validation and randomization test to assess the degree of confidence of the predictions for unknown samples.

5.
The application of a new method to the multivariate analysis of incomplete data sets is described. The new method, called maximum likelihood principal component analysis (MLPCA), is analogous to conventional principal component analysis (PCA), but incorporates measurement error variance information in the decomposition of multivariate data. Missing measurements can be handled in a reliable and simple manner by assigning large measurement uncertainties to them. The problem of missing data is pervasive in chemistry, and MLPCA is applied to three sets of experimental data to illustrate its utility. For exploratory data analysis, a data set from the analysis of archeological artifacts is used to show that the principal components extracted by MLPCA retain much of the original information even when a significant number of measurements are missing. Maximum likelihood projections of censored data can often preserve original clusters among the samples and can, through the propagation of error, indicate which samples are likely to be projected erroneously. To demonstrate its utility in modeling applications, MLPCA is also applied in the development of a model for chromatographic retention based on a data set which is only 80% complete. MLPCA can predict missing values and assign error estimates to these points. Finally, the problem of calibration transfer between instruments can be regarded as a missing data problem in which entire spectra are missing on the ‘slave’ instrument. Using NIR spectra obtained from two instruments, it is shown that spectra on the slave instrument can be predicted from a small subset of calibration transfer samples even if a different wavelength range is employed. Concentration prediction errors obtained by this approach were comparable to cross-validation errors obtained for the slave instrument when all spectra were available.

6.
Solid-phase microextraction in headspace mode coupled with gas chromatography-mass spectrometry was applied to the determination of volatile compounds in 30 commercially available coffee samples. In order to differentiate and characterize Arabica and Robusta coffee, six major volatile compounds (acetic acid, 2-methylpyrazine, furfural, 2-furfuryl alcohol, 2,6-dimethylpyrazine, 5-methylfurfural) were chosen as the most relevant markers. Cluster analysis and principal component analysis (PCA) were applied to the raw chromatographic data and data processed by centred logratio transformation.
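
The centred logratio (clr) transformation mentioned above maps each part of a composition to its logarithm minus the mean logarithm of all parts. A minimal sketch follows; the peak areas are hypothetical values for six markers in one sample, not data from the study.

```python
import math


def clr(composition):
    """Centred log-ratio transform of one strictly positive composition."""
    if any(v <= 0 for v in composition):
        raise ValueError("clr is undefined for zero or negative parts")
    logs = [math.log(v) for v in composition]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log equals dividing by the geometric mean
    return [lv - mean_log for lv in logs]


# Hypothetical peak areas for six volatile markers in one coffee sample
peak_areas = [120.0, 35.0, 80.0, 210.0, 15.0, 40.0]
transformed = clr(peak_areas)

# Doubling every peak area (e.g. a larger injection) leaves clr unchanged
rescaled = clr([2.0 * v for v in peak_areas])
```

The transformed values sum to zero and do not change when the whole sample is rescaled, which is why the transform is commonly applied to relative chromatographic data before cluster analysis or PCA.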

7.
Improved binary PSO for feature selection using gene expression data   (Cited by: 2; self-citations: 0; cited by others: 2)
Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnosis tool. Compared to the number of genes involved, available training data sets generally have a fairly small sample size in cancer type classification. These training data limitations constitute a challenge to certain classification methodologies. A reliable method for selecting the genes relevant to sample classification is needed in order to speed up processing, decrease the predictive error rate, and avoid incomprehensibility due to the large number of genes investigated. Improved binary particle swarm optimization (IBPSO) is used in this study to implement feature selection, and the K-nearest neighbor (K-NN) method serves as an evaluator of the IBPSO for gene expression data classification problems. Experimental results show that this method effectively simplifies feature selection and reduces the total number of features needed. Compared with the best results previously published, the proposed method achieves the highest classification accuracy in nine of the 11 gene expression data test problems and comparable accuracy in the other two.

8.
DNA arrays have become the immediate choice in the analysis of large-scale expression measurements. Understanding the expression pattern of genes provides functional information on newly identified genes by computational approaches. The gene expression pattern is an indicator of the state of the cell, and abnormal cellular states can be inferred by comparing expression profiles. Since co-regulated genes, and genes involved in a particular pathway, tend to show similar expression patterns, clustering expression patterns has become the natural method of choice to differentiate groups. However, most methods based on cluster analysis suffer from the usual problems: (i) dead units, and (ii) determining the correct number of clusters (k) needed to classify the data. Selecting k has been an open problem in pattern recognition and statistics for decades. Since clustering reveals similar patterns present in the data, fixing this number strongly influences the quality of the result. While there is no theoretical solution to this problem, the number of clusters can be decided by a heuristic clustering algorithm called rival penalized competitive learning (RPCL). We present a novel implementation of RPCL that transforms the problem of finding the correct number of clusters into the tractable problem of clustering based on the degree of similarity. This is biologically significant since our implementation clusters functionally co-regulated genes and genes that present similar patterns of expression. This new approach reveals potential genes that are co-involved in a biological process. This implementation of the RPCL algorithm is useful in differentiating groups involved in concerted functional regulation and helps to home in progressively on patterns that are closely similar.

9.
High resolution time-of-flight secondary ion mass spectrometry (HR TOF-SIMS) is a powerful surface analytical method. For complex samples, this technique may yield intricate spectra that are difficult to interpret visually. Chemometric methods are useful for data analysis. However, these methods require that spectra are represented in a matrix format. Variances in mass measurements caused by calibration or instrumental effects may present difficulties in properly aligning mass spectral peaks into the correct columns of the data matrix. Cluster analysis of resolution elements is proposed as an alternative approach to construct the data matrix. An automated method for optimizing the data alignment is presented and evaluated for standard steel samples.

10.
11.
The dispersion of the quantitative results in the analysis of volatile compounds from multicomponent mixtures by different fractionation techniques (solid-phase microextraction and direct thermal desorption) followed by GC or GC-MS presents nonrandom patterns related to the existence of different factors in the fractionation process or in the chromatographic separation which affect, to a different extent, the recovery of the sample components. Statistical techniques have been used to show the relative importance of these factors. The improvement in data precision achieved by using volatile compound concentration ratios is discussed.
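
One way to see why concentration ratios can improve precision: when a factor shifts the recovery of two compounds in the same direction, most of that shift cancels in their ratio. The sketch below uses hypothetical replicate peak areas chosen so that both compounds rise and fall together; it is an illustration of the arithmetic, not data from the paper.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (sample standard deviation / mean) in percent."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical replicate peak areas: recovery varies from run to run and
# affects both compounds in the same direction, though not identically.
compound_a = [80.0, 100.0, 120.0]
compound_b = [41.0, 50.0, 59.0]
ratios = [a / b for a, b in zip(compound_a, compound_b)]

rsd_a = rsd_percent(compound_a)
rsd_b = rsd_percent(compound_b)
rsd_ratio = rsd_percent(ratios)
```

With these numbers the individual areas scatter by roughly 20% while the ratio scatters by only a few percent. The benefit shrinks as the factor affects the two compounds more unequally.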

12.
High-throughput data have been widely used in biological and medical studies to discover gene and protein functions. Because of the high dimensionality, principal component analysis (PCA) is often used for dimension reduction. When a few principal components (PCs) are selected for dimension reduction or considered for dimension determination, they are typically ranked by their variances (eigenvalues). However, this approach is not always effective in subsequent multivariate analysis, particularly classification. To maximize the information obtained from the data with a subset of the components, we apply a different ranking criterion, the canonical variate criterion, which considers within- and between-group variance rather than the total variance used in the classical criterion. Four prevalent classification methods are considered and compared using leave-one-out cross-validation. These methods are illustrated with three real high-throughput data sets: two microarray data sets and a nuclear magnetic resonance spectra data set.
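
The abstract does not state the exact form of the canonical variate criterion. A minimal sketch of the idea, assuming it amounts to ranking each component by the ratio of between-group to within-group sum of squares of its scores, is given below. The function names and the score values are hypothetical.

```python
def between_within_ratio(scores, labels):
    """Between-group over within-group sum of squares for one component."""
    grand_mean = sum(scores) / len(scores)
    groups = {}
    for s, g in zip(scores, labels):
        groups.setdefault(g, []).append(s)
    between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    within = sum((s - sum(v) / len(v)) ** 2 for v in groups.values() for s in v)
    return between / within if within > 0 else float("inf")


def rank_components(score_columns, labels):
    """Indices of components, best group separation first."""
    ratios = [between_within_ratio(col, labels) for col in score_columns]
    return sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)


# Hypothetical PC scores for six samples in two groups.
# PC1 has the larger variance but does not separate the groups;
# PC2 has a small variance but separates them cleanly.
labels = ["a", "a", "a", "b", "b", "b"]
pc1 = [-5.0, 0.0, 5.0, -5.0, 0.0, 5.0]
pc2 = [-1.0, -0.9, -1.1, 1.0, 0.9, 1.1]
order = rank_components([pc1, pc2], labels)
```

Ranking by variance would put PC1 first, whereas the group-aware ratio puts PC2 first, which illustrates why the two criteria can select different components for classification.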

13.
One of the most important physicochemical parameters of a molecule that determines its bioactivity is its lipophilicity. Cluster analysis (CA), principal component analysis (PCA), and sum of ranking differences (SRD) were used to compare the lipophilic parameters of twenty phenylacetamide derivatives, obtained experimentally as chromatographic retention data in the presence of different solvents and calculated by different mathematical methods. All the applied methods of multivariate analysis gave approximately similar groupings of the studied lipophilic parameters. In the attempt to group the investigated compounds with respect to their lipophilicity, the results obtained appeared to depend on the chemometric method applied. CA and PCA grouped the compounds on the basis of the nature of the substituents R1 and R2, indicating that these determine to a great extent the lipophilicity of the investigated molecules. By contrast, the SRD method could not be used to group the studied compounds on the basis of their lipophilic character.

14.
Data analysis is an essential tenet of analytical chemistry, extending the possible information obtained from the measurement of chemical phenomena. Chemometric methods have grown considerably in recent years, but their wide use is hindered because some still consider them too complicated. The purpose of this review is to describe a multivariate chemometric method, principal component regression, in a simple manner from the point of view of an analytical chemist, to demonstrate the need for proper quality-control (QC) measures in multivariate analysis and to advocate the use of residuals as a proper QC method.

15.
This commentary highlights the issue of real differences between the stationary phases studied in an experimental paper entitled “Novel stationary phases based on asphaltenes for gas chromatography” by Grzegorz Boczkaj and co-authors (J. Sep. Sci. 2016, 39, 2527–2536). In particular, a chemometric study revealed relatively small differences between the stationary phases investigated. Moreover, simple principal component analysis calculations made it possible to identify outlier points within the large raw dataset and to find the parameters (variables) that may carry equal information.

16.
Gene expression patterns from NCI's panel of 60 cell lines were used to train a Neural Network model for classifying genes to pathways. The model assigns probabilities to each gene for each of the 21 modeled pathways assigned by the Kyoto Encyclopedia of Genes and Genomes. Cross-validation of the model showed that 10 of the 21 pathways exhibited good performance in statistical significance and accuracy. The model was designed to output gene probabilities that could be screened for higher probabilities resulting in higher confidence in classification though yielding fewer genes per pathway. The model was deployed on 5798 genes and our approach allowed us to ascertain the most relevant genes above an estimated background. Eight pathways were identified with both good cross-validation and significant numbers above background, TCA Cycle, Oxidative Phosphorylation, Porphyrin Biosynthesis, Ribosome, Polymerases, Proteasome, Cell Cycle, and Cell Adhesion. Gene Ontology (GO) annotation was used for additional validation of gene classification results. A total of 551 GO annotated genes and 468 unannotated genes were classified to the 8 pathways. The primary and secondary classifications of genes revealed known pathway relationships and provide the potential for discovering new pathway relationships.

17.
Genomics-based technologies in systems biology have gained a lot of popularity in recent years. These technologies generate large amounts of data. To obtain information from this data, multivariate data analysis methods are required. Many of the datasets generated in genomics are multilevel datasets, in which the variation occurs on different levels simultaneously (e.g. variation between organisms and variation in time). We introduce multilevel component analysis (MCA) into the field of metabolic fingerprinting to separate these different types of variation. This is in contrast to the commonly used principal component analysis (PCA) that is not capable of doing this: in a PCA model the different types of variation in a multilevel dataset are confounded.

MCA generates different submodels for different types of variation. These submodels are lower-dimensional component models in which the variation is approximated. These models are easier to interpret than the original data. Multilevel simultaneous component analysis (MSCA) is a method within the class of MCA models with increased interpretability, due to the fact that the time-resolved variation of all individuals is expressed in the same subspace.

MSCA is applied on a time-resolved metabolomics dataset. This dataset contains 1H NMR spectra of urine collected from 10 monkeys at 29 time-points during 2 months. The MSCA model contains a submodel describing the biorhythms in the urine composition and a submodel describing the variation between the animals. Using MSCA the largest biorhythms in the urine composition and the largest variation between the animals are identified.

Comparison of the MSCA model to a PCA model of this data shows that the MSCA model is easier to interpret: it gives a better view of the different types of variation in the data, since they are not confounded.


18.
The development of a classification system based on the Raman spectra of milk samples is proposed in the present study. Such a system could be useful for nutritionists in suggesting healthy food for infants to support their proper growth. Previously, molecular structures in milk samples have been explored by Raman spectroscopy. In the current study, Raman spectral data of milk samples from different species are used for multi-class classification by combining a dimensionality reduction technique with a random forest (RF) classifier. Quantitative and experimental analysis is based on locally collected milk samples from different species, including cow, buffalo, goat and human. The classification is based on variations in the intensities of the Raman peaks of the milk samples, which reflect different concentrations of the components present in milk, such as proteins, milk fats and lactose. Principal component analysis (PCA) is used as the dimensionality reduction technique in combination with RF to highlight the variations that can differentiate the Raman spectra of milk samples from different species. The proposed technique has demonstrated sufficient potential for differentiating between milk samples of different species, achieving an average accuracy of about 93.7%, precision of about 94%, specificity of about 97% and sensitivity of about 93%.

19.
PK-means: A new algorithm for gene clustering   (Cited by: 3; self-citations: 0; cited by others: 3)
Microarray technology has been widely applied to measure the expression levels of thousands of genes simultaneously. Gene cluster analysis is useful for discovering gene function because co-expressed genes are likely to share the same biological function. K-means is one of the well-known clustering methods. However, it is sensitive to the selection of the initial clustering and easily becomes trapped in a local minimum. Particle-pair optimizer (PPO) is a variation on the traditional particle swarm optimization (PSO) algorithm, a stochastic particle-pair-based optimization technique that can be applied to a wide range of problems. In this paper we bridge PPO and K-means within the PK-means algorithm for the first time. Our results indicate that PK-means clustering is generally more accurate than K-means and Fuzzy K-means (FKM). PK-means is also more robust because it is less sensitive to the initial randomly selected cluster centroids. Finally, our algorithm outperforms these methods, with a fast convergence rate and a low computational load.
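
The sensitivity of plain K-means to its starting centroids, which motivates PK-means, can be shown on a small example. The sketch below is ordinary Lloyd-style K-means on hypothetical one-dimensional data; it does not implement PPO or PK-means, whose details are not given in the abstract.

```python
def kmeans_1d(points, centroids, iterations=50):
    """Lloyd's K-means on 1-D data from the given starting centroids.

    Returns the final centroids and the within-cluster sum of squared errors.
    """
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid
        updated = [
            sum(c) / len(c) if c else centroids[k] for k, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments no longer change
        centroids = updated
    sse = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, sse


# Hypothetical data with three well-separated groups
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

good_centroids, good_sse = kmeans_1d(points, [0.5, 10.5, 20.5])
# Two starting centroids inside the first group trap the algorithm
bad_centroids, bad_sse = kmeans_1d(points, [0.0, 1.0, 15.0])
```

Starting with two centroids inside the first group, the algorithm converges with the second and third groups merged and a much larger squared error. This is the local-minimum behaviour that a global search over initial centroids is meant to avoid.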

20.
In spectroscopy the measured spectra are typically plotted as a function of the wavelength (or wavenumber), but analysed with multivariate data analysis techniques (multiple linear regression (MLR), principal components regression (PCR), partial least squares (PLS)) which consider the spectrum as a set of m different variables. From a physical point of view it could be more informative to describe the spectrum as a function rather than as a set of points, thereby taking into account the physical background of the spectrum, being a sum of absorption peaks for the different chemical components, where the absorbances at two wavelengths close to each other are highly correlated. In a first part of this contribution, a motivating example for this functional approach is given. In a second part, the potential of functional data analysis is discussed in the field of chemometrics and compared to the ubiquitous PLS regression technique using two practical data sets. It is shown that for spectral data, B-splines prove to be an appealing basis to accurately describe the data. By applying both functional data analysis and PLS to the data sets, the predictive ability of functional data analysis is found to be comparable to that of PLS. Moreover, many chemometric datasets have some specific structure (e.g. replicate measurements on the same object, or objects that are grouped), but the structure is often removed before analysis (e.g. by averaging the replicates). In the second part of this contribution, we suggest a method to adapt traditional analysis of variance (ANOVA) methods to datasets with spectroscopic data. In particular, the possibilities to explore and interpret sources of variation, such as variations in sample and ambient temperature, are examined. Copyright © 2008 John Wiley & Sons, Ltd.
