Similar articles
 20 similar articles found
1.
Gas chromatography and pattern recognition methods were used to develop a potential method for differentiating European honeybees from Africanized honeybees. The test data consisted of 237 gas chromatograms of hydrocarbon extracts obtained from the wax glands, cuticle, and exocrine glands of European and Africanized honeybees. Each gas chromatogram contained 65 peaks corresponding to a set of standardized retention time windows. A genetic algorithm (GA) for pattern recognition was used to identify features in the gas chromatograms characteristic of the genotype. The pattern recognition GA searched for features in the chromatograms that optimized the separation of the European and Africanized honeybees in a plot of the two or three largest principal components of the data. Because the largest principal components capture the bulk of the variance in the data, the peaks identified by the pattern recognition GA primarily contained information about differences between gas chromatograms of European and Africanized honeybees. The principal component analysis routine embedded in the fitness function of the pattern recognition GA acted as an information filter, significantly reducing the size of the search space since it restricted the search to feature sets whose principal component plots showed clustering on the basis of the bees' genotype. In addition, the algorithm focused on those classes and/or samples that were difficult to classify as it trained using a form of boosting. Samples that consistently classify correctly are not as heavily weighted as samples that are difficult to classify. Over time, the algorithm learns its optimal parameters in a manner similar to a neural network. The pattern recognition GA integrates aspects of artificial intelligence and evolutionary computations to yield a "smart" one-pass procedure for feature selection and classification.

2.
The water-soluble fraction of aviation jet fuels is examined using solid-phase extraction and solid-phase microextraction. Gas chromatographic profiles of solid-phase extracts and solid-phase microextracts of the water-soluble fraction of kerosene- and nonkerosene-based jet fuels reveal that each jet fuel possesses a unique profile. Pattern recognition analysis reveals fingerprint patterns within the data characteristic of fuel type. By using a novel genetic algorithm (GA) that emulates human pattern recognition through machine learning, it is possible to identify features characteristic of the chromatographic profile of each fuel class. The pattern recognition GA identifies a set of features that optimize the separation of the fuel classes in a plot of the two largest principal components of the data. Because principal components maximize variance, the bulk of the information encoded by the selected features is primarily about the differences between the fuel classes.

3.
In this paper, multivariate calibration of complicated process fluorescence data is presented. Two data sets related to the production of white sugar are investigated. The first data set comprises 106 observations and 571 spectral variables, and the second data set 268 observations and 3997 spectral variables. In both applications, a single response, ash content, is modelled and predicted as a function of the spectral variables. Both data sets contain certain features making multivariate calibration efforts non-trivial. The objective is to show how principal component analysis (PCA) and partial least squares (PLS) regression can be used to overview the data sets and to establish predictively sound regression models. It is shown how a recently developed technique for signal filtering, orthogonal signal correction (OSC), can be applied in multivariate calibration to enhance predictive power. In addition, signal compression is tested on the larger data set using wavelet analysis. It is demonstrated that a compression down to 4% of the original matrix size (in the variable direction) is possible without loss of predictive power. It is concluded that the combination of OSC for pre-processing and wavelet analysis for compression of spectral data is promising for future use.

4.
Du W, Gu T, Tang LJ, Jiang JH, Wu HL, Shen GL, Yu RQ. Talanta, 2011, 85(3): 1689-1694
As a greedy search algorithm, the classification and regression tree (CART) is prone to overfitting when modeling microarray gene expression data. A straightforward solution is to filter out irrelevant genes by identifying the significant ones. Because some significant genes have multi-modal expression patterns that differ systematically among within-class samples and are difficult to identify with existing methods, a strategy for CART modeling is proposed: unimodal transform of variables selected by interval segmentation purity (UTISP). First, significant genes exhibiting varied expression patterns are identified by a variable selection method based on interval segmentation purity. Then, a unimodal transform is applied to provide unimodal variables for CART modeling via feature extraction. Because significant genes with complex expression patterns can be identified and unimodal features extracted in advance, the strategy potentially improves the performance of CART in combating overfitting or underfitting when modeling microarray data. The strategy is demonstrated using two microarray data sets. The results reveal that UTISP-based CART outperforms k-nearest neighbors and CART coupled with other gene identification strategies, indicating that UTISP-based CART holds great promise for microarray data analysis.

5.
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples, each with a large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naive Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.
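
The abstract above does not spell out how its correlation-based feature selector scores a gene subset. As a minimal sketch, assuming the selector follows the widely used CFS merit heuristic (an assumption, not something the abstract confirms), a subset of k features can be scored from its mean feature-class correlation and its mean feature-feature correlation. The function and parameter names below are hypothetical.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS-style merit of a feature subset (illustrative helper, assumed form).

    k: number of features in the subset
    mean_feature_class_corr: average correlation between each feature and the class
    mean_feature_feature_corr: average pairwise correlation among the features
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Relevance grows with k; redundancy among features inflates the denominator.
    denom = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return k * mean_feature_class_corr / denom


# Four equally relevant genes: the less they correlate with each other,
# the higher the subset scores.
independent = cfs_merit(4, 0.5, 0.0)
redundant = cfs_merit(4, 0.5, 1.0)
```

With fully redundant features the merit falls back to that of a single feature, which is why such a score favors small, non-redundant gene sets.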

6.
7.
The main objective of this study was to evaluate the capability of 120 aromatic chemicals to bind to the human alpha estrogen receptor (hER alpha) by the use of quantum similarity methods. The experimental data were segregated into two categories, i.e., compounds with and without estrogenic activity (active and inactive). To identify potential ligands, semiquantitative structure-activity relationships were developed for the complete set, correlating the presence or lack of binding affinity to the estrogen receptor with structural features of the molecules. The structure-activity relationships were based upon molecular similarity indices, which implicitly contain information related to changes in the electron distributions of the molecules, along with indicator variables accounting for several structural features. In addition, the whole set was split into several chemical classes for modeling purposes. Models were validated by dividing the complete set into several training and test sets to allow for external predictions to be made.

8.
In environmental chemistry studies, it may be necessary to analyze data sets composed of different blocks of variables, possibly of different types, measured on the same samples. Multiple factor analysis (MFA) is presented as a tool for exploring such data. The most important features of MFA are shown on a real environmental data set consisting of two blocks of data, namely heavy metals and polycyclic aromatic hydrocarbons, measured for sediment samples. They are discussed and compared to principal component analysis (PCA). The usefulness of the weighting scheme used in MFA as a preprocessing step for other chemometric methods, such as clustering, is also highlighted.

9.
For the clustering of chemical structures that are described by the Similog, ISIS count, and ISIS binary fingerprints, we propose a sequential superparamagnetic clustering approach. To appropriately handle nonbinary feature keys, we introduce an extension of the binary Tanimoto similarity measure. In our applications, data sets composed of structures from seven chemically distinct compound classes are evaluated and correctly clustered. Comparison with results from leading methods indicates the superiority of our sequential superparamagnetic clustering approach.
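
The abstract above does not state which extension of the Tanimoto measure its authors use for count fingerprints. The sketch below shows one common generalization (sum of element-wise minima over sum of element-wise maxima), chosen here only for illustration; it reduces to the usual binary Tanimoto coefficient when all entries are 0 or 1. The function name is hypothetical.

```python
def tanimoto_counts(a, b):
    """Min/max generalization of the Tanimoto coefficient.

    a, b: equal-length sequences of non-negative feature counts.
    This is an assumed stand-in, not necessarily the extension used in the paper.
    """
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero fingerprints are treated as identical by convention.
    return 1.0 if total == 0 else shared / total


# Binary keys: 2 bits in common out of 4 bits set in either fingerprint.
binary_similarity = tanimoto_counts([1, 0, 1, 1], [1, 1, 0, 1])  # 0.5
# Count keys: shared counts 1 + 0 + 3 = 4, maximal counts 2 + 1 + 3 = 6.
count_similarity = tanimoto_counts([2, 0, 3], [1, 1, 3])
```

Because the measure stays within [0, 1] and equals 1 only for identical count vectors, it can replace the binary coefficient in a similarity-based clustering without other changes.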

10.
Because driver pathways play a crucial role in the formation and progression of cancer, identifying them is essential and offers important information for precision or personalized medicine. In this paper, an improved maximum weight submatrix problem model is proposed that integrates three kinds of omics data: somatic mutations, copy number variations, and gene expressions. The model adjusts coverage and mutual exclusivity using the average weight of genes in a pathway and simultaneously considers the correlation among genes, so that pathways with high coverage but moderate mutual exclusivity can be identified. By introducing a short chromosome code and a greedy-based recombination operator, a parthenogenetic algorithm, PGA-MWS, is presented to solve the model. Experimental comparisons among the algorithms GA, MOGA, iMCMC and PGA-MWS were performed on biological and simulated data sets. The results show that, compared with the other three algorithms, PGA-MWS based on the improved model can identify gene sets with high coverage but moderate mutual exclusivity, and that it scales well. Many of the identified gene sets are involved in known signaling pathways, and most of the implicated genes are oncogenes or tumor suppressors previously reported in the literature. The experimental results indicate that the proposed approach may become a useful complementary tool for detecting cancer pathways.

11.
12.
A fuzzy c-means clustering algorithm is presented which is much faster than the traditional algorithm for data sets in which the number of features is significantly larger than the number of feature vectors. The algorithm is constructed by utilizing the covariance structure of feature vectors and cluster centers. By using results from a previous clustering, modified versions of the new algorithm achieve additional reductions in floating point operations. © 1995 by John Wiley & Sons, Inc.

13.
This paper proposes a new method for determining the subset of variables that reproduce as well as possible the main structural features of the complete data set. This method can be useful for pre-treatment of large data sets since it allows discarding variables that contain redundant information. Reducing the number of variables often allows one to better investigate data structure and obtain more stable results from multivariate modelling methods. The novel method is based on the recently proposed canonical measure of correlation (CMC index) between two sets of variables [R. Todeschini, V. Consonni, A. Manganaro, D. Ballabio, A. Mauri, Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 1. Theory and simple chemometric applications, Anal. Chim. Acta submitted for publication (2009)]. Following a stepwise procedure (backward elimination), each variable in turn is compared to all the other variables and the most correlated one is definitively discarded. Finally, a key subset of variables that are as orthogonal as possible is selected. The performance was evaluated on both simulated and real data sets. The effectiveness of the novel method is discussed by comparison with results of other well known methods for variable reduction, such as Jolliffe techniques, McCabe criteria, the Krzanowski approach and its modification based on genetic algorithms, loadings of the first principal component, Key Set Factor Analysis (KSFA), Variance Inflation Factor (VIF), the pairwise correlation approach, and K correlation analysis (KIF). The obtained results are consistent with those of the other considered methods; moreover, the advantage of the proposed CMC method is that the calculation is very quick and can be easily implemented in any software application.
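
The backward elimination loop described above can be sketched in a few lines. The CMC index itself is not defined in the abstract, so this sketch substitutes the mean absolute Pearson correlation with the remaining variables as the redundancy score; that substitution, the function names, and the toy data are all assumptions made for illustration. Constant columns are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def backward_eliminate(columns, keep):
    """Drop the most redundant variable until only `keep` variables remain.

    columns: dict mapping variable name -> list of values.
    Redundancy is the mean |r| against all other remaining variables,
    used here as a stand-in for the CMC index.
    """
    remaining = dict(columns)

    def redundancy(name):
        others = [o for o in remaining if o != name]
        return sum(abs(pearson(remaining[name], remaining[o])) for o in others) / len(others)

    while len(remaining) > max(keep, 1):
        worst = max(remaining, key=redundancy)
        del remaining[worst]  # discarded definitively, never reconsidered
    return list(remaining)


data = {
    "a": [1, 2, 3, 4, 5],
    "b": [1, 2, 3, 4, 6],    # nearly a copy of "a"
    "c": [1, -1, 1, -1, 1],  # almost uncorrelated with both
}
selected = backward_eliminate(data, keep=2)
```

In this toy data, "b" is the variable most correlated with the rest, so it is removed and the two nearly orthogonal variables survive.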

14.
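A feature (variable) selection method based on disjoint principal component analysis (disjoint PCA) and a genetic algorithm (GA) was established and used to identify differentially expressed genes from gene expression profile data. In this method, disjoint PCA evaluates the ability of a gene set to discriminate between two different classes of samples; the GA searches for the gene set with the strongest discriminating ability; and the chance correlation of the identified genes is assessed statistically. Because the method takes into account the synergy among genes, which is closer to the biological processes in which genes take part, the identified genes show better differential expression. The method was applied to gene chip data from hepatocellular carcinoma (HCC) samples. The results show that the identified genes have strong discriminating ability and that the method outperforms the commonly used significance analysis of microarrays (SAM) method.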

15.
This work describes the first approach in the development of a comprehensive classification method for bitterness of small molecules. The data set comprises 649 bitter and 13 530 randomly selected molecules from the MDL Drug Data Repository (MDDR) which are analyzed by circular fingerprints (MOLPRINT 2D) and information-gain feature selection. The feature selection proposes substructural features which are statistically correlated to bitterness. Classification is performed on the selected features via a naïve Bayes classifier. The substructural features upon which the classification is based are able to discriminate between bitter and random compounds, and thus we propose they are also functionally responsible for causing the bitter taste. Such substructures include various sugar moieties as well as highly branched carbon scaffolds. Cynaropicrine contains a number of the substructural features found to be statistically associated with bitterness and thus was correctly predicted to be bitter by our model. In contrast, both promethazine and saccharin contain fewer of these substructural features, and thus the bitterness of these compounds was not identified. Two different classes of bitter compounds were identified, namely those which are larger and contain mainly oxygen and carbon and often sugar moieties, and those which are smaller and contain additional nitrogen and/or sulfur fragments. The classifier is able to predict 72.1% of the bitter compounds. Feature selection reduces the number of false positives but also increases the number of false negatives, so that 69.5% of the bitter compounds are correctly predicted. Overall, the method presented here provides both one of the largest databases of bitter compounds presently available and a relatively reliable classification method.
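
Information-gain feature selection, as mentioned above, ranks each substructural feature by how much knowing its presence reduces the uncertainty about the class label. The sketch below is a generic textbook version for one discrete feature, not the authors' pipeline; the function names and the four-molecule toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


labels = ["bitter", "bitter", "random", "random"]
has_fragment_a = [1, 1, 0, 0]  # present exactly in the bitter molecules
has_fragment_b = [1, 0, 1, 0]  # present equally often in both classes

gain_a = information_gain(has_fragment_a, labels)
gain_b = information_gain(has_fragment_b, labels)
```

A feature selection step of this kind would keep the highest-gain fragments (here fragment A) and pass only those to the classifier.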

16.
Four genetic-algorithm-based approaches to variable selection in spectral data sets are presented. They range from a pure black-box approach to a chemically driven one. The latter uses a fitness function that takes into account not only typical parameters like the number of errors when classifying a training set but also the chemical interpretability of the selected variables. In order to cope with the fact that multiple solutions may be acceptable, a multimodal genetic algorithm (GA) is employed and the most satisfactory solution selected. The multimodal GA uses two populations (denominated "hybrid two populations" GA or HTP-GA): a classical population, from which potential solutions emerge, and a new population, which maintains diversity in the search space (as required by multimodal problems). Results show that the HTP-GA approach improves the chemical understanding of the selected solution (compared to other GA approaches) and that the classification capabilities of the approach are still good. All of the GA strategies for variable selection were compared with a classical parametric technique, Procrustes rotation, which does not consider interpretability.

17.
DNA microarray data have been widely used in cancer research because they help to distinguish successfully between tumor classes. However, typical gene expression data are high-dimensional and imbalanced, which makes it very difficult for traditional machine learning methods to construct a robust classifier that performs well on both the minority and majority classes. As one of the most successful feature weighting techniques, Relief is considered particularly well suited to high-dimensional problems. Unfortunately, almost none of the existing Relief-based methods takes class imbalance into account. This study finds that existing Relief-based algorithms may underestimate features that discriminate the minority classes and ignore the distribution characteristics of minority class samples. As a result, an additional bias towards classification into the majority classes can be introduced. To this end, a new method, named imRelief, is proposed for efficiently handling high-dimensional imbalanced gene expression data. imRelief can correct the bias towards the majority classes and considers the scattered distribution of minority class samples when estimating feature weights. In this way, imRelief rewards features that perform well at separating the minority classes from the other classes. Experiments on four microarray gene expression data sets demonstrate the effectiveness of imRelief in both feature weighting and feature subset selection applications.
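
For orientation, the sketch below shows the classic two-class Relief weighting that methods such as imRelief build on; it is not imRelief itself, whose imbalance corrections are not specified in the abstract. It assumes features already scaled to [0, 1], at least two samples per class, and visits every sample once instead of sampling at random. All names are hypothetical.

```python
def relief_weights(X, y):
    """Classic Relief feature weights for a two-class problem (illustrative).

    X: list of samples, each a list of features scaled to [0, 1]
    y: list of class labels, with at least two samples per class
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))  # Manhattan distance

    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = X[min(same, key=lambda j: distance(X[i], X[j]))]    # nearest same-class sample
        miss = X[min(other, key=lambda j: distance(X[i], X[j]))]  # nearest other-class sample
        for f in range(d):
            # Reward features that differ across classes, penalize those that differ within a class.
            weights[f] += (abs(X[i][f] - miss[f]) - abs(X[i][f] - hit[f])) / n
    return weights


# Feature 0 separates the two classes; feature 1 varies within each class.
X = [[0.0, 0.1], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
```

Every sample contributes equally to the weights here, so a large majority class dominates the update; that is the kind of bias the abstract says imRelief is designed to correct.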

18.
Li-Juan Tang, Hai-Long Wu. Talanta, 2009, 79(2): 260-1694
One problem with discriminant analysis of microarray data is representation of each sample by a large number of genes that are possibly irrelevant, insignificant or redundant. Methods of variable selection are, therefore, of great significance in microarray data analysis. To circumvent the problem, a new gene mining approach is proposed based on the similarity between probability density functions on each gene for the class of interest with respect to the others. This method allows the ascertainment of significant genes that are informative for discriminating each individual class rather than maximizing the separability of all classes. Then one can select genes containing important information about the particular subtypes of diseases. Based on the mined significant genes for individual classes, a support vector machine with local kernel transform is constructed for the classification of different diseases. The combination of the gene mining approach with support vector machine is demonstrated for cancer classification using two public data sets. The results reveal that significant genes are identified for each cancer, and the classification model shows satisfactory performance in training and prediction for both data sets.

19.
Collecting, organizing, and reviewing chemical information associated with screening hits is time-consuming manual work. The task depends highly on the individual, and human errors may result in missed leads or wasted resources. To overcome these hurdles, we have developed a decision support system, Hits Analysis Database (HAD). HAD is a software tool that automatically generates an ISIS database file containing compound structures, biological activities, calculated properties such as clogP, hazard fragment labels, structure classifications, etc. All data are processed by available software and packed into a single SD file. In addition to search capabilities, HAD provides an overview of structural classes and associated activity statistics. Chemical structures can be organized by maximum common substructure clustering. The ease of use and customized features make HAD a chief tool in lead selection processes.

20.
The performance of the algorithm COMPLX for detecting protein-ligand or other macromolecular complexes has been tested for highly complex data sets. These data contain m/z values for ions of proteins of the SWISS-PROT database within simulated biological mixtures where each component shares a similar molecular weight and/or isoelectric point (pI). As many as 1600 ion signals were entered to challenge the algorithm to identify ion signals associated with a single protein complex that has been ionised and detected within a mass spectrometer. Despite the complexity of such data sets, the algorithm is shown to be able to identify the presence of individual bimolecular complexes. The output data can be re-evaluated by the user as necessary in light of any additional information that is known concerning the nature of predicted associations, as well as the quality of the data set in terms of errors in m/z values as a direct consequence of the mass calibration or resolution achieved. The data presented illustrate that the best results are obtained when output results are ranked according to the largest continuous series of ion pairs detected for a protein or macromolecule and its complex for which the ligand mass is assigned the lowest mass error.
