Similar Articles
20 similar articles found (search time: 0 ms)
1.
A fast and objective chemometric classification method is developed and applied to the analysis of gas chromatography (GC) data from five commercial gasoline samples. The gasoline samples serve as model mixtures, whereas the focus is on the development and demonstration of the classification method. The method is based on objective retention time alignment (referred to as piecewise alignment) coupled with analysis of variance (ANOVA) feature selection prior to classification by principal component analysis (PCA) using optimal parameters. The degree-of-class-separation is used as a metric to objectively optimize the alignment and feature selection parameters on a suitable training set, thereby reducing user subjectivity, and to indicate the success of the PCA clustering and classification. The degree-of-class-separation is calculated using Euclidean distances between the PCA scores of a subset of the replicate runs from two of the five fuel types, i.e., the training set. The unaligned training set submitted directly to PCA had a low degree-of-class-separation (0.4), and the PCA scores plot for the raw training set combined with the raw test set failed to cluster the five sample types correctly. After piecewise alignment of the training set, the degree-of-class-separation increased (1.2), but when the same alignment parameters were applied to the training set combined with the test set, the scores plot still did not yield five distinct clusters. Applying feature selection to the unaligned training set increased the degree-of-class-separation (4.8), but chemical variations were still obscured by retention time variation, and when the same feature selection conditions were used for the training set combined with the test set, only one of the five fuels was clustered correctly. However, piecewise alignment coupled with feature selection yielded a near-optimal degree-of-class-separation for the training set (9.2), and when the same alignment and ANOVA parameters were applied to the training set combined with the test set, the PCA scores plot correctly classified the gasoline fingerprints into five distinct clusters.
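The abstract does not give the exact formula for the degree-of-class-separation. A minimal sketch of one plausible formulation — the Euclidean distance between the two class centroids in PCA score space, scaled by the pooled within-class spread — is shown below; the function name, the scaling choice, and the synthetic data are assumptions, not the paper's definition.

```python
import numpy as np
from sklearn.decomposition import PCA

def degree_of_class_separation(scores_a, scores_b):
    """Distance between class centroids in PCA score space, scaled by
    the pooled within-class spread (assumed formulation)."""
    mu_a, mu_b = scores_a.mean(axis=0), scores_b.mean(axis=0)
    between = np.linalg.norm(mu_a - mu_b)
    spread_a = np.linalg.norm(scores_a - mu_a, axis=1).mean()
    spread_b = np.linalg.norm(scores_b - mu_b, axis=1).mean()
    return between / (spread_a + spread_b)

# X_a, X_b: replicate chromatograms (rows) for two fuel types (placeholders)
rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, (6, 500))
X_b = rng.normal(0.5, 1.0, (6, 500))
pca = PCA(n_components=2).fit(np.vstack([X_a, X_b]))
dcs = degree_of_class_separation(pca.transform(X_a), pca.transform(X_b))
print(f"degree of class separation: {dcs:.2f}")
```

In the paper's workflow this value would be recomputed for each candidate alignment/ANOVA parameter setting on the training set, and the setting that maximizes it would be carried over to the combined training-plus-test data.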

2.
Many analytical approaches such as mass spectrometry generate large amounts of data (input variables) per sample analysed, and not all of these variables are important or related to the target output of interest. Selecting a smaller number of variables prior to sample classification is a widespread task in many research studies, where the aim is to find the smallest possible set of variables that still achieves a high level of prediction accuracy; in other words, the most parsimonious solution is needed when the number of input variables is huge but the number of samples/objects is small. Here, we compare several different variable selection approaches in order to ascertain which of these are ideally suited to achieving this goal. All variable selection approaches were applied to the analysis of a common set of metabolomics data generated by Curie-point pyrolysis mass spectrometry (Py-MS), where the goal of the study was to classify Gram-positive bacteria of the genus Bacillus. These approaches include stepwise forward variable selection, used for linear discriminant analysis (LDA); the variable importance for projection (VIP) coefficient, employed in partial least squares-discriminant analysis (PLS-DA); support vector machines-recursive feature elimination (SVM-RFE); and the mean decrease in accuracy and mean decrease in Gini provided by random forests (RF). Finally, a double cross-validation procedure was applied to minimize the consequences of overfitting. The results revealed that RF with its built-in variable selection measures, and SVM combined with SVM-RFE, displayed the best results in comparison to the other approaches.
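A minimal sketch of the two best-performing selectors using scikit-learn; the synthetic data, panel size, and forest size are placeholders, not the paper's Py-MS data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Stand-in for Py-MS spectra: many variables, few samples
X, y = make_classification(n_samples=60, n_features=150, n_informative=10,
                           random_state=0)

# Random forest: rank variables by impurity-based importance (the mean
# decrease in Gini); sklearn.inspection.permutation_importance gives
# the mean-decrease-in-accuracy analogue.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
gini_rank = np.argsort(rf.feature_importances_)[::-1]

# SVM-RFE: recursively drop the variables with the smallest |weights|
# of a linear SVM until only a small panel remains.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10).fit(X, y)
svm_panel = np.where(rfe.support_)[0]

print("top RF variables:", gini_rank[:10])
print("SVM-RFE panel:   ", svm_panel)
```

In the paper both selectors sit inside a double (nested) cross-validation loop, so the selection itself is validated on data the model has never seen.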

3.
4.
5.
A spectral clustering method is presented and applied to two-dimensional molecular structures, where it has been found particularly useful in the analysis of screening data. The method provides a means to quantify (1) the degree of intermolecular similarity within a cluster and (2) the contribution that the features of a molecule make to a cluster. In an application of the spectral clustering method to an example data set of 125 COX-2 inhibitors, these two criteria were used to place the molecules into clusters of chemically related two-dimensional structures.
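The abstract does not specify the implementation. A sketch of spectral clustering on a precomputed molecular similarity matrix follows; the random binary fingerprints, Tanimoto affinity, cluster count, and the occurrence-frequency proxy for feature contribution are all assumptions standing in for the authors' method.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Placeholder: binary substructure fingerprints for 125 molecules
rng = np.random.default_rng(1)
fps = rng.integers(0, 2, size=(125, 256)).astype(bool)

def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity between binary fingerprints."""
    inter = (fps[:, None, :] & fps[None, :, :]).sum(-1)
    union = (fps[:, None, :] | fps[None, :, :]).sum(-1)
    return inter / np.maximum(union, 1)

S = tanimoto_matrix(fps)
labels = SpectralClustering(n_clusters=8, affinity="precomputed",
                            random_state=0).fit_predict(S)

# (2) A feature's contribution to a cluster can be gauged by how much
# more often it occurs inside the cluster than overall (assumed proxy,
# not the paper's criterion).
cluster0 = fps[labels == 0]
contribution = cluster0.mean(0) - fps.mean(0)
print("top contributing bits for cluster 0:", np.argsort(contribution)[-5:])
```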

6.
7.
Sinkov NA, Harynuk JJ. Talanta, 2011, 83(4): 1079-1087
A novel metric termed cluster resolution is presented. This metric compares the separation of clusters of data points while simultaneously considering the shapes of the clusters and their relative orientations. Using cluster resolution in conjunction with an objective variable ranking metric allows for fully automated feature selection for the construction of chemometric models. The metric is based upon considering the maximum size of confidence ellipses around clusters of points representing different classes of objects that can be constructed without any overlap of the ellipses. For demonstration purposes we utilized PCA to classify samples of gasoline based upon their octane rating. The entire GC-MS chromatogram of each sample comprising over 2 × 10⁶ variables was considered. As an example, automated ranking by ANOVA was applied followed by a forward selection approach to choose variables for inclusion. This approach can be generally applied to feature selection for a variety of applications and represents a significant step towards the development of fully automated, objective construction of chemometric models.
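A sketch of the core idea — the largest confidence level at which the two classes' confidence ellipses can be drawn without overlap. The bisection search and the boundary-sampling overlap test are assumptions; the authors' exact ellipse construction may differ.

```python
import numpy as np
from scipy.stats import chi2

def ellipse_boundary(mean, cov, conf, n=360):
    """Points on the confidence ellipse of a 2D Gaussian."""
    r = np.sqrt(chi2.ppf(conf, df=2))
    t = np.linspace(0, 2 * np.pi, n, endpoint=False)
    circle = np.stack([np.cos(t), np.sin(t)])
    return mean + r * (np.linalg.cholesky(cov) @ circle).T

def inside(points, mean, cov, conf):
    """True where a point lies within the given confidence ellipse."""
    d2 = np.einsum("ij,jk,ik->i", points - mean,
                   np.linalg.inv(cov), points - mean)
    return d2 <= chi2.ppf(conf, df=2)

def cluster_resolution(a, b, tol=1e-3):
    """Largest confidence level at which the two clusters' ellipses do
    not overlap (approximate boundary-sampling test, bisection search)."""
    ma, ca = a.mean(0), np.cov(a.T)
    mb, cb = b.mean(0), np.cov(b.T)
    lo, hi = 0.0, 1.0 - 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2
        overlap = (inside(ellipse_boundary(ma, ca, mid), mb, cb, mid).any()
                   or inside(ellipse_boundary(mb, cb, mid), ma, ca, mid).any())
        hi, lo = (mid, lo) if overlap else (hi, mid)
    return lo

rng = np.random.default_rng(0)
a = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 20)   # class 1 scores
b = rng.multivariate_normal([5, 4], [[1, -0.2], [-0.2, 1]], 20)  # class 2 scores
print(f"cluster resolution ≈ {cluster_resolution(a, b):.3f}")
```

Because the ellipses grow monotonically with the confidence level, the bisection is guaranteed to converge; in the automated workflow the metric would be re-evaluated after each candidate variable is added by forward selection.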

8.
Feature selection is a key step in Quantitative Structure Activity Relationship (QSAR) analysis. Chance correlations and multicollinearity are two major problems often encountered when attempting to find generalized QSAR models for use in drug design. Optimal QSAR models require an objective variable relevance analysis step for producing robust classifiers with low complexity and good predictive accuracy. Genetic algorithms coupled with information theoretic approaches such as mutual information have been used to find near-optimal solutions to such multicriteria optimization problems. In this paper, we describe a novel approach for analyzing QSAR data based on these methods. Our experiments with the Thrombin dataset, previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001, demonstrate the feasibility of this approach. We found it important to take into account the data distribution and rule interestingness, and to look at more invariant and monotonic measures of feature selection.
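A toy genetic algorithm with a mutual-information fitness, sketching the GA-plus-information-theory combination the abstract describes. The parsimony penalty weight, population settings, crossover/mutation scheme, and synthetic descriptors are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Stand-in for a QSAR matrix (e.g., Thrombin-style descriptors)
X, y = make_classification(n_samples=80, n_features=200, n_informative=8,
                           random_state=0)
rng = np.random.default_rng(0)
N_POP, N_GEN, N_FEAT = 40, 30, X.shape[1]

def fitness(mask):
    """Multicriteria score: total mutual information of the selected
    descriptors with the activity, penalized by subset size."""
    if not mask.any():
        return -np.inf
    mi = mutual_info_classif(X[:, mask], y, random_state=0).sum()
    return mi - 0.02 * mask.sum()          # parsimony weight is an assumption

pop = rng.random((N_POP, N_FEAT)) < 0.05   # sparse initial chromosomes
for _ in range(N_GEN):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-N_POP // 2:]]       # truncation selection
    cut = rng.integers(1, N_FEAT, N_POP // 2)             # one-point crossover
    kids = np.array([np.r_[a[:c], b[c:]] for a, b, c in
                     zip(parents, parents[::-1], cut)])
    kids ^= rng.random(kids.shape) < 0.01                 # bit-flip mutation
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected descriptors:", np.where(best)[0])
```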

9.
Part of the latest SAMPL challenge was to predict how a small fragment library of 500 commercially available compounds would bind to a protein target. In order to assess the modellers' work, a reasonably comprehensive set of data was collected using a number of techniques, including surface plasmon resonance, isothermal titration calorimetry, protein crystallization and protein crystallography. Using these techniques we could determine the kinetics of fragment binding, the energy of binding, how binding affects the ability of the target to crystallize, and, when the fragment did bind, the pose or orientation of binding. Both the final data set and all of the raw images have been made available to the community for scrutiny and further work. This overview sets out the parameters of the experiments performed and what might be done differently in future studies.

10.
The possibility provided by chemometrics to extract and combine (fuse) information contained in NIR and MIR spectra in order to discriminate monovarietal extra virgin olive oils according to olive cultivar (Casaliva, Leccino, Frantoio) has been investigated. Linear discriminant analysis (LDA) was applied as a classification technique to these multivariate and non-specific spectral data both separately and jointly (NIR and MIR data together). In order to ensure a more appropriate ratio between the number of objects (samples) and the number of variables (absorbance at different wavenumbers), LDA was preceded either by feature selection or by variable compression: for feature selection, the SELECT algorithm was used, while a wavelet transform was applied for data compression. Correct classification rates obtained by cross-validation varied between 60% and 90% depending on the procedure followed. The most accurate results were obtained using the fused NIR and MIR data, with either feature selection or data compression. Chemometric strategies applied to fused NIR and MIR spectra therefore represent an effective method for classifying extra virgin olive oils on the basis of the olive cultivar.
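A minimal sketch of low-level fusion (sample-wise concatenation of the two spectral blocks) followed by feature selection and LDA. The SELECT algorithm is not publicly packaged, so a univariate F-test selector stands in for it; all data here are synthetic placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-ins for NIR and MIR absorbance matrices of the same oil samples
X, y = make_classification(n_samples=60, n_features=400, n_informative=12,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_nir, X_mir = X[:, :200], X[:, 200:]

# Low-level fusion: concatenate the two spectral blocks sample-wise
X_fused = np.hstack([X_nir, X_mir])

# Feature selection before LDA keeps the variable count well below the
# sample count (F-test stand-in for the SELECT algorithm)
model = make_pipeline(SelectKBest(f_classif, k=15),
                      LinearDiscriminantAnalysis())
acc = cross_val_score(model, X_fused, y, cv=5)
print(f"cross-validated correct classification rate: {acc.mean():.0%}")
```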

11.
In the past decade, there has been an increase in the use of sparse multivariate calibration methods in chemometrics. Sparsity describes a parsimonious state of model complexity and can be defined in terms of a subset of samples or covariates (e.g., wavelengths) that are used to define the calibration model. With respect to their classical counterparts such as principal component regression or partial least squares, sparse models are more easily interpretable and have been shown to exhibit non-inferior prediction performance. However, sparse methods are still not as fast as the classical methods in spite of recent numerical advances. In addition, for many chemometricians, sparse methods are still "black-box" algorithms whose internal workings are not well understood. In this paper, we describe a simple framework whereby classical multivariate calibration methods can be iteratively used to generate sparse models. Moreover, this approach allows for either wavelength or sample sparsity. We demonstrate the effectiveness of this approach on two spectroscopic data sets.
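The abstract does not spell out the framework. One plausible instance — iteratively refitting an ordinary PLS model and discarding the wavelengths with the smallest coefficients until a target sparsity is reached — is sketched below; the drop fraction, stopping rule, and synthetic spectra are assumptions, not the authors' method.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def sparse_pls(X, y, n_keep=20, n_components=3, drop_frac=0.2):
    """Iteratively refit PLS, discarding the fraction of surviving
    wavelengths with the smallest |coefficients| each round (one
    plausible instance of the iterative-sparsity idea)."""
    keep = np.arange(X.shape[1])
    while keep.size > n_keep:
        pls = PLSRegression(n_components=n_components).fit(X[:, keep], y)
        coef = np.abs(pls.coef_).ravel()
        n_drop = max(1, min(int(drop_frac * keep.size), keep.size - n_keep))
        keep = keep[np.argsort(coef)[n_drop:]]
    return keep, PLSRegression(n_components=n_components).fit(X[:, keep], y)

# Synthetic spectra: y depends on a handful of wavelengths
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 300))
y = X[:, 40] + 0.5 * X[:, 120] + 0.1 * rng.normal(size=50)
wavelengths, model = sparse_pls(X, y)
print("surviving wavelength indices:", np.sort(wavelengths))
```

Sample sparsity could be obtained the same way by pruning rows instead of columns between refits.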

12.
A new procedure with a high ability to enhance the prediction of multivariate calibration models using a small number of interpretable variables is presented. The core of this methodology is to sort the variables according to an informative vector, followed by a systematic investigation of PLS regression models with the aim of finding the most relevant set of variables by comparing the cross-validation parameters of the models obtained. In this work, seven main informative vectors, i.e., the regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc), and signal-to-noise ratio vector (StN), as well as their combinations, were automated and tested with the main purpose of feature selection. Six data sets from different sources were employed to validate this methodology, originating from near-infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure-activity relationships (QSAR) and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, the regression and NAS informative vectors from partial least squares (PLS) regression, both built using more latent variables than were used to build the final model, were the best informative vectors for variable selection in most of the tested data sets. In all the applications, the selected variables were quite effective and useful for interpretation.
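A sketch of the sort-then-investigate loop using the PLS regression vector as the informative vector (one of the seven candidates). The latent-variable counts, step size, and synthetic spectra are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 250))                 # stand-in NIR spectra
y = 2 * X[:, 10] - X[:, 90] + 0.1 * rng.normal(size=60)

# Informative vector: the PLS regression vector, built with more latent
# variables than the final model will use (as the paper suggests)
b = np.abs(PLSRegression(n_components=8).fit(X, y).coef_).ravel()
order = np.argsort(b)[::-1]                    # variables sorted by relevance

# Systematic investigation: grow the variable set along the ranking and
# keep the size with the lowest cross-validated RMSE
best = (np.inf, None)
for k in range(2, 60, 2):
    cols = order[:k]
    y_cv = cross_val_predict(PLSRegression(n_components=2),
                             X[:, cols], y, cv=5).ravel()
    rmsecv = np.sqrt(np.mean((y - y_cv) ** 2))
    best = min(best, (rmsecv, k))
print(f"best RMSECV {best[0]:.3f} with the top {best[1]} variables")
```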

13.
14.
In this work, a selection of the best features for multivariate forensic glass classification using Scanning Electron Microscopy coupled with an Energy Dispersive X-ray spectrometer (SEM-EDX) has been performed. This has been motivated by the fact that the databases available for forensic glass classification are scarce nowadays, and the acquisition of SEM-EDX data is both costly and time-consuming for forensic laboratories. The database used for this work consists of 278 glass objects for which 7 variables, based on their elemental compositions obtained with SEM-EDX, are available. Two categories are considered for the classification task, namely containers and car/building windows, both of them typical in forensic casework. A multivariate model is proposed for the computation of the likelihood ratios. The feature selection process is carried out by means of an exhaustive search, with an Empirical Cross-Entropy (ECE) objective function. The ECE metric takes into account not only the discriminating power of the model in use, but also its calibration, which indicates whether or not the likelihood ratios are interpretable in a probabilistic way. Thus, the proposed model is applied to all 63 possible univariate, bivariate and trivariate combinations taken from the 7 variables in the database, and its performance is ranked by its ECE. Results show remarkable accuracy for the best variable subsets selected by the proposed procedure in the task of classifying glass fragments into windows (from cars or buildings) or containers, with high (almost perfect) discriminating power and good calibration. This allows the proposed models to be used in casework. We also present an in-depth analysis which reveals the benefits of the proposed ECE metric as an assessment tool for classification models based on likelihood ratios.
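A sketch of the ECE computation at a single prior, following the definition standard in the forensic likelihood-ratio literature; the likelihood-ratio values below are placeholders, and the paper's exact evaluation protocol (priors swept, averaging) may differ.

```python
import numpy as np

def empirical_cross_entropy(lr_same, lr_diff, prior):
    """ECE of a set of likelihood ratios at a given prior probability
    (base-2 logarithms; standard forensic definition)."""
    odds = prior / (1 - prior)
    h1 = np.mean(np.log2(1 + 1 / (np.asarray(lr_same) * odds)))
    h2 = np.mean(np.log2(1 + np.asarray(lr_diff) * odds))
    return prior * h1 + (1 - prior) * h2

# lr_same: LRs for fragments that truly belong to the hypothesized class;
# lr_diff: LRs for fragments from the other class (placeholder values)
lr_same = np.array([30.0, 8.0, 120.0, 15.0])
lr_diff = np.array([0.05, 0.4, 0.01, 0.2])
for p in (0.1, 0.5, 0.9):
    print(f"ECE at prior {p}: {empirical_cross_entropy(lr_same, lr_diff, p):.3f}")
```

Lower ECE is better: a model can only achieve it by being both discriminating and well calibrated, which is why the paper uses it to rank the 63 variable combinations.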

15.
Variable selection using a genetic algorithm is combined with partial least squares (PLS) for the prediction of additive concentrations in polymer films from Fourier transform infrared (FT-IR) spectral data. An approach using an iterative application of the genetic algorithm is proposed; it allows all variables to be considered while minimizing the risk of overfitting. We demonstrate that the variables selected by the genetic algorithm are consistent with expert knowledge, a convincing demonstration that the algorithm can select the correct variables in an automated fashion.
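The window-by-window scheme below is one common way to apply a GA iteratively over a full spectrum (cf. Leardi-style GA-PLS) so that every wavelength is considered without ever searching the whole space at once; the window count, GA settings, and synthetic spectra are assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 400))                    # stand-in FT-IR spectra
y = X[:, 50] + 0.7 * X[:, 250] + 0.1 * rng.normal(size=40)

def rmsecv(cols):
    """Cross-validated RMSE of a 2-component PLS model on the given columns."""
    pred = cross_val_predict(PLSRegression(n_components=2),
                             X[:, cols], y, cv=5).ravel()
    return np.sqrt(np.mean((y - pred) ** 2))

def ga_select(cols, n_pop=20, n_gen=15):
    """Tiny GA over one window of wavelengths; fitness is -RMSECV of PLS."""
    pop = rng.random((n_pop, cols.size)) < 0.3
    pop[:, 0] = True                                   # avoid empty masks
    for _ in range(n_gen):
        fit = np.array([-rmsecv(cols[m]) for m in pop])
        parents = pop[np.argsort(fit)[-n_pop // 2:]]
        kids = parents[rng.permutation(n_pop // 2)].copy()
        kids ^= rng.random(kids.shape) < 0.05          # mutation only, for brevity
        kids[:, 0] = True
        pop = np.vstack([parents, kids])
    return cols[pop[np.argmax([-rmsecv(cols[m]) for m in pop])]]

# Iterative application: run the GA window by window, then pool the winners
windows = np.array_split(np.arange(X.shape[1]), 4)
selected = np.concatenate([ga_select(w) for w in windows])
print(f"pooled selection: {selected.size} wavelengths, "
      f"RMSECV {rmsecv(selected):.3f}")
```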

16.
17.
18.
The goal of this paper is to present and describe a novel 2D- and 3D-QSAR (quantitative structure-activity relationship) binary classification data set for the inhibition of c-Jun N-terminal kinase-3 (JNK3), with previously unpublished activities for a diverse set of compounds. JNK3 is an important pharmaceutical target because it is involved in many neurological disorders. Accordingly, the development of JNK3 inhibitors has gained increasing interest. 2D and 3D versions of the data set were used, consisting of 313 (70 actives) and 249 (60 actives) compounds, respectively. All compounds for which activity was only determined for the racemate were removed from the 3D data set. We investigated the diversity of the data sets by agglomerative clustering with feature trees and show that the data set contains several different scaffolds. Furthermore, we show that the benchmarks can be tackled with standard supervised learning algorithms with convincing performance. For the 2D problem, a random decision forest classifier achieves a Matthews correlation coefficient of 0.744; the 3D problem could be modeled with a Matthews correlation coefficient of 0.524 using 3D pharmacophores and a support vector machine. The performance on both data sets was evaluated within a nested 10-fold cross-validation. We therefore suggest that the data set is a reasonable basis for generating QSAR models for JNK3 because of its diverse composition and the performance of the classifiers presented in this study.
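A minimal sketch of the evaluation protocol the paper describes — a random forest scored by the Matthews correlation coefficient inside nested 10-fold cross-validation. The descriptor matrix and hyperparameter grid are placeholders; only the class balance (70 actives of 313) mirrors the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, cross_val_score

# Stand-in for the 2D data set: 313 compounds, ~22% actives
X, y = make_classification(n_samples=313, n_features=100, n_informative=15,
                           weights=[0.78], random_state=0)

mcc = make_scorer(matthews_corrcoef)
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     {"n_estimators": [100, 300],
                      "max_features": ["sqrt", 0.3]},
                     scoring=mcc, cv=10)
# Nested 10-fold CV: hyperparameters are tuned inside each outer fold,
# so the outer MCC estimate stays unbiased
outer_mcc = cross_val_score(inner, X, y, scoring=mcc, cv=10)
print(f"nested-CV MCC: {outer_mcc.mean():.3f} ± {outer_mcc.std():.3f}")
```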

19.
The correlation of the only two error sources in the solution of the electronic Schrödinger equation is addressed: the basis set incompleteness error (BSIE) and the electron correlation effect. The electron correlation effect and the basis set incompleteness error are found to be correlated for all of the molecules in Grimme's "mindless" data set (MB08-165). One can use an extrapolation to the HF or MP2 complete basis set (CBS) limit to see with which type of quantum chemical problem ("simple" or "hard") the researcher is dealing. The origin of the slow convergence of the partial wave expansion can be the Kato cusp condition for electron-electron coalescence. Such an extrapolation is possible for many large molecular systems and would give the researcher an idea about the expected electron correlation level that would lead to the desired theoretical accuracy. In other words, it is possible to use not only the CBS energy value itself but also the speed with which it is reached to obtain extra information about the molecular system under study.
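A sketch of a standard two-point X⁻³ (Helgaker-type) extrapolation of correlation energies to the CBS limit; the paper's exact extrapolation scheme may differ, and the energies below are placeholders.

```python
def cbs_two_point(e_x, e_y, x, y, power=3):
    """Two-point X^-power extrapolation of correlation energies to the
    complete-basis-set limit: E_CBS = (x^p*E_x - y^p*E_y)/(x^p - y^p)."""
    return (x**power * e_x - y**power * e_y) / (x**power - y**power)

# Placeholder MP2 correlation energies (hartree) in cc-pVTZ (X=3)
# and cc-pVQZ (X=4) bases
e_tz, e_qz = -0.3501, -0.3621
e_cbs = cbs_two_point(e_tz, e_qz, 3, 4)
# The gap between the finite-basis value and the CBS limit is the
# basis set incompleteness error (BSIE)
print(f"E_corr(CBS) ≈ {e_cbs:.4f} Eh; BSIE at QZ ≈ {e_qz - e_cbs:.4f} Eh")
```

The slow, cusp-driven X⁻³ convergence of the pair energies is exactly what makes the speed of approach to the CBS limit informative about how "hard" a given molecule's correlation problem is.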

20.