Similar Literature
Found 20 similar documents; search took 218 ms
1.
A fast method that can be used to classify unknown jet fuel types or detect possible changes in jet fuel physical properties is of paramount interest to national defense and the airline industries. While fast gas chromatography (GC) has been used with conventional mass spectrometry (MS) to study jet fuels, here fast GC was combined with fast scanning MS and used, for the first time, to classify jet fuels by lot number or origin with fuzzy rule-building expert system (FuRES) classifiers. In the process of building classifiers, the data were pretreated with and without wavelet transformation and evaluated with respect to performance. Principal component transformation was used to compress the two-way data images prior to classification. Jet fuel samples were successfully classified with 99.8 ± 0.5% accuracy both with and without wavelet compression. Ten bootstrapped Latin partitions were used to validate the generalized prediction accuracy. Optimized partial least squares (o-PLS) regression results were used as positively biased references for comparing the FuRES prediction results. The prediction results for the jet fuel samples obtained with these two methods were compared statistically. The projected difference resolution (PDR) method was also used to evaluate the fast GC and fast MS data. Two batches of aliquots of ten new samples were prepared and run independently 4 days apart to evaluate the robustness of the method. The only change in classification parameters was the use of polynomial retention time alignment to correct for drift that occurred during the 4-day span of the two collections. FuRES achieved perfect classifications for four models of uncompressed three-way data. This fast GC/fast MS method furnishes characteristics of high speed, accuracy, and robustness. This mode of measurement may be useful as a monitoring tool to track changes in the chemical composition of fuels that may also lead to property changes.
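The validation scheme named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a Latin partition splits the objects into disjoint test sets that preserve class proportions, so that every object is predicted exactly once per bootstrap, and that "bootstrapped" means repeating the split with fresh random assignments. The function names and the number of partitions are hypothetical; the abstract states only that ten bootstraps were used.

```python
import random
from collections import defaultdict


def latin_partitions(labels, n_partitions, rng):
    """Split object indices into n_partitions test sets with balanced classes."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    partitions = [[] for _ in range(n_partitions)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random assignment within each class
        for j, idx in enumerate(members):
            # Deal the class members round-robin across the partitions.
            partitions[(offset + j) % n_partitions].append(idx)
        offset += len(members)
    return partitions


def bootstrapped_latin_partitions(labels, n_partitions, n_bootstraps, seed=0):
    """Yield (train_indices, test_indices) for every partition of every bootstrap."""
    rng = random.Random(seed)
    for _ in range(n_bootstraps):
        parts = latin_partitions(labels, n_partitions, rng)
        for k, test_idx in enumerate(parts):
            train_idx = [i for m, p in enumerate(parts) if m != k for i in p]
            yield sorted(train_idx), sorted(test_idx)
```

Averaging the prediction accuracy over the bootstraps gives a mean and a spread, which is one way a figure of the form 99.8 ± 0.5% could be reported.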

2.
The performance of four methods for supervised probabilistic classification (LDA, SIMCA, ALLOC and CLASSY) on three types of data sets is evaluated by means of a simulation study. The methods are also applied to some practical data sets (Iris and four data sets for wines). The evaluation criterion used for discriminatory ability is the CBS (complemented Brier score) because it has some advantages over other measures. The danger of applying resubstitution evaluation for method comparison is demonstrated, but leave-one-out evaluation is shown to perform satisfactorily. Horn's method for selecting the number of principal components in SIMCA and CLASSY models is shown to be superior to the average-eigenvalue criterion. It is concluded that CLASSY is a robust method, but that in practice all the methods investigated perform about equally well on average.

3.
A bootstrapped fuzzy rule-building expert system (FuRES) and a bootstrapped t-statistical weight feature selection method were individually used to select informative features from gas chromatography/mass spectrometry (GC/MS) chemical profiles of basil plants cultivated by organic and conventional farming practices. Feature subsets were selected from two-way GC/MS data objects, total ion chromatograms, and total mass spectra, separately. Four economic classifiers based on the bootstrapped FuRES approach, i.e., fuzzy optimal associative memory (e-FOAM), e-FuRES, partial least-squares–discriminant analysis (e-PLS-DA), and soft independent modeling by class analogy (e-SIMCA), and four economic classifiers based on the bootstrapped t-weight approach, i.e., e-PLS-DA-t, e-FOAM-t, e-FuRES-t, and e-SIMCA-t, were constructed thereafter to be compared with full-size classifiers obtained from the entire GC/MS data objects (i.e., FOAM, FuRES, PLS-DA, and SIMCA). By using three features selected from two-way data objects, the average classification rates with e-FOAM, e-FuRES, e-PLS-DA, and e-SIMCA were 95.3 ± 0.5%, 100%, 100%, and 91.8 ± 0.2%, respectively. The established economic classifiers were used to classify a new validation set collected 2.5 months later with no parametric change to experimental procedure. Classification rates with e-FOAM, e-FuRES, e-PLS-DA, and e-SIMCA were 96.7%, 100%, 100%, and 96.7%, respectively. Characteristic components in basil extracts corresponding to highest-ranked useful features were putatively identified. The feature subset may prove valuable as a rapid approach for organic basil authentication.

4.
A series of simple mathematical techniques for the evaluation of solvents and solvent combinations in thin-layer chromatography have been investigated. A strategy for the rapid selection of the optimum combination is proposed. It uses classification procedures based on calculation of the similarity between systems. The classification is carried out using a simple graph-theoretical procedure (Kruskal's algorithm) or numerical taxonomy. The selection of optimal sets from the clusters which appear in the classification is based on the information content as derived from Shannon's equation. The method has been applied to an RF data set for basic drugs. It is concluded that these methods indeed allow the selection of optimal systems or combination of systems.
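The information-content criterion can be illustrated with a short sketch of Shannon's equation, I = -Σ p_k log2 p_k, applied to RF values. The grouping step is an assumption made here for illustration: compounds whose RF values fall in the same interval of width `bin_width` are treated as indistinguishable, and p_k is the fraction of compounds in interval k. Neither the function name nor the 0.05 interval width comes from the abstract.

```python
import math
from collections import Counter


def information_content(rf_values, bin_width=0.05):
    """Shannon information content (in bits) of one TLC system.

    Compounds whose RF values share an interval of width bin_width are
    counted as one unresolved group (an illustrative assumption).
    """
    # The small offset guards against values that sit exactly on a bin edge.
    groups = Counter(int(rf / bin_width + 1e-9) for rf in rf_values)
    n = len(rf_values)
    return -sum((c / n) * math.log2(c / n) for c in groups.values())


# A system that spreads four drugs over four groups carries more
# information than one that lumps them into two groups.
well_spread = information_content([0.11, 0.31, 0.51, 0.71])
lumped = information_content([0.11, 0.12, 0.51, 0.52])
```

Under these assumptions, the system with the highest information content within each cluster would be the natural candidate to keep.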

5.
Partial Least Squares (PLS) is by far the most popular regression method for building multivariate calibration models for spectroscopic data. However, the success of the conventional PLS approach depends on the availability of a ‘representative data set’ as the model needs to be trained for all expected variation at the prediction stage. When the concentration of the known interferents and their correlation with the analyte of interest change in a fashion which is not covered in the calibration set, the predictive performance of inverse calibration approaches such as conventional PLS can deteriorate. This underscores the need for calibration methods that are capable of building multivariate calibration models which can be robustified against the unexpected variation in the concentrations and the correlations of the known interferents in the test set. Several methods incorporating ‘a priori’ information such as pure component spectra of the analyte of interest and/or the known interferents have been proposed to build more robust calibration models. In the present study, four such calibration techniques have been benchmarked on two data sets with respect to their predictive ability and robustness: Net Analyte Preprocessing (NAP), Improved Direct Calibration (IDC), Science Based Calibration (SBC) and Augmented Classical Least Squares (ACLS) Calibration. For both data sets, the alternative calibration techniques were found to give good prediction performance even when the interferent structure in the test set was different from the one in the calibration set. The best results were obtained by the ACLS model incorporating both the pure component spectra of the analyte of interest and the interferents, resulting in a reduction of the RMSEP by a factor of 3 compared to conventional PLS for the situation when the test set had a different interferent structure than the one in the calibration set.

6.
The selection of an appropriate calibration set is a critical step in multivariate method development. In this work, the effect of using different calibration sets, based on a previous classification of unknown samples, on the partial least squares (PLS) regression model performance has been discussed. As an example, attenuated total reflection (ATR) mid-infrared spectra of deep-fried vegetable oil samples from three botanical origins (olive, sunflower, and corn oil), with increasing polymerized triacylglyceride (PTG) content induced by a deep-frying process, were employed. The use of a one-class-classifier partial least squares-discriminant analysis (PLS-DA) and a rooted binary directed acyclic graph tree provided accurate oil classification. Oil samples fried without foodstuff could be classified correctly, independent of their PTG content. However, class separation of oil samples fried with foodstuff was less evident. Double-cross model validation combined with permutation testing was used to validate the obtained PLS-DA classification models, confirming the results. To discuss the usefulness of the selection of an appropriate PLS calibration set, the PTG content was determined by calculating a PLS model based on the previously selected classes. In comparison to a PLS model calculated using a pooled calibration set containing samples from all classes, the root mean square error of prediction could be improved significantly using PLS models based on the selected calibration sets using PLS-DA, ranging between 1.06 and 2.91% (w/w).

7.
8.
Five algorithms for data analysis are evaluated for their abilities to discriminate against outliers in small data sets (4–10 points). These methods included least-squares regression, the least absolute deviation method, the least median of squares method, and two techniques based on an adaptive Kalman filter. For data sets consisting of 4–9 points with one outlier, the average errors in the estimation of the slope were found to be 18.9% by least-squares, 17.7% by the least absolute deviation method, 0.5% by the least median of squares algorithm, 9.1% by an adaptive Kalman filter algorithm, and 0.9% by a zero-lag adaptive Kalman filter algorithm. Based on these results, the conclusion is that the zero-lag adaptive Kalman filter and the least median of squares approaches are best suited for the detection of outliers in small calibration data sets.
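The contrast between least-squares and least median of squares (LMS) on a small calibration set with one outlier can be sketched as below. This is an illustrative approximation, not the algorithm evaluated in the study: it searches only the lines through pairs of data points and keeps the one with the smallest median squared residual, which is feasible because the data sets are tiny. The function names and the example data are hypothetical.

```python
from itertools import combinations
from statistics import median


def ols_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def lms_line(x, y):
    """Approximate LMS fit: best line through any pair of points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, no finite slope
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        criterion = median(
            (yk - (slope * xk + intercept)) ** 2 for xk, yk in zip(x, y)
        )
        if best is None or criterion < best[0]:
            best = (criterion, slope, intercept)
    return best[1], best[2]


# Six calibration points on y = 2x + 1, the last one replaced by an outlier.
x = [0, 1, 2, 3, 4, 5]
y = [1, 3, 5, 7, 9, 30]
ols_slope, _ = ols_line(x, y)
lms_slope, lms_intercept = lms_line(x, y)
```

In this constructed example the least-squares slope is pulled well away from 2 by the single outlier, while the pairwise LMS search recovers the line through the five consistent points.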

9.
10.
Metal ions such as Co(II), Ni(II), Cu(II), Fe(III) and Cr(III), which are commonly present in electroplating baths at high concentrations, were analysed simultaneously by a spectrophotometric method modified by the inclusion of the ethylenediaminetetraacetate (EDTA) solution as a chromogenic reagent. The prediction of the metal ion concentrations was facilitated by the use of an orthogonal array design to build a calibration data set consisting of absorption spectra collected in the 370-760 nm range from solution mixtures containing the five metal ions mentioned earlier. With the aid of this data set, calibration models were built based on 10 different chemometrics methods such as classical least squares (CLS), principal component regression (PCR), partial least squares (PLS), artificial neural networks (ANN) and others. These were tested with the use of a validation data set constructed from synthetic solutions of the five metal ions. The analytical performance of these chemometrics methods was characterized by relative prediction errors and recoveries (%). On the basis of these results, the computational methods were ranked according to their performances using the multi-criteria decision making procedures preference ranking organization method for enrichment evaluation (PROMETHEE) and geometrical analysis for interactive aid (GAIA). PLS and PCR models applied to the spectral data matrix that used the first derivative pre-treatment were the preferred methods. They, together with ANN-radial basis function (RBF) and PLS, were applied for analysis of results from some typical industrial samples analysed by the EDTA-spectrophotometric method described. DPLS, DPCR and the ANN-RBF chemometrics methods performed particularly well, especially when compared with some target values provided by industry.

11.
12.
The resolution of ternary mixtures of salicylic, salicyluric and gentisic acids has been accomplished by partial least squares (PLS) and principal component regression (PCR) multivariate calibration. The total luminescence information of the compounds has been used to optimize the spectral data set to perform the calibration. A comparison between the predictive ability of the three multivariate calibration methods, PLS-1, PLS-2 and PCR, on three spectral data sets, excitation, emission and synchronous spectra, has been performed. The excitation spectrum has been the best scanning path for salicylic and salicyluric acid determinations, while the emission spectrum has been the best for the gentisic acid determination. The convenience of analysing the total luminescence spectrum information when using multivariate calibration methods on fluorescence data is demonstrated.

13.
High-throughput data have been widely used in biological and medical studies to discover gene and protein functions. Due to the high dimensionality, principal component analysis (PCA) is often used for data dimension reduction. However, when a few principal components (PCs) are selected for dimension reduction or considered for dimension determination, they are typically ranked by their variances (eigenvalues). This approach is not always effective in subsequent multivariate analysis, particularly classification. To maximize information from data with a subset of the components, we apply a different ranking criterion, the canonical variate criterion, which considers within- and between-group variance rather than the total variance used in the classical criterion. Four prevalent classification methods are considered and compared using leave-one-out cross-validation. These methods are illustrated with three real high-throughput data sets: two microarray data sets and a nuclear magnetic resonance spectra data set.
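The difference between the two ranking criteria can be sketched as follows. The sketch assumes the PC scores have already been computed, and it uses a plain ratio of between-group to within-group sum of squares per component as a stand-in for the canonical variate criterion; the exact form used in the paper is not given in the abstract. All function names and the example scores are hypothetical.

```python
def variance_ranking(scores):
    """Rank components by total variance, largest first (classical criterion)."""
    n = len(scores)
    n_comp = len(scores[0])
    variances = []
    for c in range(n_comp):
        col = [row[c] for row in scores]
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / (n - 1))
    return sorted(range(n_comp), key=lambda c: variances[c], reverse=True)


def between_within_ranking(scores, labels):
    """Rank components by between-group / within-group sum of squares."""
    n_comp = len(scores[0])
    ratios = []
    for c in range(n_comp):
        col = [row[c] for row in scores]
        grand_mean = sum(col) / len(col)
        between = within = 0.0
        for g in set(labels):
            members = [v for v, lab in zip(col, labels) if lab == g]
            g_mean = sum(members) / len(members)
            between += len(members) * (g_mean - grand_mean) ** 2
            within += sum((v - g_mean) ** 2 for v in members)
        # A component with no within-group scatter separates perfectly.
        ratios.append(between / within if within > 0 else float("inf"))
    return sorted(range(n_comp), key=lambda c: ratios[c], reverse=True)


# Component 0 has large variance but no group separation;
# component 1 has small variance but separates the two groups.
scores = [[10.0, 0.1], [-10.0, 0.2], [9.0, 1.1], [-9.0, 1.2]]
labels = [0, 0, 1, 1]
by_variance = variance_ranking(scores)
by_separation = between_within_ranking(scores, labels)
```

In this constructed example the variance criterion puts component 0 first, while the between/within criterion puts component 1 first, which is the situation the abstract describes as problematic for classification.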

14.
The development and in-house testing of a method for the detection and quantification of cocoa butter equivalents in cocoa butter and plain chocolate is described. A database consisting of the triacylglycerol profile of 74 genuine cocoa butter and 75 cocoa butter equivalent samples obtained by high-resolution capillary gas liquid chromatography was created, using a certified cocoa butter reference material (IRMM-801) for calibration purposes. Based on these data, a large number of cocoa butter/cocoa butter equivalent mixtures were arithmetically simulated. By subjecting the data set to various statistical tools, reliable models for both detection (univariate regression model) and quantification (multivariate model) were elaborated. Validation data sets consisting of a large number of samples (n = 4050 for detection, n = 1050 for quantification) were used to test the models. Excluding pure illipé fat samples from the data set, the detection limit was determined to be between 1 and 3% foreign fat in cocoa butter. Recalculated for a chocolate with a fat content of 30%, these figures are equal to 0.3-0.9% cocoa butter equivalent. For quantification, the average error for prediction was estimated to be 1.1% cocoa butter equivalent in cocoa butter, without prior knowledge of the materials used in the blend, corresponding to 0.3% in chocolate (fat content 30%). The advantage of the approach is that by using IRMM-801 for calibration, the established mathematical decision rules can be transferred to every testing laboratory.
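The recalculation from a fat basis to a chocolate basis is a simple scaling by the fat fraction, which the following sketch makes explicit. The function name is hypothetical; the 30% fat content and the input figures are the ones quoted in the abstract.

```python
def percent_in_chocolate(percent_in_fat, fat_fraction=0.30):
    """Convert a percentage expressed on the fat phase to a chocolate basis."""
    return percent_in_fat * fat_fraction


# Detection limit of 1-3% in cocoa butter, chocolate with 30% fat.
low = percent_in_chocolate(1.0)       # about 0.3%
high = percent_in_chocolate(3.0)      # about 0.9%

# Average prediction error of 1.1% in cocoa butter.
error = percent_in_chocolate(1.1)     # about 0.33%, reported as 0.3%
```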

15.
Multivariate classification methods are needed to assist in extracting information from analytical data. The most appropriate method for each problem must be chosen. The applicability of a method mainly depends on the distributional characteristics of the data population (normality, correlations between variables, separation of classes, nature of variables) and on the characteristics of the data sample available (numbers of objects, variables and classes, missing values, measurement errors). The CLAS program is designed to combine classification methods with evaluation of their performance, for batch data processing. It incorporates two-group linear discriminant analysis (SLDA), independent class modelling with principal components (SIMCA), kernel density estimation (ALLOC), and principal component class modelling with kernel density estimation (CLASSY). Most of these methods are implemented so as to give probabilistic classifications. Multiple linear regression is provided for, and other methods are scheduled. CLAS evaluates the classification method using the training set data (resubstitution), independent test data, and pseudo test data (leave-one-out method). This last method is optimized for faster computation. Criteria for classification performance and reliability of the given probabilities, etc. are determined. The package contains flexible possibilities for data manipulation, variable transformation and missing data handling.
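The difference between resubstitution and leave-one-out evaluation can be sketched with a deliberately simple classifier. The example uses a 1-nearest-neighbour rule as a stand-in; it is not one of the methods in CLAS, and it is chosen only because it makes the optimism of resubstitution easy to see. The function names and data are hypothetical.

```python
def nearest_label(x, X, y, skip=None):
    """Label of the training object closest to x, optionally skipping one index."""
    best = None
    for i, (xi, yi) in enumerate(zip(X, y)):
        if i == skip:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        if best is None or dist < best[0]:
            best = (dist, yi)
    return best[1]


def resubstitution_accuracy(X, y):
    """Classify every training object with a model that has seen it."""
    hits = sum(nearest_label(x, X, y) == label for x, label in zip(X, y))
    return hits / len(y)


def leave_one_out_accuracy(X, y):
    """Classify every object with a model built from all the other objects."""
    hits = sum(
        nearest_label(x, X, y, skip=i) == label
        for i, (x, label) in enumerate(zip(X, y))
    )
    return hits / len(y)


# Two fully interleaved classes: there is nothing to learn here.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 1, 0, 1]
resub = resubstitution_accuracy(X, y)
loo = leave_one_out_accuracy(X, y)
```

On these interleaved data resubstitution reports perfect accuracy because each object is its own nearest neighbour, whereas leave-one-out exposes that the rule does not generalise.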

16.
We describe the application of ensemble methods to binary classification problems on two pharmaceutical compound data sets. Several variants of single and ensemble models of k-nearest neighbors classifiers, support vector machines (SVMs), and single ridge regression models are compared. All methods exhibit robust classification even when more features are given than observations. On two data sets dealing with specific properties of drug-like substances (cytochrome P450 inhibition and "Frequent Hitters", i.e., unspecific protein inhibition), we achieve classification rates above 90%. We are able to reduce the cross-validated misclassification rate for the Frequent Hitters problem by a factor of 2 compared to previous results obtained for the same data set with different modeling techniques.

17.
18.
In atomic absorption spectrometric measurements, calibration lines are measured daily. These lines are not always acceptable. They can, for instance, contain outliers, have poor precision or be curved. To evaluate the quality of those lines, a method which gives a fast diagnosis is recommended. In this study the use of Kohonen neural networks was examined as an automated procedure to classify these calibration lines. The results were compared with those obtained using a decision support system which uses classical statistical methods to classify the lines. The prediction capabilities of both approaches relative to a visual inspection and classification were found to be comparable, or even slightly better for the Kohonen networks, depending on the training set used. For both techniques a prediction error rate of <10% was obtained, relative to a visual classification.

19.
20.
In this paper, we study the classification of unbalanced data sets of drugs. As an example we chose a data set of 2D6 inhibitors of cytochrome P450. The human cytochrome P450 2D6 isoform plays a key role in the metabolism of many drugs in the preclinical drug discovery process. We have collected a data set from annotated public data and calculated physicochemical properties with chemoinformatics methods. On top of these data, we have built classifiers based on machine learning methods. Data sets with different class distributions lead to the effect that conventional machine learning methods are biased toward the larger class. To overcome this problem and to obtain sensitive but also accurate classifiers we combine machine learning and feature selection methods with techniques addressing the problem of unbalanced classification, such as oversampling and threshold moving. We have used our own implementation of a support vector machine algorithm as well as the maximum entropy method. Our feature selection is based on the unsupervised McCabe method. The classification results from our test set are compared structurally with compounds from the training set. We show that the applied algorithms enable the effective high throughput in silico classification of potential drug candidates.
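The two imbalance techniques named in this abstract, oversampling and threshold moving, can be sketched generically as below. This is an illustration under stated assumptions, not the authors' implementation: oversampling is taken to mean random duplication of minority-class objects until the classes are equal in size, and threshold moving is taken to mean shifting the cut-off applied to a classifier's decision score. The function names, scores and thresholds are hypothetical.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority objects until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def classify(scores, threshold=0.5):
    """Assign class 1 (e.g. inhibitor) when the decision score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Eight non-inhibitors and two inhibitors (hypothetical one-feature data).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [2.0], [2.2]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)

# Lowering the threshold trades specificity for sensitivity.
scores = [0.05, 0.20, 0.35, 0.45, 0.60]
default_calls = classify(scores)            # threshold 0.5
moved_calls = classify(scores, 0.3)         # threshold moved toward the minority class
```

Oversampling changes the training data, while threshold moving changes only how the trained model's output is interpreted, so the two can be combined.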


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  ICP filing no. 京ICP备09084417号