Similar Literature
20 similar records retrieved
1.
In multivariate regression and classification problems, variable selection is an important procedure used to select an optimal subset of variables with the aim of producing more parsimonious and, ultimately, more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and high-dimensional analytical procedures. In this paper a novel method for variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is here calculated for two specific sets of variables, the first comprising the independent variables and the second the unfolded class matrix. The CMC values, calculated one variable at a time, can be sorted to give a ranking of the variables according to their class-discrimination capability. Alternatively, the CMC index can be calculated for all possible combinations of variables and the subset with the maximal CMC selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. A variable forward selection based on the CMC index was finally used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets, and the results obtained were encouraging.
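
The abstract does not give the CMC formula, but for a single variable the canonical correlation with the unfolded class matrix reduces to the multiple correlation of that variable regressed on the class-indicator columns. A minimal Python sketch of the ranking-plus-forward-selection idea, using scikit-learn and a public data set in place of the chemical data (all names and thresholds are illustrative, not the authors' implementation):

    import numpy as np
    from sklearn.datasets import load_wine
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_wine(return_X_y=True)
    Y = (y.reshape(-1, 1) == np.unique(y)).astype(float)   # unfolded (one-hot) class matrix

    # Per-variable score: canonical correlation of one variable with the class matrix,
    # computed here as the square root of the R2 of that variable regressed on Y.
    score = np.array([np.sqrt(max(LinearRegression().fit(Y, X[:, j]).score(Y, X[:, j]), 0.0))
                      for j in range(X.shape[1])])
    ranking = np.argsort(score)[::-1]                       # best discriminators first

    # Forward selection: add ranked variables while cross-validated LDA accuracy improves.
    selected, best = [], 0.0
    for j in ranking:
        acc = cross_val_score(LinearDiscriminantAnalysis(), X[:, selected + [j]], y, cv=5).mean()
        if acc > best:
            selected, best = selected + [j], acc
    print("selected variables:", selected, " CV accuracy:", round(best, 3))

The exhaustive-combination variant mentioned in the abstract would simply score every subset with the same function, at a far higher computational cost.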

2.
Javier Galbán, Carlos Ubide. Talanta, 2007, 71(3): 1339-1344
The quantification step is an important source of uncertainty in analytical methods, but it is frequently misunderstood and disregarded. In this paper it is shown how this uncertainty is closely related to the linear response range of a method and to the Pearson correlation coefficient of the calibration line; consequently, imposing a pre-fixed quantification uncertainty constrains the linear response range. Practical cases are given showing the significance of the quantification uncertainty. The theoretical equation giving the value of the quantification uncertainty is deduced, from which new conclusions can be drawn. The quantification uncertainty can thus be calculated easily, and the parameters that actually affect its value are identified throughout the paper. Some final considerations about detection limits and two-point calibration lines are also given. The paper can also be read as a reflection on the uncertainty due to calibration and on its consequences for analytical methodology.
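
The paper derives its own expression; as a point of reference, the classical formula for the standard uncertainty of a concentration interpolated from a straight-line calibration is s(x0) = (s(y/x)/b)·sqrt(1/m + 1/n + (y0 − ȳ)² / (b²·Sxx)). A short Python sketch with illustrative numbers (not data from the paper):

    import numpy as np

    def calibration_uncertainty(x, y, y0, m=1):
        """Concentration interpolated from a straight-line calibration and its standard
        uncertainty (classical expression), for the mean of m replicate readings y0."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        b, a = np.polyfit(x, y, 1)                                 # slope, intercept
        s_yx = np.sqrt(np.sum((y - (a + b * x))**2) / (n - 2))     # residual standard deviation
        sxx = np.sum((x - x.mean())**2)
        x0 = (y0 - a) / b
        s_x0 = (s_yx / abs(b)) * np.sqrt(1/m + 1/n + (y0 - y.mean())**2 / (b**2 * sxx))
        return x0, s_x0

    # Six-point calibration with invented numbers; the relative uncertainty s_x0/x0
    # grows rapidly towards the low end of the linear range.
    conc = [0, 2, 4, 6, 8, 10]
    signal = [0.02, 0.21, 0.39, 0.62, 0.80, 1.01]
    print(calibration_uncertainty(conc, signal, y0=0.50))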

3.
This paper proposes a new method for calibration transfer, specifically designed to work with isolated variables rather than the full spectrum or spectral windows. For this purpose, a univariate procedure is first employed to correct the spectral measurements of the secondary instrument, given a set of transfer samples. A robust regression technique is then used to obtain a model with low sensitivity to the univariate correction residuals. The proposed method is employed in two case studies involving near-infrared spectrometric determination of specific mass, research octane number and naphthenes in gasoline, and of moisture and oil in corn. In both cases, better calibration transfer results were obtained in comparison with piecewise direct standardization (PDS). The proposed method should be of particular value for application-targeted instruments that monitor only a small set of spectral variables.
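
The exact correction and the robust regressor used by the authors are not specified in the abstract; the sketch below illustrates the two stages on synthetic data, with an ordinary per-variable least-squares correction and scikit-learn's HuberRegressor standing in for the robust calibration step:

    import numpy as np
    from sklearn.linear_model import HuberRegressor, LinearRegression

    rng = np.random.default_rng(0)

    # Synthetic example: 5 isolated spectral variables, primary (X1) and secondary (X2)
    # instruments, 30 calibration samples and 10 transfer samples measured on both.
    n_cal, n_trans, n_var = 30, 10, 5
    X1_cal = rng.normal(size=(n_cal, n_var))
    y_cal = X1_cal @ rng.normal(size=n_var) + 0.05 * rng.normal(size=n_cal)
    gain, offset = rng.uniform(0.8, 1.2, n_var), rng.uniform(-0.1, 0.1, n_var)
    X1_trans = rng.normal(size=(n_trans, n_var))
    X2_trans = X1_trans * gain + offset + 0.02 * rng.normal(size=(n_trans, n_var))

    # Step 1: univariate correction, one regression per variable (secondary -> primary).
    corr = [LinearRegression().fit(X2_trans[:, [j]], X1_trans[:, j]) for j in range(n_var)]

    def to_primary(X2):
        return np.column_stack([corr[j].predict(X2[:, [j]]) for j in range(X2.shape[1])])

    # Step 2: robust calibration model, less sensitive to residuals of the correction.
    model = HuberRegressor().fit(X1_cal, y_cal)

    # Predict new samples measured on the secondary instrument.
    X2_new = rng.normal(size=(3, n_var)) * gain + offset
    print(model.predict(to_primary(X2_new)))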

4.
Variable selection using a genetic algorithm is combined with partial least squares (PLS) for the prediction of additive concentrations in polymer films from Fourier transform-infrared (FT-IR) spectral data. An approach using an iterative application of the genetic algorithm is proposed. This approach allows all variables to be considered while minimizing the risk of overfitting. We demonstrate that the variables selected by the genetic algorithm are consistent with expert knowledge, which is convincing evidence that the algorithm can select the correct variables in an automated fashion.
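
The iterative scheme of the paper is more elaborate; the following is a toy, single-run sketch of genetic-algorithm variable selection with a cross-validated PLS fitness, with all population sizes and rates chosen arbitrarily:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)

    def ga_pls(X, y, pop_size=30, n_gen=40, p_mut=0.02, n_comp=3):
        """Toy GA variable selection for PLS: chromosomes are binary masks over the
        variables, fitness is the cross-validated R2 of a PLS model on the masked data."""
        n_var = X.shape[1]
        pop = rng.random((pop_size, n_var)) < 0.3            # start with ~30% of variables on

        def fitness(mask):
            if mask.sum() < n_comp:                          # PLS needs at least n_comp variables
                return -np.inf
            return cross_val_score(PLSRegression(n_components=n_comp), X[:, mask], y,
                                   cv=5, scoring="r2").mean()

        for _ in range(n_gen):
            fit = np.array([fitness(m) for m in pop])
            parents = pop[np.argsort(fit)[::-1][: pop_size // 2]]   # keep the better half
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = rng.integers(1, n_var)                 # one-point crossover
                child = np.concatenate([a[:cut], b[cut:]])
                child ^= rng.random(n_var) < p_mut           # bit-flip mutation
                children.append(child)
            pop = np.vstack([parents] + children)
        fit = np.array([fitness(m) for m in pop])
        return pop[np.argmax(fit)], fit.max()

    # Usage (X: FT-IR absorbances, y: additive concentration -- placeholders):
    # mask, cv_r2 = ga_pls(X, y); selected_wavenumbers = np.flatnonzero(mask)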

5.
This work presents a method for the discrimination of gas mixtures. The principal concept is to use, as input to the discrimination models, measurement data provided by a combination of sensors at single time points of their temporal response. Pattern data combinations are selected for the classes of target gases based on the criterion of 100% discrimination efficiency. Combinations of sensors and time points that provide such pattern data in the course of repeated measurements are encoded in the form of addresses. The designer of the sensor system is responsible for their selection, and they are included in the software of the final instrument. The study of the method involved the discrimination of gas mixtures composed of air and a single chemical: hexane, ethanol, acetone, ethyl acetate or toluene. Two sensor arrays were used, each consisting of six TGS sensors of the same type. Dynamic operation of the sensors was employed, with the stop-flow mode chosen as an example. The work provides evidence of the existence of sensor combinations and time points that successfully discriminate the studied classes of target gases. The persistence of the addresses was discussed with respect to the ability of the sensor array to recognize analytes, the variability of repeated measurement results, the number of repeated measurements and a twin set of sensors. Altogether, the validity of the method was demonstrated.

6.
The correlation coefficient is commonly used to evaluate the degree of linear association between two variables. However, a correlation coefficient very close to one can also be obtained for a clearly curved relationship. Other statistical tests, such as the lack-of-fit test and Mandel's fitting test, therefore appear more suitable for the validation of the linear calibration model. A number of cadmium calibration curves from atomic absorption spectroscopy were assessed for their linearity. All the investigated calibration curves were characterized by a high correlation coefficient (r > 0.997) and a low quality coefficient (QC < 5%), yet the straight-line model was systematically rejected at the 95% confidence level on the basis of the lack-of-fit and Mandel's fitting tests. Furthermore, significantly different results were obtained with a linear regression model (LRM) and a quadratic regression model (QRM) when predicting values for mid-scale calibration standards. The results obtained with the QRM did not differ significantly from the theoretically expected value, while those obtained with the LRM were systematically biased. It was concluded that a straight-line model with a high correlation coefficient, but with a lack of fit, yields significantly less accurate results than its curvilinear alternative.
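
Mandel's fitting test compares the residual variance of the straight-line fit with that of a quadratic fit through an F-test on one degree of freedom. A minimal Python sketch with invented, slightly curved calibration data (the printed r is high, yet the straight line is rejected):

    import numpy as np
    from scipy import stats

    def mandel_test(x, y, alpha=0.05):
        """Mandel's fitting test: does a quadratic model fit significantly better
        than the straight line?"""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        res_lin = y - np.polyval(np.polyfit(x, y, 1), x)
        res_quad = y - np.polyval(np.polyfit(x, y, 2), x)
        s2_lin = np.sum(res_lin**2) / (n - 2)
        s2_quad = np.sum(res_quad**2) / (n - 3)
        ds2 = (n - 2) * s2_lin - (n - 3) * s2_quad     # variance explained by the x^2 term
        tv = ds2 / s2_quad
        f_crit = stats.f.ppf(1 - alpha, 1, n - 3)
        return tv, f_crit, "quadratic fits significantly better" if tv > f_crit else "straight line acceptable"

    conc = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
    absorb = np.array([0.000, 0.120, 0.235, 0.345, 0.449, 0.551, 0.645, 0.734, 0.818])
    print("r =", np.corrcoef(conc, absorb)[0, 1])
    print(mandel_test(conc, absorb))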

7.
8.
A novel strategy of data analysis for artificial taste and odour systems is presented in this work. It is demonstrated that using a supervised method also in the feature-extraction phase enhances the fruit-juice classification capability of the sensor array developed at Warsaw University of Technology. A comparison of direct processing (raw data processed by an Artificial Neural Network (ANN); raw data processed by Partial Least Squares-Discriminant Analysis (PLS-DA)) and two-stage processing (Principal Component Analysis (PCA) outputs processed by an ANN; PLS-DA outputs processed by an ANN) is presented. A considerable increase in classification capability is shown for the new method proposed by the authors.

9.
Li-Juan Tang, Hai-Long Wu. Talanta, 2009, 79(2): 260-1694
One problem with discriminant analysis of microarray data is the representation of each sample by a large number of genes that are possibly irrelevant, insignificant or redundant. Methods of variable selection are therefore of great significance in microarray data analysis. To circumvent the problem, a new gene-mining approach is proposed, based on the similarity between the probability density functions of each gene for the class of interest and for the remaining classes. This method allows the identification of significant genes that are informative for discriminating each individual class, rather than maximizing the separability of all classes, so that genes containing important information about particular disease subtypes can be selected. Based on the significant genes mined for the individual classes, a support vector machine with a local kernel transform is constructed for the classification of different diseases. The combination of the gene-mining approach with the support vector machine is demonstrated for cancer classification using two public data sets. The results reveal that significant genes are identified for each cancer, and the classification model shows satisfactory performance in training and prediction for both data sets.
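
The paper's similarity measure and local kernel transform are not reproduced here; the sketch below only conveys the one-vs-rest idea, with a kernel-density overlap as the per-gene score and a standard RBF support vector machine, all details being illustrative assumptions:

    import numpy as np
    from scipy.stats import gaussian_kde
    from sklearn.svm import SVC

    def density_overlap(a, b, grid_size=200):
        """Overlap between the estimated expression densities of one gene in the class of
        interest (a) and in the remaining samples (b); small overlap marks an informative gene."""
        grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), grid_size)
        pa, pb = gaussian_kde(a)(grid), gaussian_kde(b)(grid)
        return np.minimum(pa, pb).sum() * (grid[1] - grid[0])

    def mine_genes(X, y, target_class, n_genes=50):
        """Rank genes for one class against the rest by density dissimilarity."""
        in_cls = y == target_class
        overlap = np.array([density_overlap(X[in_cls, j], X[~in_cls, j]) for j in range(X.shape[1])])
        return np.argsort(overlap)[:n_genes]                # genes with the least class overlap

    # X: samples x genes expression matrix, y: class labels (placeholders)
    # genes = np.unique(np.concatenate([mine_genes(X, y, c) for c in np.unique(y)]))
    # clf = SVC(kernel="rbf", gamma="scale").fit(X[:, genes], y)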

10.
Du W, Gu T, Tang LJ, Jiang JH, Wu HL, Shen GL, Yu RQ. Talanta, 2011, 85(3): 1689-1694
As a greedy search algorithm, the classification and regression tree (CART) easily falls into overfitting when modeling microarray gene expression data. A straightforward solution is to filter out irrelevant genes by identifying significant ones. Because significant genes with multi-modal expression patterns, which exhibit systematic differences among within-class samples, are difficult to identify with existing methods, a strategy of unimodal transform of variables selected by interval segmentation purity (UTISP) is proposed for CART modeling. First, significant genes exhibiting varied expression patterns are identified by a variable selection method based on interval segmentation purity. Then, a unimodal transform is applied to provide unimodal-featured variables for CART modeling via feature extraction. Because significant genes with complex expression patterns can be properly identified and unimodal features extracted in advance, the strategy potentially improves the performance of CART in combating overfitting or underfitting when modeling microarray data. The strategy is demonstrated using two microarray data sets. The results reveal that UTISP-based CART outperforms k-nearest neighbours and CARTs coupled with other gene-identification strategies, indicating that UTISP-based CART holds great promise for microarray data analysis.

11.
The possibility offered by chemometrics of extracting and combining (fusing) the information contained in NIR and MIR spectra in order to discriminate monovarietal extra virgin olive oils according to olive cultivar (Casaliva, Leccino, Frantoio) was investigated. Linear discriminant analysis (LDA) was applied as a classification technique to these multivariate, non-specific spectral data both separately and jointly (NIR and MIR data together). In order to ensure a more appropriate ratio between the number of objects (samples) and the number of variables (absorbances at different wavenumbers), LDA was preceded by either feature selection or variable compression. For feature selection the SELECT algorithm was used, while a wavelet transform was applied for data compression. Correct classification rates obtained by cross-validation varied between 60% and 90%, depending on the procedure followed. The most accurate results were obtained using the fused NIR and MIR data, with either feature selection or data compression. Chemometric strategies applied to fused NIR and MIR spectra thus represent an effective method for classifying extra virgin olive oils on the basis of olive cultivar.

12.
Many analytical approaches, such as mass spectrometry, generate large amounts of data (input variables) per sample analysed, and not all of these variables are important or related to the target output of interest. The selection of a smaller number of variables prior to sample classification is a widespread task in many research studies, which seek the smallest possible set of variables that still achieves a high level of prediction accuracy; in other words, there is a need to generate the most parsimonious solution when the number of input variables is huge but the number of samples/objects is much smaller. Here, we compare several variable selection approaches in order to ascertain which are best suited to achieving this goal. All approaches were applied to the analysis of a common set of metabolomics data generated by Curie-point pyrolysis mass spectrometry (Py-MS), where the goal of the study was to classify Gram-positive bacteria of the genus Bacillus. The approaches include stepwise forward variable selection, used with linear discriminant analysis (LDA); the variable importance in projection (VIP) coefficient, employed in partial least squares-discriminant analysis (PLS-DA); support vector machine-recursive feature elimination (SVM-RFE); and the mean decrease in accuracy and mean decrease in Gini provided by random forests (RF). Finally, a double cross-validation procedure was applied to minimize the consequences of overfitting. The results revealed that RF with its own variable selection techniques, and SVM combined with SVM-RFE for variable selection, gave the best results in comparison with the other approaches.
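
As one example of the compared approaches, SVM-RFE wrapped in a double (nested) cross-validation can be set up as follows; the grid of retained-variable counts is arbitrary, and X and y stand in for the Py-MS data matrix and class labels:

    from sklearn.feature_selection import RFE
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    # Inner loop: tune the number of variables retained by SVM-RFE.
    pipe = Pipeline([
        ("rfe", RFE(SVC(kernel="linear", C=1.0), step=0.1)),
        ("svm", SVC(kernel="linear", C=1.0)),
    ])
    inner = GridSearchCV(pipe, {"rfe__n_features_to_select": [10, 25, 50, 100]}, cv=5)

    # Outer loop: estimate accuracy, so variable selection never sees the test fold.
    # X: samples x Py-MS variables, y: Bacillus class labels (placeholders)
    # outer_acc = cross_val_score(inner, X, y, cv=5)
    # print(outer_acc.mean(), outer_acc.std())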

13.
Intelligent and automatic systems based on arrays of non-specific-response chemical sensors were recently developed in our laboratory. For multidetermination applications, the usual choice is an array of potentiometric sensors to generate the signal and a properly trained artificial neural network (ANN) to obtain the calibration model. As a large amount of information is required for proper modelling, we proposed its automated generation using the sequential injection analysis (SIA) technique. The first signals used were steady-state: the equilibrium signal after a step change in concentration. We have now adapted our procedures to record the transient response corresponding to a sample step. The novelty of this approach is therefore the use of the dynamic components of the signal in order to better discriminate or differentiate a sample. In the developed electronic tongue systems, detection is carried out by a sensor array formed by five potentiometric sensors based on PVC membranes; for the developed application we employed two different chloride-selective sensors, two nitrate-selective sensors and one generic-response sensor. As the amount of raw data (the five recordings corresponding to the five sensors) is excessive for an ANN, a feature-extraction step prior to modelling was needed. In order to attain substantial data reduction and noise filtering, the data obtained were fitted with orthonormal Legendre polynomials; a third-degree Legendre polynomial was shown to be sufficient. The coefficients of these polynomials were the input information fed into the ANN used to model the concentrations of the determined species (Cl− and NO3−). Best results were obtained with a backpropagation neural network trained with the Bayesian regularisation algorithm; the net had a single hidden layer containing three neurons with the tansig transfer function. The results obtained from the time-dependent response were compared with those obtained under steady-state conditions, the former showing superior performance. Finally, the method was applied to the determination of anions in synthetic samples and real water samples, where satisfactory agreement was also achieved.
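
A sketch of the feature-extraction step: each recorded transient is fitted with a third-degree Legendre polynomial and the coefficients are passed to a small neural network. Scikit-learn's MLPRegressor with plain backpropagation is used here, since Bayesian regularisation is not available in that library; the array shapes and names are assumptions:

    import numpy as np
    from numpy.polynomial import legendre
    from sklearn.neural_network import MLPRegressor

    def legendre_features(transients, degree=3):
        """Fit each sensor transient with a Legendre polynomial of the given degree
        and return the coefficients as features (data reduction plus noise filtering)."""
        n_samples, n_sensors, n_points = transients.shape
        t = np.linspace(-1, 1, n_points)                 # Legendre polynomials are defined on [-1, 1]
        feats = np.empty((n_samples, n_sensors * (degree + 1)))
        for i in range(n_samples):
            coefs = [legendre.legfit(t, transients[i, s], degree) for s in range(n_sensors)]
            feats[i] = np.concatenate(coefs)
        return feats

    # transients: (samples, 5 sensors, recorded points); conc: (samples, n_analytes) -- placeholders
    # X = legendre_features(transients, degree=3)
    # ann = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh", max_iter=5000).fit(X, conc)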

14.
Coffee samples were analyzed by GC/MS in order to determine the most important peaks for the discrimination of the varieties Arabica and Robusta. The peak tables resulting from the chromatographic analysis were aligned and pretreated before being submitted to multivariate analysis. A rapid and easy-to-perform peak alignment procedure, which does not require advanced programming skills, was compared with the tedious manual alignment procedure. The influence of three types of data pretreatment (normalization, logarithmic and square-root transformations) and their combinations on the variables selected as most important by the regression coefficients of partial least squares-discriminant analysis (PLS-DA) is shown. Test samples different from those used in calibration, together with comparison against the substances already known to be responsible for the discrimination of Arabica and Robusta coffees, were used to determine the best pretreatments for both data sets. The pretreatment consisting of square-root transformation followed by normalization (RN) was chosen as the most appropriate. The results showed that the much quicker automated alignment method can substitute for the manual alignment method, allowing all peaks in the chromatogram to be used for multivariate analysis.
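
The RN pretreatment and the PLS-DA coefficient ranking can be sketched as follows, where X and y are placeholders for the aligned peak table and the variety labels:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def pretreat_rn(peak_table):
        """RN pretreatment: square-root transform of the peak areas followed by
        normalisation of each sample (row) to unit total area."""
        Z = np.sqrt(np.asarray(peak_table, float))
        return Z / Z.sum(axis=1, keepdims=True)

    # X: aligned peak table (samples x peaks), y: 0 = Arabica, 1 = Robusta (placeholders)
    # pls = PLSRegression(n_components=2).fit(pretreat_rn(X), y)
    # importance = np.abs(pls.coef_).ravel()        # regression coefficients used to rank the peaks
    # top_peaks = np.argsort(importance)[::-1][:20]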

15.
A DNA microarray can track the expression levels of thousands of genes simultaneously, and previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples, each described by a large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that the combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper to discuss both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

16.
A gene regulatory network (GRN) is a large and complex network consisting of interacting elements that, over time, affect each other's state. The dynamics of complex gene regulatory processes are difficult to understand using intuitive approaches alone. To overcome this problem, we propose an algorithm for inferring regulatory interactions from knock-out data using a Gaussian model combined with the Pearson correlation coefficient (PCC). Several problems relating to GRN construction are outlined in this paper. We demonstrate the ability of the proposed method to predict (1) the presence of regulatory interactions between genes, (2) their directionality and (3) their states (activation or suppression). The algorithm was applied to networks of 10 and 50 genes from the DREAM3 data sets and to networks of 10 genes from the DREAM4 data sets, and the predicted networks were evaluated by AUROC and AUPR. We found that our GRN prediction methods generated high false-positive rates because indirect regulations were wrongly predicted as true relationships. Nevertheless, satisfactory results were achieved, as the majority of sub-networks reached AUROC values above 0.5.
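
The sketch below covers only the correlation-scoring and evaluation parts (the Gaussian model is omitted); because the Pearson correlation is symmetric and also picks up indirect regulations, such a baseline illustrates where the false positives mentioned above come from:

    import numpy as np
    from sklearn.metrics import average_precision_score, roc_auc_score

    def pcc_edge_scores(expr):
        """Score every candidate edge by the absolute Pearson correlation between the
        expression profiles of the two genes (rows of expr = knock-out experiments)."""
        corr = np.corrcoef(expr.T)                  # genes x genes correlation matrix
        np.fill_diagonal(corr, 0.0)
        return np.abs(corr)

    def evaluate(scores, gold):
        """AUROC and AUPR of the predicted network against a 0/1 gold-standard adjacency matrix."""
        mask = ~np.eye(gold.shape[0], dtype=bool)   # ignore self-edges
        return (roc_auc_score(gold[mask], scores[mask]),
                average_precision_score(gold[mask], scores[mask]))

    # expr: knock-out expression matrix (experiments x genes); gold: adjacency matrix (placeholders)
    # print(evaluate(pcc_edge_scores(expr), gold))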

17.
A digital chromatogram simulator has been used to evaluate the performance of data handling systems. The simulator can synthesize chromatograms of any desired shape from a library of peaks based on real peak shapes, and once generated, the chromatograms can be reproduced exactly any number of times. The algorithms used in data handling systems have a significant effect on the results they produce. An objective of this work was to examine how well the true size of a peak was reported when it was detected overlapping with other peaks. A series of experiments was therefore performed in which peaks were overlapped by different amounts and in different situations. The results reported by the data systems were compared with those that would have been reported had the peaks been fully resolved. A significant difference was found between the overlapped and fully resolved results, and this difference did not vary linearly with the parameters tested. The chromatogram simulator was found to be a powerful tool for assessing the performance of data systems and offers a valuable means of evaluating the quality of the results they produce.
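
A minimal illustration of the effect being measured: two synthetic Gaussian peaks are overlapped, the smaller one is integrated by a simple perpendicular drop at the valley, and the reported area is compared with the true value (all peak parameters are made up, not taken from the paper):

    import numpy as np

    def gaussian_peak(t, center, height, sigma):
        return height * np.exp(-0.5 * ((t - center) / sigma) ** 2)

    t = np.linspace(0, 10, 5000)
    dt = t[1] - t[0]
    chrom = (gaussian_peak(t, 4.0, 1.0, 0.15) + gaussian_peak(t, 4.5, 0.4, 0.15)
             + np.random.default_rng(0).normal(0, 0.002, t.size))

    true_area = 0.4 * 0.15 * np.sqrt(2 * np.pi)      # exact area of the second (smaller) peak

    # Perpendicular-drop integration at the valley, as a simple data system would report it:
    window = (t > 4.0) & (t < 4.5)
    valley_t = t[window][np.argmin(chrom[window])]
    reported_area = np.sum(chrom[t >= valley_t]) * dt

    print(f"true: {true_area:.3f}  reported: {reported_area:.3f}  "
          f"error: {100 * (reported_area / true_area - 1):+.1f}%")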

18.
19.
20.
This work compares the dissociation constants obtained from the electrophoretic mobilities of a series of quinolones at different pH values with those obtained from the absorbance spectra recorded at the maxima of the electrophoretic peaks. The results show that the two methods are complementary and constitute a valuable means of obtaining better precision. The two proposed methods can be used simultaneously without increasing the experimental time and allow the results to be confirmed.
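
For the mobility route, the pKa of a monoprotic acid can be obtained by fitting the effective mobility versus pH to a single-site ionisation model; the sketch below uses invented numbers, and real quinolones are amphoteric, so the actual fitting model is more involved:

    import numpy as np
    from scipy.optimize import curve_fit

    def mobility_model(pH, mu_anion, pKa):
        """Effective mobility of a monoprotic acid: the fully ionised mobility
        weighted by the ionised fraction at each pH."""
        return mu_anion / (1.0 + 10.0 ** (pKa - pH))

    # Buffer pH values and measured effective mobilities (invented numbers, 1e-9 m2 V-1 s-1)
    pH = np.array([4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0])
    mu = np.array([-0.3, -2.5, -6.5, -13.4, -20.6, -24.6, -26.8, -27.0])

    (mu_anion, pKa), _ = curve_fit(mobility_model, pH, mu, p0=[-27.0, 6.0])
    print(f"limiting mobility = {mu_anion:.1f}, pKa = {pKa:.2f}")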
