1.

Sorting variables by using informative vectors as a strategy for feature selection in multivariate regression

Reinaldo F. Teófilo João Paulo A. Martins Márcia M. C. Ferreira 《Journal of Chemometrics》2009,23(1):32-48

A new procedure is presented that enhances the prediction ability of multivariate calibration models while using a small number of interpretable variables. The core of this methodology is to sort the variables according to an informative vector, followed by a systematic investigation of PLS regression models with the aim of finding the most relevant set of variables by comparing the cross‐validation parameters of the models obtained. In this work, seven main informative vectors, i.e. regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc) and signal‐to‐noise ratios vector (StN), and their combinations were automated and tested for feature selection. Six data sets from different sources were employed to validate this methodology. They originated from near‐infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure‐activity relationships (QSAR) and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, in most of the tested data sets, the regression and NAS informative vectors from partial least squares (PLS) regression, both built using more latent variables than were used when building the model, were the best informative vectors for variable selection. In all the applications, the selected variables were quite effective and useful for interpretation. Copyright © 2008 John Wiley & Sons, Ltd.
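The sorting step described in this abstract can be illustrated with a minimal sketch. It uses only the correlation vector (one of the seven informative vectors listed) and builds the nested candidate subsets that would then be compared by cross‐validated PLS models. The function names `pearson`, `sort_variables` and `nested_subsets` are hypothetical, and the PLS and cross‐validation stage is deliberately left out, so this is not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sort_variables(X, y):
    """Return column indices of X sorted by decreasing |correlation| with y.

    The absolute correlation plays the role of the informative vector here
    (an assumption for illustration; any other vector could be substituted).
    """
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


def nested_subsets(order):
    """Candidate variable sets: the top 1, top 2, ... variables of the ranking."""
    return [order[:k] for k in range(1, len(order) + 1)]


# Toy data: column 1 is proportional to y, column 2 is related, column 0 is not.
X = [[1, 2, 1], [-1, 4, 3], [1, 6, 2], [-1, 8, 5], [1, 10, 4]]
y = [1, 2, 3, 4, 5]
order = sort_variables(X, y)
subsets = nested_subsets(order)
```

In a full workflow each subset in `subsets` would be passed to a PLS model and the subset with the best cross‐validation statistics kept.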

2.

Envirometrics. Part I: Modeling of water salinity and air quality data

Braibanti A Gollapalli NR Jonnalagaddaj SB Duvvuru S Rupenaguntla SR 《Annali di chimica》2001,91(1-2):29-39

Envirometrics utilises advanced mathematical, statistical and information tools to extract information. Two typical environmental data sets are analysed using MVATOB (Multi Variate Analysis TOol Box). The first data set corresponds to the variable river salinity. Least median of squares (LMS) detected the outliers, whereas linear least squares (LLS) could neither detect nor remove them. The second data set consists of daily readings of air quality values. Outliers are detected by LMS and unbiased regression coefficients are estimated by multi-linear regression (MLR). As the explanatory variables are not independent, principal component regression (PCR) and partial least squares regression (PLSR) are used. Both examples demonstrate the superiority of LMS over LLS.

3.

Bayesian linear regression and variable selection for spectroscopic calibration (cited by: 2; self-citations: 0; citations by others: 2)

Tao Chen Elaine Martin 《Analytica chimica acta》2009,631(1):13-4221

This paper presents a Bayesian approach to the development of spectroscopic calibration models. By formulating the linear regression in a probabilistic framework, a Bayesian linear regression model is derived, and a specific optimization method, i.e. Bayesian evidence approximation, is utilized to estimate the model “hyper-parameters”. The relation of the proposed approach to the calibration models in the literature is discussed, including ridge regression and Gaussian process model. The Bayesian model may be modified for the calibration of multivariate response variables. Furthermore, a variable selection strategy is implemented within the Bayesian framework, the motivation being that the predictive performance may be improved by selecting a subset of the most informative spectral variables. The Bayesian calibration models are applied to two spectroscopic data sets, and they demonstrate improved prediction results in comparison with the benchmark method of partial least squares.
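The link to ridge regression mentioned in this abstract can be made concrete with the standard textbook form of Bayesian linear regression. The sketch below assumes a zero-mean isotropic Gaussian prior with precision \alpha on the coefficient vector b and Gaussian noise with precision \beta. These symbols are assumptions for illustration; the abstract does not state the authors' exact parameterization.

```latex
p(\mathbf{b}) = \mathcal{N}\!\left(\mathbf{b} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}\right),
\qquad
p(\mathbf{y} \mid \mathbf{X}, \mathbf{b}) = \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{X}\mathbf{b},\, \beta^{-1}\mathbf{I}\right)

\mathbf{m}
  = \beta \left(\alpha\mathbf{I} + \beta\,\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}
  = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y},
\qquad
\lambda = \alpha / \beta
```

Under these assumptions the posterior mean m coincides with the ridge estimate with penalty \lambda, so estimating \alpha and \beta from the data by evidence approximation plays the role that tuning the ridge penalty plays in the non-Bayesian setting.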

4.

Variable interaction network based variable selection for multivariate calibration

Rao R Lakshminarayanan S 《Analytica chimica acta》2007,599(1):24-35

Multivariate calibration problems often involve identifying a meaningful subset from a vast number of variables in order to better predict the output variables. A new graph theoretic method based on partial correlations (variable interaction network, VIN) is proposed. Many well studied representative calibration datasets spanning different application domains are selected for investigating the performance. Partial least squares (PLS) regression models combined with variable selection techniques are employed for benchmarking the performance. Subsets with different numbers of variables are retained for the final analysis after VIN selection, and progressive prediction accuracies are used for comparison. VIN-PLS results show significant improvement in prediction efficiencies and variable subset optimization. Improvement of up to 45% over existing methods with significantly fewer variables is achieved using the new method. Advantages of VIN based variable selection are highlighted.

5.

Influence of variable selection on partial least squares discriminant analysis models for explosive residue classification

Frank C. De Lucia Jr. Jennifer L. Gottfried 《Spectrochimica Acta Part B: Atomic Spectroscopy》2011,66(2):122-128

Using a series of thirteen organic materials that includes novel high-nitrogen energetic materials, conventional organic military explosives, and benign organic materials, we have demonstrated the importance of variable selection for maximizing residue discrimination with partial least squares discriminant analysis (PLS-DA). We built several PLS-DA models using different variable sets based on laser induced breakdown spectroscopy (LIBS) spectra of the organic residues on an aluminum substrate under an argon atmosphere. The model classification results for each sample are presented and the influence of the variables on these results is discussed. We found that using the whole spectra as the data input for the PLS-DA model gave the best results. However, variables due to the surrounding atmosphere and the substrate contribute to discrimination when the whole spectra are used, indicating this may not be the most robust model. Further iterative testing with additional validation data sets is necessary to determine the most robust model.

6.

Modeling based on subspace orthogonal projections for QSAR and QSPR research

Yizeng Liang Dalin Yuan Qingsong Xu Olav Martin Kvalheim 《Journal of Chemometrics》2008,22(1):23-35

A novel projection modeling method for quantitative structure activity relationship (QSAR) and quantitative structure property relationship (QSPR) is developed in this paper. Orthogonalization of block variables is introduced to deal with the problem of variable selection. Projections based on least squares are used to construct the modeling space in order to search for the best regression directions for chemical modeling. A suitable prediction space for such a model is further defined to confine the usage range of the model. Three real data sets were analyzed to check the performance of the proposed modeling method. The results obtained from Monte‐Carlo cross‐validation (MCCV) showed that the proposed modeling method might provide better results for QSAR and QSPR modeling than PCR and PLS with respect to both fitting and prediction abilities. Copyright © 2007 John Wiley & Sons, Ltd.

7.

Optimized block-wise variable combination by particle swarm optimization for partial least squares modeling in quantitative structure-activity relationship studies

Lin WQ Jiang JH Shen Q Shen GL Yu RQ 《Journal of chemical information and modeling》2005,45(2):486-493

8.

Orthogonal signal correction, wavelet analysis, and multivariate calibration of complicated process fluorescence data (cited by: 2; self-citations: 0; citations by others: 2)

Lennart Eriksson Johan Trygg Erik Johansson Rasmus Bro Svante Wold 《Analytica chimica acta》2000,420(2):625-195

In this paper, multivariate calibration of complicated process fluorescence data is presented. Two data sets related to the production of white sugar are investigated. The first data set comprises 106 observations and 571 spectral variables, and the second data set 268 observations and 3997 spectral variables. In both applications, a single response, ash content, is modelled and predicted as a function of the spectral variables. Both data sets contain certain features making multivariate calibration efforts non-trivial. The objective is to show how principal component analysis (PCA) and partial least squares (PLS) regression can be used to overview the data sets and to establish predictively sound regression models. It is shown how a recently developed technique for signal filtering, orthogonal signal correction (OSC), can be applied in multivariate calibration to enhance predictive power. In addition, signal compression is tested on the larger data set using wavelet analysis. It is demonstrated that a compression down to 4% of the original matrix size (in the variable direction) is possible without loss of predictive power. It is concluded that the combination of OSC for pre-processing and wavelet analysis for compression of spectral data is promising for future use.

9.

Evaluation of Multivariate Calibration Using a Tikhonov Regularization Approach and the Generalized Pair‐Correlation Method with Nonlinear Data

《Analytical letters》2012,45(6):1227-1251

In order to reduce data nonlinearity and overfitting with the multivariate calibration model y = Xb, a modified Tikhonov regularization (TR) algorithm is evaluated for selecting key variables from an X augmented with extra columns that contain the original measured variables (x_ij) as squared terms (x_ij²) and other orders. The TR approach simultaneously develops the multivariate calibration model. The new generalized pair‐correlation method (GPCM) is also studied for variable selection followed by partial least squares (PLS) for multivariate calibration. Results from synthetic spectral data are compared when using the modified TR approach, GPCM, and PLS without variable selection. The GPCM usually performs slightly better than the TR approach for tabulated bias and variance measures, in some cases at a sacrifice to parsimony. PLS without variable selection performs the worst. Using synthetic spectral data sets made it possible to study how the methods work. Thus, results from this study will aid investigators of real spectral data sets exhibiting nonlinear behavior.

10.

Comparison of the variable importance in projection (VIP) and of the selectivity ratio (SR) methods for variable selection and interpretation

Mireia Farrés Stefan Platikanov Stefan Tsakovski Romà Tauler 《Journal of Chemometrics》2015,29(10):528-536

11.

Construction of stable multivariate calibration models using unsupervised segmented principal component regression

Bahram Hemmateenejad Sadegh Karimi 《Journal of Chemometrics》2011,25(4):139-150

In multivariate spectral calibration by principal component regression (PCR), the principal components (PCs) are calculated from the response data measured at all employed instrument channels; however, some channels are redundant and their responses do not carry useful information. Thus, the extracted PCs contain mixed information from both useful and redundant channels. In this work, we propose a segmentation approach based on unsupervised pattern recognition to identify the most informative spectral region and then to construct a stable multivariate calibration model by PCR. In this method, the instrument channels are clustered into different segments via a Kohonen self‐organizing map. The spectral data of each segment are then subjected to PCA, and the derived PCs are used as input variables for an inverse least squares (ILS) regression model employing stepwise selection of the informative PCs. The proposed method was evaluated by the analysis of four simulated and six experimental data sets. It was found that the proposed method can model these data sets with prediction errors lower than those of conventional partial least squares (PLS) and PCR methods. In addition, the prediction ability of the method was better than that of the previously reported models for these data sets. Copyright © 2011 John Wiley & Sons, Ltd.

12.

Opening the kernel of kernel partial least squares and support vector machines

Postma GJ Krooshof PW Buydens LM 《Analytica chimica acta》2011,705(1-2):123-134

Kernel partial least squares (KPLS) and support vector regression (SVR) have become popular techniques for regression of complex non-linear data sets. The modeling is performed by mapping the data into a higher dimensional feature space through the kernel transformation. The disadvantage of such a transformation is, however, that information about the contribution of the original variables to the regression is lost. In this paper we introduce a method which can retrieve and visualize the contribution of the variables to the regression model and the way the variables contribute to the regression of complex data sets. The method is based on the visualization of trajectories using so-called pseudo samples representing the original variables in the data. We test and illustrate the proposed method on several synthetic and real benchmark data sets. The results show that for linear and non-linear regression models the important variables were identified with corresponding linear or non-linear trajectories. The results were verified by comparing with ordinary PLS regression and by selecting those variables which were indicated as important and rebuilding a model with only those variables.

13.

Statistical modeling of a ligand knowledge base

Mansson RA Welsh AH Fey N Orpen AG 《Journal of chemical information and modeling》2006,46(6):2591-2600

14.

A Variable Selection Method of Near Infrared Spectroscopy Based on Automatic Weighting Variable Combination Population Analysis

Huan ZHAO Ke-Wei HUAN Xiao-Guang SHI Feng ZHENG Li-Ying LIU Wei LIU Chun-Ying ZHAO 《分析化学》2018,46(1):136-142

Near-infrared (NIR) spectroscopy is widely used in quantitative and qualitative food analysis. With the development of chemometrics, variable selection has become a critical step in spectral modeling. In this study, a novel variable selection strategy, automatic weighting variable combination population analysis (AWVCPA), is proposed. Firstly, a binary matrix sampling (BMS) strategy, which gives each variable the same chance of being selected and generates different variable combinations, is used to produce a population of subsets from which a population of sub-models is constructed. Then, the variable frequency (Fre) and the partial least squares regression coefficients (Reg), two kinds of information vectors (IVs), are weighted to obtain the contribution value of each spectral variable, so that the influence of both IVs, Fre and Reg, is taken into account for each spectral variable. Finally, an exponentially decreasing function (EDF) is used to remove the low-contribution wavelengths so as to select the characteristic variables. For the near infrared spectra of beer and corn, prediction models of yeast and oil concentration based on partial least squares (PLS) are established. Compared with other variable selection methods, AWVCPA is shown to be the best variable selection strategy under the same conditions. On the beer data set, AWVCPA-PLS improves on PLS by 72.7%, with the root mean square error of prediction (RMSEP) decreasing from 0.5348 to 0.1457. On the corn data set, AWVCPA-PLS improves on PLS by 64.7%, with the RMSEP decreasing from 0.0702 to 0.0248.
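The improvement percentages quoted in this abstract can be checked directly from the reported RMSEP values. The short sketch below assumes that "improvement" means the relative reduction in RMSEP with respect to the full-spectrum PLS model, which the abstract implies but does not define; the function names are hypothetical.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over a set of test samples."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def relative_improvement(rmsep_reference, rmsep_new):
    """Percentage reduction of RMSEP relative to the reference model."""
    return 100.0 * (rmsep_reference - rmsep_new) / rmsep_reference


# RMSEP values quoted in the abstract (PLS versus AWVCPA-PLS).
beer = relative_improvement(0.5348, 0.1457)
corn = relative_improvement(0.0702, 0.0248)
print(f"beer: {beer:.2f}%  corn: {corn:.2f}%")
```

Under that assumption the computed reductions are about 72.8% for beer and 64.7% for corn, which agrees with the reported 72.7% and 64.7% to within rounding.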

15.

Robust linear regression taking into account errors in the predictor and response variables (cited by: 1; self-citations: 0; citations by others: 1)

del Río FJ Riu J Rius FX 《The Analyst》2001,126(7):1113-1117

We developed a robust regression technique that generalizes the least median of squares (LMS) technique to the case in which the errors in both the predictor and the response variables are taken into account. This simple generalization is limited in the sense that the resulting straight line is found by using only two points from the initial data set. For this reason, a simulation step using the Monte Carlo method is added to generate the best robust regression line. We call this new technique 'bivariate least median of squares' (BLMS), following the notation of the LMS method. We checked the robustness of the new regression technique by calculating its breakdown point, which was 50%. This confirms the robustness of the BLMS regression line. In order to show its applicability to the chemical field we tested it on simulated data sets and real data sets with outliers. The BLMS robust regression line was not affected by many types of outlying points in the data sets.
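To illustrate the "two points" limitation mentioned in this abstract, here is a minimal sketch of the ordinary LMS line found by exhaustive search over lines through pairs of data points. It considers errors in the response only, so it is plain LMS and not the BLMS method of the paper; the Monte Carlo step and the treatment of errors in the predictor are not reproduced. The function name `lms_line` is hypothetical.

```python
from itertools import combinations
from statistics import median


def lms_line(points):
    """Return (slope, intercept) of the two-point line with the smallest
    median of squared residuals in y."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, cannot be written as y = a*x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
        criterion = median(squared)
        if best is None or criterion < best[0]:
            best = (criterion, slope, intercept)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


# Six points on y = 2x + 1 plus one gross outlier at x = 6.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 40)]
slope, intercept = lms_line(data)
```

Because the criterion is a median rather than a sum, the single outlier does not pull the fitted line away from the six consistent points, which is the behaviour a high breakdown point describes.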

16.

Genetic algorithm optimisation combined with partial least squares regression and mutual information variable selection procedures in near-infrared quantitative analysis of cotton-viscose textiles

Durand A Devos O Ruckebusch C Huvenne JP 《Analytica chimica acta》2007,595(1-2):72-79

In this work, different approaches for variable selection are studied in the context of near-infrared (NIR) multivariate calibration of textiles. First, a model-based regression method is proposed. It consists of genetic algorithm optimisation combined with partial least squares regression (GA-PLS). The second approach is a relevance measure of spectral variables based on mutual information (MI), which can be performed independently of any given regression model. As MI makes no assumption about the relationship between X and Y, non-linear methods such as feed-forward artificial neural networks (ANN) are encouraged for modelling in a prediction context (MI-ANN). GA-PLS and MI-ANN models are developed for NIR quantitative prediction of cotton content in cotton-viscose textile samples. The results are compared to a full-spectrum (480 variables) PLS model (FS-PLS). The FS-PLS model requires 11 latent variables and yields a 3.74% RMS prediction error in the range 0-100%. GA-PLS provides a more robust model based on 120 variables and slightly enhanced prediction performance (3.44% RMS error). With the MI variable selection procedure, a great reduction is obtained, as only 12 variables are retained. On the basis of these variables, a 12-input ANN model is trained and the corresponding prediction error is 3.43% RMS.

17.

ST‐PLS: a multi‐directional nearest shrunken centroid type classifier via PLS

Solve Sæbø Trygve Almøy Jørgen Aarøe Are H. Aastveit 《Journal of Chemometrics》2008,22(1):54-62

The nearest shrunken centroid (NSC) Classifier is successfully applied for class prediction in a wide range of studies based on microarray data. The contribution from seemingly irrelevant variables to the classifier is minimized by the so‐called soft‐thresholding property of the approach. In this paper, we first show that for the two‐class prediction problem, the NSC Classifier is similar to a one‐component discriminant partial least squares (PLS) model with soft‐shrinkage of the loading weights. Then we introduce the soft‐threshold‐PLS (ST‐PLS) as a general discriminant‐PLS model with soft‐thresholding of the loading weights of multiple latent components. This method is especially suited for classification and variable selection when the number of variables is large compared to the number of samples, which is typical for gene expression data. A characteristic feature of ST‐PLS is the ability to identify important variables in multiple directions in the variable space. Both the ST‐PLS and the NSC classifiers are applied to four real data sets. The results indicate that ST‐PLS performs better than the shrunken centroid approach if there are several directions in the variable space which are important for classification, and there are strong dependencies between subsets of variables. Copyright © 2007 John Wiley & Sons, Ltd.
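The soft‐thresholding operation referred to in this abstract can be sketched in a few lines. The example applies the generic operator sign(w)·max(|w| − δ, 0) to a vector of loading weights. The threshold name `delta` and the helper `selected_variables` are hypothetical, and any scaling or renormalization of the weights used in ST‐PLS itself is not reproduced here.

```python
import math


def soft_threshold(weights, delta):
    """Shrink each weight towards zero by delta; weights smaller than delta
    in magnitude become exactly zero."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - delta, 0.0)
        shrunk.append(0.0 if magnitude == 0.0 else math.copysign(magnitude, w))
    return shrunk


def selected_variables(weights, delta):
    """Indices of the variables whose weights survive the thresholding."""
    return [j for j, w in enumerate(soft_threshold(weights, delta)) if w != 0.0]


loading_weights = [0.75, -0.5, 0.125, -0.125]
shrunk = soft_threshold(loading_weights, 0.25)
kept = selected_variables(loading_weights, 0.25)
```

Variables whose weights are set exactly to zero drop out of the component, which is how thresholding doubles as variable selection.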

18.

Predictive-property-ranked variable reduction in partial least squares modelling with final complexity adapted models: Comparison of properties for ranking

Jan P.M. Andries Yvan Vander Heyden Lutgarde M.C. Buydens 《Analytica chimica acta》2013

The calibration performance of partial least squares regression for one response (PLS1) can be improved by eliminating uninformative variables. Many variable-reduction methods are based on so-called predictor-variable properties or predictive properties, which are functions of various PLS-model parameters, and which may change during the steps of the variable-reduction process. Recently, a new predictive-property-ranked variable reduction method with final complexity adapted models, denoted as PPRVR-FCAM or simply FCAM, was introduced. It is a backward variable elimination method applied on the predictive-property-ranked variables. The variable number is first reduced, with constant PLS1 model complexity A, until A variables remain, followed by a further decrease in PLS complexity, allowing the final selection of small numbers of variables.

19.

Near Infrared Spectral Similarity Combined with Variable Selection Method in the Quality Control of Flos Lonicerae: A Preliminary Study

Ni Xin Qinghua Meng Yizhen Li Yuzhu Hu 《中国化学》2011,29(11):2533-2540

This paper demonstrates the possibility of using near infrared (NIR) spectral similarity as a rapid method to estimate the quality of Flos Lonicerae. Variable selection together with modelling techniques is utilized to select representative variables that are used to calculate the similarity. NIR is used to build calibration models to predict the bacteriostatic activity of Flos Lonicerae. The bacteriostatic activity is determined by an in vitro experiment. Models are built for the Gram‐positive bacteria and also for the Gram‐negative bacteria. A genetic algorithm combined with partial least squares regression (GA‐PLS) is used to perform the calibration. The results of the GA‐PLS models are compared to interval partial least squares (iPLS) models, full‐spectrum PLS and full‐spectrum principal component regression (PCR) models. Then, the variables in the two GA‐PLS models are combined and used to calculate the NIR spectral similarity of samples. The similarity based on the characteristic variables and the similarity based on the full spectrum are each used for evaluating the fingerprints of Flos Lonicerae. The results show that the combination of variable selection method, modelling techniques and similarity analysis might be a powerful tool for quality control of traditional Chinese medicine (TCM).

20.

Computational performance and cross‐validation error precision of five PLS algorithms using designed and real data sets

João Paulo A. Martins Reinaldo F. Teófilo Márcia M. C. Ferreira 《Journal of Chemometrics》2010,24(6):320-332

An evaluation of computational performance and precision regarding the cross‐validation error of five partial least squares (PLS) algorithms (NIPALS, modified NIPALS, Kernel, SIMPLS and bidiagonal PLS), available and widely used in the literature, is presented. When dealing with large data sets, computational time is an important issue, mainly in cross‐validation and variable selection. In the present paper, the PLS algorithms are compared in terms of the run time and the relative error in the precision obtained when performing leave‐one‐out cross‐validation using simulated and real data sets. The simulated data sets were investigated through factorial and Latin square experimental designs. The evaluations were based on the number of rows, the number of columns and the number of latent variables. With respect to performance, the results for both simulated and real data sets have shown that the differences in run time are statistically significant. Bidiagonal PLS is the fastest algorithm, followed by Kernel and SIMPLS. Regarding cross‐validation error, all algorithms showed similar results. However, in some situations, for example when many latent variables were used, discrepancies were observed, especially with respect to SIMPLS. Copyright © 2010 John Wiley & Sons, Ltd.