
1.

Variable selection in multivariate calibration based on clustering of variable concept

Maryam Farrokhnia Sadegh Karimi 《Analytica chimica acta》2016

Recently we proposed a new variable selection algorithm for classification problems, based on the clustering of variables concept (CLoVA). Here the same idea is applied to a regression problem, and the results are compared with conventional variable selection strategies for PLS. The basic idea behind clustering of variables is that the instrument channels are grouped into different clusters via clustering algorithms. The spectral data of each cluster are then subjected to PLS regression. Different real data sets (Cargill corn, Biscuit dough, ACE QSAR, Soy, and Tablet) were used to evaluate the influence of the clustering of variables on the prediction performance of PLS. In almost all cases, the statistical parameters, especially the prediction error, show the superiority of CLoVA-PLS over other variable selection strategies. Finally, synergy clustering of variables (sCLoVA-PLS), which uses combinations of clusters, is proposed as an efficient modification of the CLoVA algorithm. The obtained statistical parameters indicate that variable clustering can separate the useful part of the data from the redundant parts, so that a stable model can be built on the informative clusters.

2.

Sorting variables by using informative vectors as a strategy for feature selection in multivariate regression

Reinaldo F. Teófilo João Paulo A. Martins Márcia M. C. Ferreira 《Journal of Chemometrics》2009,23(1):32-48

A new procedure that enhances the prediction ability of multivariate calibration models using a small number of interpretable variables is presented. The core of the methodology is to sort the variables according to an informative vector and then to investigate PLS regression models systematically, with the aim of finding the most relevant set of variables by comparing the cross‐validation parameters of the models obtained. In this work, seven main informative vectors, i.e. the regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc) and signal‐to‐noise ratios vector (StN), and their combinations were automated and tested for feature selection. Six data sets from different sources were employed to validate the methodology. They originated from near‐infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure‐activity relationships (QSAR) and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, the regression and NAS informative vectors from partial least squares (PLS) regression, both built using more latent variables than were used to build the model, were the best informative vectors for variable selection in most of the tested data sets. In all the applications, the selected variables were quite effective and useful for interpretation. Copyright © 2008 John Wiley & Sons, Ltd.
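The sorting step described in this abstract can be illustrated with a minimal sketch. It assumes the correlation vector as the informative vector and uses only the Python standard library; the function names (`pearson`, `sort_by_correlation`, `nested_subsets`) and the toy data are hypothetical and are not taken from the paper. The cross-validated PLS evaluation of each candidate subset is left out.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / math.sqrt(sxx * syy)

def sort_by_correlation(X, y):
    """Return column indices of X ordered from most to least informative,
    using |correlation with y| as the (assumed) informative vector."""
    n_vars = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_vars)]
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)

def nested_subsets(order, step=1):
    """Candidate subsets: the top k variables for k = step, 2*step, ..."""
    return [order[:k] for k in range(step, len(order) + 1, step)]

# Toy data (hypothetical): 4 samples, 3 variables.
X = [[1.0, 5.0, 0.3],
     [2.0, 3.0, 0.1],
     [3.0, 6.0, 0.4],
     [4.0, 2.0, 0.2]]
y = [1.1, 1.9, 3.2, 3.9]

order = sort_by_correlation(X, y)
subsets = nested_subsets(order)
```

In a full workflow each entry of `subsets` would be passed to a cross-validated regression model, and the subset with the best cross-validation parameters would be kept.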

3.

Variable selection by modified IPW (iterative predictor weighting)-PLS (partial least squares) in continuous wavelet regression models

Chen D Hu B Shao X Su Q 《The Analyst》2004,129(7):664-669

Variable selection is often used to produce more robust and parsimonious regression models. However, when it is applied directly to raw near-infrared (NIR) spectra, it is not easy to select appropriate variables because background and noise often overshadow or overlap the absorption bands of the analyte. In this work, a new hybrid algorithm based on the selection of the most informative variables in the continuous wavelet transform (CWT) domain is described. The strategy combines CWT with a modified iterative predictor weighting-partial least squares (mIPW-PLS) procedure. After the background and noise in the NIR spectra have been eliminated by CWT, the mIPW-PLS approach is used to select the most informative CWT coefficients. Finally, a PLS model for prediction is built with the selected CWT coefficients. The results indicate that extracting the most important variables in the CWT domain can effectively avoid the interference of background and noise, and yields a high-quality regression model with a very small number of variables and fewer PLS components.

4.

Evaluating the reliability of spectral variables selected by subsampling methods

Zhaozhou Lin Xiaoning Pan Bing Xu Jiayu Zhang Xinyuan Shi Yanjiang Qiao 《Journal of Chemometrics》2015,29(2):87-95

Evaluating a subsampling variable selection method by its prediction performance alone is insufficient. To further assess the reliability of subsampling variable selection methods, dummy noise variables of different amplitudes were appended to the original spectral data, and the number of falsely selected variables was recorded. The reliabilities of three subsampling variable selection methods, Monte Carlo uninformative variable elimination (MC‐UVE), competitive adaptive reweighted sampling (CARS), and stability CARS (SCARS), were evaluated using this dummy noise strategy. The evaluation results indicated that both CARS and SCARS produced more parsimonious variable sets, but the reliabilities of their final variable sets were weaker than those of MC‐UVE. In contrast, MC‐UVE gave only a marginal improvement in prediction performance. Further experiments showed that removing white noise‐like variables beforehand would improve the reliability of the variables extracted by CARS and SCARS. Copyright © 2014 John Wiley & Sons, Ltd.
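The dummy noise strategy described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the noise is taken to be zero-mean Gaussian with standard deviation `amplitude`, and the helper names are hypothetical. The variable selection method itself (MC‐UVE, CARS or SCARS) is not implemented; `selected` stands for whatever indices such a method returns.

```python
import random

def augment_with_noise(X, n_dummy, amplitude, seed=0):
    """Append n_dummy Gaussian noise columns (std = amplitude) to every row of X.
    The Gaussian form of the noise is an assumption made for this sketch."""
    rng = random.Random(seed)
    return [list(row) + [rng.gauss(0.0, amplitude) for _ in range(n_dummy)]
            for row in X]

def count_false_selections(selected, n_real):
    """Number of selected column indices that point at dummy noise columns,
    i.e. indices at or beyond the number of real variables."""
    return sum(1 for j in selected if j >= n_real)

# Toy data (hypothetical): 3 samples, 2 real spectral variables.
X = [[0.10, 0.20],
     [0.15, 0.25],
     [0.12, 0.22]]
X_aug = augment_with_noise(X, n_dummy=3, amplitude=0.01)

# Suppose a selection method returned these column indices on X_aug.
selected = [0, 3, 4]
n_false = count_false_selections(selected, n_real=2)
```

Repeating this for several values of `amplitude` gives the false-selection counts that the abstract uses as a reliability measure.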

5.

Variable interaction network based variable selection for multivariate calibration

Rao R Lakshminarayanan S 《Analytica chimica acta》2007,599(1):24-35

Multivariate calibration problems often involve identifying a meaningful subset of variables from a vast number of candidates in order to predict the output variables better. A new graph-theoretic method based on partial correlations (variable interaction network, VIN) is proposed. Many well-studied representative calibration datasets spanning different application domains are selected for investigating its performance. Partial least squares (PLS) regression models combined with variable selection techniques are employed for benchmarking. After VIN selection, variable subsets of different sizes are retained for the final analysis, and progressive prediction accuracies are used for comparison. VIN-PLS results show significant improvement in prediction efficiency and variable subset optimization. An improvement of up to 45% over existing methods, with significantly fewer variables, is achieved using the new method. The advantages of VIN-based variable selection are highlighted.

6.

A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration (Cited by: 4; self-citations: 0; citations by others: 4)

Yong-Huan Yun Wei-Ting Wang Min-Li Tan Yi-Zeng Liang Hong-Dong Li Dong-Sheng Cao Hong-Mei Lu Qing-Song Xu 《Analytica chimica acta》2014

Given the high dimensionality of modern datasets, creating effective methods that can select an optimal variable subset is a great challenge. In this study, a strategy called iteratively retaining informative variables (IRIV), which considers the possible interaction effects among variables through random combinations, was proposed. The variables are classified into four categories: strongly informative, weakly informative, uninformative and interfering. On this basis, IRIV retains both the strongly and weakly informative variables in every iterative round until no uninformative or interfering variables remain. Three datasets were employed to investigate the performance of IRIV coupled with partial least squares (PLS). The results show that IRIV is a good alternative variable selection strategy when compared with three outstanding and frequently used variable selection methods: genetic algorithm-PLS, Monte Carlo uninformative variable elimination by PLS (MC-UVE-PLS) and competitive adaptive reweighted sampling (CARS). The MATLAB source code of IRIV can be freely downloaded for academic research at the website: http://code.google.com/p/multivariate-calibration/downloads/list.

7.

Bayesian linear regression and variable selection for spectroscopic calibration (Cited by: 2; self-citations: 0; citations by others: 2)

Tao Chen Elaine Martin 《Analytica chimica acta》2009,631(1):13-21

This paper presents a Bayesian approach to the development of spectroscopic calibration models. By formulating linear regression in a probabilistic framework, a Bayesian linear regression model is derived, and a specific optimization method, Bayesian evidence approximation, is used to estimate the model "hyper-parameters". The relation of the proposed approach to calibration models in the literature, including ridge regression and Gaussian process models, is discussed. The Bayesian model may be modified for the calibration of multivariate response variables. Furthermore, a variable selection strategy is implemented within the Bayesian framework, the motivation being that predictive performance may be improved by selecting a subset of the most informative spectral variables. The Bayesian calibration models are applied to two spectroscopic data sets, and they demonstrate improved prediction results in comparison with the benchmark method of partial least squares.

8.

Reproducibility, complementary measure of predictability for robustness improvement of multivariate calibration models via variable selections

Hae Woo Lee Andrew Bawn Seongkyu Yoon 《Analytica chimica acta》2012

In multivariate calibration with spectral datasets, variable selection is often applied to identify a relevant subset of variables, leading to improved prediction accuracy and easier interpretation of the selected fingerprint regions. Numerous variable selection methods have been proposed so far, but choosing among them is not trivial. Furthermore, in many cases the set of variables found by those methods might not be robust because of irreproducibility and uncertainty issues, which poses a great challenge for improving the reliability of variable selection. In this study, the reproducibility of five variable selection methods was investigated quantitatively in order to evaluate their performance. The reproducibility of variable selection was quantified using Monte-Carlo sub-sampling (MCS) techniques together with a quantitative similarity measure designed for highly collinear spectral datasets. The investigation of the reproducibility and prediction accuracy of several variable selection algorithms on two different near-infrared (NIR) datasets showed that the methods varied widely in their performance, especially in their ability to identify a consistent subset of variables from the spectral datasets. Thus a thorough assessment of reproducibility together with the predictive accuracy of the identified variables improved the statistical validity of, and confidence in, the selection outcome, which cannot be addressed by conventional evaluation schemes.
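The idea of quantifying reproducibility by Monte-Carlo sub-sampling can be sketched as below. The abstract does not define the similarity measure used in the paper, so this sketch swaps in the plain Jaccard index, which does not account for collinearity between neighbouring spectral channels. The `select` callback, the function names and the default settings are all hypothetical.

```python
import random
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two index collections (a stand-in measure,
    not the collinearity-aware measure referred to in the abstract)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def reproducibility(X, y, select, n_runs=20, fraction=0.8, seed=0):
    """Mean pairwise similarity of the variable subsets that `select`
    returns on random sub-samples of the calibration set.
    `select(X_sub, y_sub)` must return an iterable of column indices."""
    rng = random.Random(seed)
    n = len(X)
    k = max(2, int(round(fraction * n)))
    subsets = []
    for _ in range(n_runs):
        idx = rng.sample(range(n), k)          # sample rows without replacement
        subsets.append(select([X[i] for i in idx], [y[i] for i in idx]))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy check (hypothetical): a selector that always returns the same variable
# is perfectly reproducible.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [float(i) for i in range(10)]
score = reproducibility(X, y, select=lambda Xs, ys: [0])
```

A score near 1 means the method picks nearly the same variables on every sub-sample; a score near 0 means the selected subsets barely overlap.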

9.

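Discrete wavelet transform, genetic algorithm and cross-validation for high-ratio compression and variable selection of near-infrared spectral data (Cited by: 11; self-citations: 0; citations by others: 11)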

Wang Guoqing Shao Xueguang 《分析化学》2005,33(2):191-194

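A method combining a genetic algorithm (GA) with cross-validation (CV) was established for variable selection on near-infrared (NIR) spectral data and their discrete wavelet transform (DWT) coefficients, and was applied to the simultaneous determination of total volatile bases and total nitrogen in tobacco samples. The results show that almost no spectral information is lost when the NIR data are compressed by DWT to 3.3% of their original size, and that effective variable selection can greatly reduce the number of variables in the model, lower the model complexity, and improve the prediction accuracy.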

10.

Genetic Algorithm guided Selection: variable selection and subset selection (Cited by: 3; self-citations: 0; citations by others: 3)

Cho SJ Hermsmeier MA 《Journal of chemical information and computer sciences》2002,42(4):927-936


11.

Determination of α-linolenic acid and linoleic acid in edible oils using near-infrared spectroscopy improved by wavelet transform and uninformative variable elimination (Cited by: 1; self-citations: 0; citations by others: 1)

Di Wu Xiaojing Chen Pinyan Shi Fengqin Feng Yong He 《Analytica chimica acta》2009,634(2):166-171

This paper proposes an analytical method for the simultaneous near-infrared (NIR) spectrometric determination of α-linolenic and linoleic acid in eight types of edible vegetable oils and their blends. For this purpose, a combination of spectral wavelength selection by wavelet transform (WT) and uninformative variable elimination (UVE) was proposed to obtain simple partial least squares (PLS) models based on a small subset of wavelengths. WT was first used to compress the full NIR spectra, which contain 1413 redundant variables, and 42 wavelet approximation coefficients were obtained. UVE was then carried out to further select the informative variables. Finally, 27 and 19 wavelet approximation coefficients were selected by UVE for α-linolenic and linoleic acid, respectively. The selected variables were used as inputs to the PLS model. Because the original spectra were compressed and irrelevant variables were eliminated, a more parsimonious and efficient model based on WT-UVE was obtained compared with the conventional PLS model using full spectral data. For α-linolenic acid, the coefficient of determination (r²) and the root mean square error of prediction (RMSEP) on the prediction set were 0.9345 and 0.0123 with the WT-UVE-PLS model. For linoleic acid, r² and RMSEP were 0.9054 and 0.0437. This good performance shows the potential of WT-UVE for selecting effective NIR variables. WT-UVE can both speed up the calculation and improve the predicted results. The results indicate that it is feasible to determine the α-linolenic acid and linoleic acid content of edible oils rapidly using NIR spectroscopy.
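The two figures of merit quoted in this abstract can be computed as in the sketch below. The RMSEP formula is the standard one; for r² the sketch assumes the definition 1 − SS_res/SS_tot, whereas some papers report the squared correlation between measured and predicted values instead, and the abstract does not say which was used. The function names and the toy values are hypothetical.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean square error of prediction over a prediction set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical), not data from the paper.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]
error = rmsep(y_true, y_pred)
fit = r_squared(y_true, y_pred)
```

Both functions take the reference values and the model predictions for the same prediction-set samples, in the same order.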

12.

Modelling the quality of enantiomeric separations using Mutual Information as an alternative variable selection technique

Caetano S Krier C Verleysen M Vander Heyden Y 《Analytica chimica acta》2007,602(1):37-46

This paper uses Mutual Information as an alternative variable selection method for quantitative structure-property relationship data. To evaluate the performance of this criterion, the enantioselectivity of 67 molecules on three different chiral stationary phases is modelled. Partial Least Squares together with three commonly used variable selection techniques were evaluated and then compared with the results obtained when using Mutual Information together with Support Vector Machines. The results show not only that variable selection is a necessary step in quantitative structure-property relationship modelling, but also that Mutual Information associated with Support Vector Machines is a valuable alternative to Partial Least Squares combined with either the correlation between the explanatory and the response variables or Genetic Algorithms. This study also demonstrates that, by producing models that use a rather small set of variables, the interpretation can also be improved.

13.

A Variable Selection Method of Near Infrared Spectroscopy Based on Automatic Weighting Variable Combination Population Analysis

Huan ZHAO Ke-Wei HUAN Xiao-Guang SHI Feng ZHENG Li-Ying LIU Wei LIU Chun-Ying ZHAO 《分析化学》2018,46(1):136-142

Near-infrared (NIR) spectroscopy is widely used in the quantitative and qualitative analysis of food. With the development of chemometrics, variable selection has become a critical step in spectral modeling. In this study, a novel variable selection strategy, automatic weighting variable combination population analysis (AWVCPA), is proposed. First, the binary matrix sampling (BMS) strategy, which gives each variable the same chance of being selected and generates different variable combinations, is used to produce a population of subsets from which a population of sub-models is constructed. Then, two kinds of information vectors (IVs), the variable frequency (Fre) and partial least squares regression (Reg), are weighted to obtain the contribution of each spectral variable, so that the influence of both Fre and Reg on each spectral variable is taken into account. Finally, the exponentially decreasing function (EDF) is used to remove the low-contribution wavelengths so as to select the characteristic variables. Using the NIR spectra of beer and corn, partial least squares (PLS) prediction models for yeast and oil concentration are established. Compared with other variable selection methods, AWVCPA proves to be the best variable selection strategy under the same conditions. On the beer dataset, AWVCPA-PLS gives a 72.7% improvement over PLS, with the root mean square error of prediction (RMSEP) decreasing from 0.5348 to 0.1457. On the corn dataset, the improvement is 64.7%, with the RMSEP decreasing from 0.0702 to 0.0248.
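The improvement percentages quoted in this abstract follow from the RMSEP values if "improvement" is read as the relative reduction in RMSEP, which is an assumption of this sketch. The function name is hypothetical.

```python
def relative_improvement(before, after):
    """Relative reduction in error, in percent."""
    return 100.0 * (before - after) / before

# RMSEP values quoted in the abstract (full-spectrum PLS vs. AWVCPA-PLS).
beer = relative_improvement(0.5348, 0.1457)
corn = relative_improvement(0.0702, 0.0248)

print(f"beer: {beer:.2f}%  corn: {corn:.2f}%")
```

Under this reading the two values come out at roughly 72.8% and 64.7%, which agrees with the reported 72.7% and 64.7% up to rounding.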

14.

Prediction of retention indices for frequently reported compounds of plant essential oils using multiple linear regression, partial least squares, and support vector machine

Jun Yan Jian‐Hua Huang Min He Hong‐Bing Lu Rui Yang Bo Kong Qing‐Song Xu Yi‐Zeng Liang 《Journal of separation science》2013,36(15):2464-2471


15.

Using variable combination population analysis for variable selection in multivariate calibration

Yong-Huan Yun Wei-Ting Wang Bai-Chuan Deng Guang-Bi Lai Xin-bo Liu Da-Bing Ren Yi-Zeng Liang Wei Fan Qing-Song Xu 《Analytica chimica acta》2015

Variable (wavelength or feature) selection techniques have become a critical step in the analysis of datasets with a high number of variables and relatively few samples. In this study, a novel variable selection strategy, variable combination population analysis (VCPA), was proposed. This strategy consists of two crucial procedures. First, the exponentially decreasing function (EDF), which embodies the simple and effective principle of 'survival of the fittest' from Darwin's theory of natural evolution, is employed to determine the number of variables to keep and to shrink the variable space continuously. Second, in each EDF run, the binary matrix sampling (BMS) strategy, which gives each variable the same chance of being selected and generates different variable combinations, is used to produce a population of subsets from which a population of sub-models is constructed. Then, model population analysis (MPA) is employed to find the variable subsets with the lowest root mean square error of cross validation (RMSECV). The frequency with which each variable appears in the best 10% of sub-models is computed. The higher the frequency, the more important the variable. The performance of the proposed procedure was investigated using three real NIR datasets. The results indicate that VCPA is a good variable selection strategy when compared with four high-performing variable selection methods: genetic algorithm–partial least squares (GA–PLS), Monte Carlo uninformative variable elimination by PLS (MC-UVE-PLS), competitive adaptive reweighted sampling (CARS) and iteratively retaining informative variables (IRIV). The MATLAB source code of VCPA is available for academic research on the website: http://www.mathworks.com/matlabcentral/fileexchange/authors/498750.
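The three building blocks named in this abstract (EDF, BMS and the frequency count over the best sub-models) can be sketched roughly as follows. The abstract does not give the EDF formula, so the exponential schedule below (start at `n_vars`, end at `n_final` after `n_runs` runs) is an assumption, as are all function names and parameters. The RMSECV of each sub-model is not computed here; `errors` is a placeholder for those values.

```python
import math
import random

def edf_schedule(n_vars, n_final, n_runs):
    """Number of variables kept after each run, decaying exponentially
    from n_vars to n_final (assumed form of the EDF)."""
    theta = math.log(n_vars / n_final) / n_runs
    return [max(n_final, round(n_vars * math.exp(-theta * i)))
            for i in range(1, n_runs + 1)]

def binary_matrix_sampling(n_subsets, n_vars, ratio, seed=0):
    """Binary matrix with one row per subset and one column per variable.
    Every column holds the same number of ones, shuffled independently,
    so each variable is sampled equally often."""
    rng = random.Random(seed)
    ones = round(n_subsets * ratio)
    columns = []
    for _ in range(n_vars):
        col = [1] * ones + [0] * (n_subsets - ones)
        rng.shuffle(col)
        columns.append(col)
    return [[columns[j][i] for j in range(n_vars)] for i in range(n_subsets)]

def variable_frequency(matrix, errors, top_fraction=0.1):
    """Frequency of each variable among the sub-models with the lowest error."""
    n_best = max(1, int(len(matrix) * top_fraction))
    best = sorted(range(len(matrix)), key=lambda i: errors[i])[:n_best]
    n_vars = len(matrix[0])
    return [sum(matrix[i][j] for i in best) / n_best for j in range(n_vars)]

schedule = edf_schedule(n_vars=100, n_final=10, n_runs=5)
matrix = binary_matrix_sampling(n_subsets=20, n_vars=6, ratio=0.5)
errors = [float(i) for i in range(20)]   # placeholder for per-sub-model RMSECV
freq = variable_frequency(matrix, errors)
```

In each run, the variables with the lowest frequency would be dropped until only the number given by the schedule remains, and sampling would then be repeated on the reduced variable space.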

16.

Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 3. Variable selection in classification

Davide Ballabio Viviana Consonni Andrea Mauri Roberto Todeschini 《Analytica chimica acta》2010,657(2):116-122

In multivariate regression and classification problems, variable selection is an important procedure used to select an optimal subset of variables with the aim of producing more parsimonious and eventually more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and highly dimensional analytical procedures. In this paper a novel method of variable selection for classification purposes is introduced. The method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). In this case the CMC index is calculated for two specific sets of variables, the former comprising the independent variables and the latter the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to give a ranking of the variables on the basis of their class discrimination capabilities. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC can be selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. Finally, a variable Forward Selection based on the CMC index was used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets. The results obtained were encouraging.

17.

Construction of stable multivariate calibration models using unsupervised segmented principal component regression

Bahram Hemmateenejad Sadegh Karimi 《Journal of Chemometrics》2011,25(4):139-150

In multivariate spectral calibration by principal component regression (PCR), the principal components (PCs) are calculated from the response data measured at all employed instrument channels; however, some channels are redundant and their responses carry no useful information. Thus, the extracted PCs contain mixed information from both useful and redundant channels. In this work, we propose a segmentation approach based on unsupervised pattern recognition to identify the most informative spectral region and then to construct a stable multivariate calibration model by PCR. In this method, the instrument channels are clustered into different segments via a Kohonen self‐organizing map. The spectral data of each segment are then subjected to PCA, and the derived PCs are used as input variables for an inverse least squares (ILS) regression model employing stepwise selection of the informative PCs. The proposed method was evaluated by the analysis of four simulated and six experimental data sets. It was found that the proposed method can model these data sets with prediction errors lower than those of conventional partial least squares (PLS) and PCR methods. In addition, the prediction ability of the method was better than that of previously reported models for these data sets. Copyright © 2011 John Wiley & Sons, Ltd.

18.

Toward better QSAR/QSPR modeling: simultaneous outlier detection and variable selection using distribution of model features

Cao D Liang Y Xu Q Yun Y Li H 《Journal of computer-aided molecular design》2011,25(1):67-80


19.

Comparison of ridge regression, partial least-squares, pairwise correlation, forward- and best subset selection methods for prediction of retention indices for aliphatic alcohols

Farkas O Héberger K 《Journal of chemical information and modeling》2005,45(2):339-346


20.

Genetic algorithm optimisation combined with partial least squares regression and mutual information variable selection procedures in near-infrared quantitative analysis of cotton-viscose textiles

Durand A Devos O Ruckebusch C Huvenne JP 《Analytica chimica acta》2007,595(1-2):72-79

In this work, different approaches to variable selection are studied in the context of near-infrared (NIR) multivariate calibration of textiles. First, a model-based regression method is proposed. It consists of genetic algorithm optimisation combined with partial least squares regression (GA-PLS). The second approach is a relevance measure of spectral variables based on mutual information (MI), which can be computed independently of any given regression model. As MI makes no assumption about the relationship between X and Y, non-linear methods such as feed-forward artificial neural networks (ANN) are encouraged for modelling in a prediction context (MI-ANN). GA-PLS and MI-ANN models are developed for the NIR quantitative prediction of cotton content in cotton-viscose textile samples. The results are compared with a full-spectrum (480 variables) PLS model (FS-PLS). That model requires 11 latent variables and yields a 3.74% RMS prediction error in the range 0-100%. GA-PLS provides a more robust model based on 120 variables and slightly better prediction performance (3.44% RMS error). With the MI variable selection procedure, a great improvement can be obtained, as only 12 variables are retained. On the basis of these variables, a 12-input ANN model is trained, and the corresponding prediction error is 3.43% RMS.
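Ranking spectral variables by mutual information, independently of any regression model, can be sketched as below. The abstract does not say how MI was estimated, so this sketch assumes a simple equal-width histogram estimator; the function names, the bin count and the toy data are hypothetical.

```python
import math
from collections import Counter

def discretize(values, n_bins):
    """Map each value to an equal-width bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x, y, n_bins=4):
    """Histogram estimate of the mutual information I(x; y) in nats
    (an assumed estimator, chosen for simplicity)."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

def rank_by_mi(X, y, n_bins=4):
    """Column indices of X sorted by decreasing mutual information with y."""
    n_vars = len(X[0])
    scores = [mutual_information([row[j] for row in X], y, n_bins)
              for j in range(n_vars)]
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)

# Toy data (hypothetical): column 0 tracks y exactly, column 1 is constant.
y = [float(i) for i in range(8)]
X = [[v, 1.0] for v in y]
order = rank_by_mi(X, y)
mi_self = mutual_information(y, y)
```

The top-ranked columns would then be passed as inputs to whichever model is used for prediction, linear or non-linear.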