Found 20 similar documents; search took 15 ms
1.
Multivariate calibration problems often involve identifying a meaningful subset from a vast number of variables in order to better predict the output variables. A new graph-theoretic method based on partial correlations (variable interaction network, VIN) is proposed. Many well-studied representative calibration datasets spanning different application domains are selected for investigating its performance. Partial least squares (PLS) regression models combined with variable selection techniques are employed as benchmarks. Subsets with different numbers of variables are retained for the final analysis after VIN selection, and progressive prediction accuracies are used for comparison. VIN-PLS results show significant improvement in prediction efficiency and variable subset optimization. Improvement of up to 45% over existing methods is achieved with significantly fewer variables. Advantages of VIN-based variable selection are highlighted.
2.
Yong-Huan Yun Wei-Ting Wang Bai-Chuan Deng Guang-Bi Lai Xin-bo Liu Da-Bing Ren Yi-Zeng Liang Wei Fan Qing-Song Xu 《Analytica chimica acta》2015
Variable (wavelength or feature) selection techniques have become a critical step in the analysis of datasets with a high number of variables and relatively few samples. In this study, a novel variable selection strategy, variable combination population analysis (VCPA), was proposed. This strategy consists of two crucial procedures. First, the exponentially decreasing function (EDF), which follows the simple and effective principle of ‘survival of the fittest’ from Darwin’s natural evolution theory, is employed to determine the number of variables to keep and to continuously shrink the variable space. Second, in each EDF run, the binary matrix sampling (BMS) strategy, which gives each variable the same chance of being selected and generates different variable combinations, is used to produce a population of subsets from which a population of sub-models is constructed. Then, model population analysis (MPA) is employed to find the variable subsets with the lowest root mean square error of cross-validation (RMSECV). The frequency with which each variable appears in the best 10% of sub-models is computed. The higher the frequency, the more important the variable. The performance of the proposed procedure was investigated using three real NIR datasets. The results indicate that VCPA is a good variable selection strategy when compared with four high-performing variable selection methods: genetic algorithm–partial least squares (GA–PLS), Monte Carlo uninformative variable elimination by PLS (MC-UVE-PLS), competitive adaptive reweighted sampling (CARS) and iteratively retaining informative variables (IRIV). The MATLAB source code of VCPA is available for academic research on the website: http://www.mathworks.com/matlabcentral/fileexchange/authors/498750.
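The two sampling steps named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code. The decay rate `theta`, the function names and the `toy_error` score (a hypothetical stand-in for the RMSECV of a PLS sub-model) are all assumptions made for illustration.

```python
import math
import random

def edf_counts(n_vars, n_final, n_runs):
    """Variables kept after each run, decaying exponentially to n_final."""
    theta = math.log(n_vars / n_final) / n_runs  # assumed decay rate
    return [round(n_vars * math.exp(-theta * i)) for i in range(1, n_runs + 1)]

def binary_matrix_sampling(n_models, n_vars, ratio, rng):
    """Return n_models binary rows; every column holds the same number of ones,
    so each variable has the same chance of entering a sub-model."""
    ones = max(1, round(n_models * ratio))
    columns = []
    for _ in range(n_vars):
        col = [1] * ones + [0] * (n_models - ones)
        rng.shuffle(col)
        columns.append(col)
    # Transpose: one row per sub-model, one entry per variable.
    return [[columns[j][i] for j in range(n_vars)] for i in range(n_models)]

def selection_frequency(rows, errors, top_fraction=0.1):
    """Frequency of each variable among the lowest-error sub-models."""
    n_best = max(1, int(len(rows) * top_fraction))
    best = sorted(range(len(rows)), key=lambda i: errors[i])[:n_best]
    n_vars = len(rows[0])
    return [sum(rows[i][j] for i in best) / n_best for j in range(n_vars)]

def toy_error(row, informative=(0, 1, 2)):
    """Hypothetical score: penalizes missed and, more weakly, spurious variables."""
    chosen = {j for j, bit in enumerate(row) if bit}
    wanted = set(informative)
    return len(wanted - chosen) + 0.1 * len(chosen - wanted)

rng = random.Random(0)
rows = binary_matrix_sampling(n_models=200, n_vars=10, ratio=0.3, rng=rng)
errors = [toy_error(row) for row in rows]
print(edf_counts(100, 10, 5))
print([round(f, 2) for f in selection_frequency(rows, errors)])
```

In a real application the toy score would be replaced by cross-validated PLS error, and the frequencies would decide which variables survive into the next EDF run.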
3.
This paper presents a Bayesian approach to the development of spectroscopic calibration models. By formulating the linear regression in a probabilistic framework, a Bayesian linear regression model is derived, and a specific optimization method, i.e. Bayesian evidence approximation, is utilized to estimate the model “hyper-parameters”. The relation of the proposed approach to the calibration models in the literature is discussed, including ridge regression and Gaussian process models. The Bayesian model may be modified for the calibration of multivariate response variables. Furthermore, a variable selection strategy is implemented within the Bayesian framework, the motivation being that the predictive performance may be improved by selecting a subset of the most informative spectral variables. The Bayesian calibration models are applied to two spectroscopic data sets, and they demonstrate improved prediction results in comparison with the benchmark method of partial least squares.
4.
In multivariate calibration with spectral datasets, variable selection is often applied to identify a relevant subset of variables, leading to improved prediction accuracy and easier interpretation of the selected fingerprint regions. Numerous variable selection methods have been proposed, but choosing properly among them is not trivial. Furthermore, in many cases the set of variables found by those methods might not be robust because of irreproducibility and uncertainty, which poses a great challenge to improving the reliability of variable selection. In this study, the reproducibility of five variable selection methods was investigated quantitatively to evaluate their performance. The reproducibility of variable selection was quantified using Monte-Carlo sub-sampling (MCS) techniques together with a quantitative similarity measure designed for highly collinear spectral datasets. The investigation of the reproducibility and prediction accuracy of several variable selection algorithms with two different near-infrared (NIR) datasets illustrated that the different variable selection methods exhibited wide variability in their performance, especially in their ability to identify a consistent subset of variables from the spectral datasets. Thus a thorough assessment of reproducibility together with the predictive accuracy of the identified variables improved the statistical validity and confidence of the selection outcome, which cannot be addressed by conventional evaluation schemes.
5.
A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration (Cited by: 4; self-citations: 0; citations by others: 4)
Yong-Huan Yun Wei-Ting Wang Min-Li Tan Yi-Zeng Liang Hong-Dong Li Dong-Sheng Cao Hong-Mei Lu Qing-Song Xu 《Analytica chimica acta》2014
With the high dimensionality of today's datasets, creating effective methods that can select an optimal variable subset is a great challenge. In this study, a strategy that considers the possible interaction effects among variables through random combinations was proposed, called iteratively retaining informative variables (IRIV). The variables are classified into four categories: strongly informative, weakly informative, uninformative and interfering variables. On this basis, IRIV retains both the strongly and weakly informative variables in every iterative round until no uninformative or interfering variables remain. Three datasets were employed to investigate the performance of IRIV coupled with partial least squares (PLS). The results show that IRIV is a good alternative variable selection strategy when compared with three outstanding and frequently used variable selection methods: genetic algorithm-PLS, Monte Carlo uninformative variable elimination by PLS (MC-UVE-PLS) and competitive adaptive reweighted sampling (CARS). The MATLAB source code of IRIV can be freely downloaded for academic research at the website: http://code.google.com/p/multivariate-calibration/downloads/list.
6.
Jan Gerretzen Ewa Szymańska Jacob Bart Antony N. Davies Henk-Jan van Manen Edwin R. van den Heuvel Jeroen J. Jansen Lutgarde M.C. Buydens 《Analytica chimica acta》2016
The aim of data preprocessing is to remove data artifacts (such as a baseline, scatter effects or noise) and to enhance the contextually relevant information. Many preprocessing methods exist to deliver one or more of these benefits, but it is difficult to select which method or combination of methods should be used for the specific data being analyzed. Recently, we have shown that a preprocessing selection approach based on Design of Experiments (DoE) enables correct selection of highly appropriate preprocessing strategies within reasonable time frames.
7.
In this study, an algorithm for growing neural networks is proposed. Starting with an empty network the algorithm reduces the error of prediction by subsequently inserting connections and neurons. The type of network element and the location where to insert the element is determined by the maximum reduction of the error of prediction. The algorithm builds non-uniform neural networks without any constraints of size and complexity. The algorithm is additionally implemented into two frameworks, which use a data set limited in size very efficiently, resulting in a more reproducible variable selection and network topology.
The algorithm is applied to a data set of binary mixtures of the refrigerants R22 and R134a, which were measured by a surface plasmon resonance (SPR) device in a time-resolved mode. Compared with common static neural networks, all implementations of the growing neural networks show better generalization abilities, resulting in low relative errors of prediction of 0.75% for R22 and 1.18% for R134a using unknown data.
8.
9.
Reinaldo F. Teófilo João Paulo A. Martins Márcia M. C. Ferreira 《Journal of Chemometrics》2009,23(1):32-48
A new procedure with high ability to enhance prediction of multivariate calibration models with a small number of interpretable variables is presented. The core of this methodology is to sort the variables from an informative vector, followed by a systematic investigation of PLS regression models with the aim of finding the most relevant set of variables by comparing the cross‐validation parameters of the models obtained. In this work, seven main informative vectors, i.e. regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc) and signal‐to‐noise ratios vector (StN), and their combinations were automated and tested with the main purpose of feature selection. Six data sets from different sources were employed to validate this methodology. They originated from: near‐infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure‐activity relationships (QSAR) and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, regression and NAS informative vectors from partial least squares (PLS) regression, both built using more latent variables than when building the model presented in most of the tested data sets, were the best informative vectors for variable selection. In all the applications, the selected variables were quite effective and useful for interpretation. Copyright © 2008 John Wiley & Sons, Ltd.
10.
Bai-Chuan Deng Yong-Huan Yun Dong-Sheng Cao Yu-Long Yin Wei-Ting Wang Hong-Mei Lu Qian-Yi Luo Yi-Zeng Liang 《Analytica chimica acta》2016
In this study, a new variable selection method called bootstrapping soft shrinkage (BOSS) is developed. It is derived from the ideas of weighted bootstrap sampling (WBS) and model population analysis (MPA). The weights of variables are determined from the absolute values of regression coefficients. WBS is applied according to the weights to generate sub-models, and MPA is used to analyze the sub-models to update the weights of the variables. The optimization procedure follows the rule of soft shrinkage, in which less important variables are not eliminated directly but are assigned smaller weights. The algorithm runs iteratively and terminates when the number of variables reaches one. The variable set with the lowest root mean squared error of cross-validation (RMSECV) is selected as optimal. The method was tested on three groups of near infrared (NIR) spectroscopic datasets, i.e. corn datasets, diesel fuels datasets and soy datasets. Three high-performing variable selection methods, i.e. Monte Carlo uninformative variable elimination (MCUVE), competitive adaptive reweighted sampling (CARS) and genetic algorithm partial least squares (GA-PLS), are used for comparison. The results show that BOSS is promising, with improved prediction performance. The Matlab codes for implementing BOSS are freely available on the website: http://www.mathworks.com/matlabcentral/fileexchange/52770-boss.
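The weighted bootstrap step and the weight update described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the regression coefficients are passed in rather than fitted by PLS, and normalizing the summed absolute coefficients is an assumed form of the weight update.

```python
import random

def weighted_bootstrap(weights, rng):
    """Draw variables with replacement according to their weights and keep
    the unique ones; low-weight variables are rarely, but not never, drawn."""
    n_vars = len(weights)
    draws = rng.choices(range(n_vars), weights=weights, k=n_vars)
    return sorted(set(draws))

def update_weights(n_vars, submodels):
    """submodels: list of (variable_indices, coefficients) for the retained
    sub-models. New weights are the normalized sums of |coefficient|."""
    totals = [0.0] * n_vars
    for indices, coefs in submodels:
        for j, b in zip(indices, coefs):
            totals[j] += abs(b)
    norm = sum(totals)
    return [t / norm for t in totals]

rng = random.Random(0)
weights = [0.25, 0.25, 0.25, 0.25]            # start with equal weights
subset = weighted_bootstrap(weights, rng)      # variables of one sub-model
# Hypothetical coefficients for two retained sub-models:
weights = update_weights(4, [([0, 1], [2.0, -1.0]), ([1, 3], [1.0, 0.5])])
print(subset, weights)
```

Because a variable's weight shrinks gradually instead of being set to zero by a hard cut, a variable that looks weak in one round can still re-enter later sub-models.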
11.
An ensemble of Monte Carlo uninformative variable elimination for wavelength selection (Cited by: 1; self-citations: 0; citations by others: 1)
An improved method based on an ensemble of Monte Carlo uninformative variable elimination (EMCUVE) is presented for wavelength selection in multivariate calibration of spectral data. The proposed algorithm introduces a Monte Carlo (MC) strategy into uninformative variable elimination-PLS (UVE-PLS), in place of the leave-one-out strategy, for estimating the contribution of each wavelength variable to the PLS model. In EMCUVE, wavelength variables are evaluated by different Monte Carlo uninformative variable elimination (MCUVE) models. Moreover, fusing MCUVE with a vote rule yields an improvement over the original uninformative variable elimination method. Results obtained from simulated data and real data sets demonstrate that EMCUVE can properly carry out wavelength selection in the course of data analysis and improve the predictive ability of the multivariate calibration model.
12.
Riccardo Leardi Randy J. Pell 《Analytica chimica acta》2002,461(2):189-200
Variable selection using a genetic algorithm is combined with partial least squares (PLS) for the prediction of additive concentrations in polymer films using Fourier transform-infrared (FT-IR) spectral data. An approach using an iterative application of the genetic algorithm is proposed. This approach allows all variables to be considered and at the same time minimizes the risk of overfitting. We demonstrate that the variables selected by the genetic algorithm are consistent with expert knowledge. This very exciting result is a convincing demonstration that the algorithm can select correct variables in an automated fashion.
13.
Javier Moros 《Analytica chimica acta》2008,630(2):150-160
A new cut-off criterion has been proposed for the selection of uninformative variables prior to chemometric partial least squares (PLS) modelling. After variable elimination, PLS regressions were performed and assessed by comparing the results with those obtained by PLS models based on the full spectral range. To assess the prediction capabilities, uninformative variable elimination (UVE)-PLS and PLS were applied to diffuse reflectance near-infrared spectra of heroin samples. The proposed new cut-off criterion, based on the Student's t-distribution, gave the PLS models predictive capabilities similar to those obtained using the original criterion based on a quantile value. However, the repeatability of the number of selected variables was improved significantly.
14.
Analysis of amino acids in complex samples by using voltammetry and multivariate calibration methods
A voltammetric method is proposed for the simultaneous determination of tryptophan, cysteine, and tyrosine using multivariate calibration techniques. Various electrodes and voltammetric techniques were explored to ascertain the optimum measurement strategy. Among them, differential pulse voltammetry (DPV) with a Pt electrode was selected as the analytical technique since it provided a suitable compromise between sensitivity and reproducibility while allowing the oxidation peaks of the three compounds to be reasonably discriminated. The sensitivity of DPV with a Pt electrode for Trp standards was 8.4 × 10⁻² A l mol⁻¹, the repeatability 3.7% and the detection limit below 10⁻⁷ M. The lack of full selectivity of the voltammetric data was overcome using multivariate calibration methods on the basis of the differences in the voltammetric waves of each compound. The accuracy of predictions was evaluated preliminarily from the analysis of three-component synthetic mixtures. Subsequently, this method was applied to the analysis of oxidizable amino acids in feed samples. Results obtained were in good concordance with those given by the standard method using an amino acid analyzer.
15.
Marengo E Robotti E Bobba M Milli A Campostrini N Righetti SC Cecconi D Righetti PG 《Analytical and bioanalytical chemistry》2008,390(5):1327-1342
2D gel electrophoresis is a tool for measuring protein regulation, involving image analysis by dedicated software (PDQuest, Melanie, etc.). Here, partial least squares discriminant analysis was applied to improve the results obtained by classic image analysis and to identify the significant spots responsible for the differences between two datasets. A human colon cancer HCT116 cell line was analyzed, treated and not treated with a new histone deacetylase inhibitor, RC307. The proteins regulated by RC307 were detected by analyzing the total lysates and nuclear proteome profiles. Some of the regulated spots were identified by tandem mass spectrometry. The preliminary data are encouraging and the protein modulation reported is consistent with the antitumoral effect of RC307 on the HCT116 cell line. Partial least squares discriminant analysis coupled with backward elimination variable selection allowed the identification of a larger number of spots than classic PDQuest analysis. Moreover, it allows the achievement of the best performances of the model in terms of prediction and provides therefore more robust and reliable results. From this point of view, the multivariate procedure applied can be considered a good alternative to standard differential analysis, also taking into account the interdependencies existing among the variables.
16.
We present a novel algorithm for linear multivariate calibration that can generate good prediction results. It rests on the idea that test samples can be regarded as mixtures of the calibration samples in proper proportions. The algorithm is based on this mixed model of samples and is therefore called the MMS algorithm. With both theoretical support and the analysis of two data sets, it is demonstrated that the MMS algorithm produces lower prediction errors than the partial least squares PLS2 model and has prediction performance similar to PLS1. In the background anti-interference test, the MMS algorithm performs better than PLS2. When some component information is lacking, the MMS algorithm shows better robustness than PLS2.
17.
Jiyong Shi Xuetao Hu Xiaobo Zou Jiewen Zhao Wen Zhang Xiaowei Huang Yaodi Zhu Zhihua Li Yiwei Xu 《Journal of Chemometrics》2016,30(8):442-450
A new heuristic and parallel simulated annealing algorithm was proposed for variable selection in near‐infrared spectroscopy analysis. The algorithm employs a parallel mechanism to enhance the search efficiency, a heuristic mechanism to generate high‐quality candidate solutions, and the concept of the Metropolis criterion to estimate the accuracy of the candidate solutions. Several near‐infrared datasets have been evaluated under the proposed new algorithm, with partial least squares leading to improved analytical figures of merit upon wavelength selection. Improved robust and predictive regression models were obtained by the new algorithm. The method could also be helpful in other chemometric activities such as classification or quantitative structure‐activity relationship problems.
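The Metropolis criterion mentioned in this abstract can be shown with a plain, serial simulated annealing sketch in Python. It deliberately omits the parallel and heuristic mechanisms of the proposed algorithm. The cooling schedule, parameter values and the Hamming-distance cost (a hypothetical stand-in for the cross-validated PLS error of a wavelength subset) are assumptions.

```python
import math
import random

def metropolis_accept(delta, temperature, rng):
    """Always accept improvements; accept a worse candidate with
    probability exp(-delta / temperature)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)

def anneal_subset(n_vars, cost, rng, t_start=1.0, t_end=1e-3,
                  cooling=0.95, moves_per_temp=20):
    """Search over boolean variable masks by flipping one variable per move."""
    current = [rng.random() < 0.5 for _ in range(n_vars)]
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            candidate = current[:]
            j = rng.randrange(n_vars)
            candidate[j] = not candidate[j]      # add or drop one variable
            candidate_cost = cost(candidate)
            if metropolis_accept(candidate_cost - current_cost, temperature, rng):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current[:], current_cost
        temperature *= cooling                   # geometric cooling (assumed)
    return best, best_cost

# Hypothetical cost: distance to a known "ideal" mask, for demonstration only.
target = [True, False, True, True, False, False, True, False]
toy_cost = lambda mask: sum(a != b for a, b in zip(mask, target))
print(anneal_subset(len(target), toy_cost, random.Random(0)))
```

At high temperature the search accepts many worse subsets and explores widely; as the temperature falls it behaves more and more like greedy descent.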
18.
Most multivariate calibration methods, such as partial least squares (PLS) or the Tikhonov regularization variant ridge regression (RR), require selection of tuning parameters. Tuning parameter values determine the direction and magnitude of the respective model vectors, thereby setting the resultant prediction abilities of the model vectors. Simultaneously, tuning parameter values establish the corresponding bias/variance and the underlying selectivity/sensitivity tradeoffs. Selection of the final tuning parameter is often accomplished through some form of cross-validation, and the resultant root mean square error of cross-validation (RMSECV) values are evaluated. However, selection of a “good” tuning parameter with this one model evaluation merit is almost impossible. Including additional model merits assists tuning parameter selection to provide better balanced models as well as allowing for a reasonable comparison between calibration methods. Using multiple merits requires decisions to be made on how to combine and weight the merits into an information criterion. An abundance of options are possible. Presented in this paper is the sum of ranking differences (SRD) to ensemble a collection of model evaluation merits varying across tuning parameters. It is shown that the SRD consensus ranking of model tuning parameters allows automatic selection of the final model, or a collection of models if so desired. Essentially, the user’s preference for the degree of balance between bias and variance ultimately decides the merits used in SRD and hence, the tuning parameter values ranked lowest by SRD for automatic selection. The SRD process is also shown to allow simultaneous comparison of different calibration methods for a particular data set in conjunction with tuning parameter selection. Because SRD evaluates consistency across multiple merits, decisions on how to combine and weight merits are avoided.
To demonstrate the utility of SRD, a near infrared spectral data set and a quantitative structure activity relationship (QSAR) data set are evaluated using PLS and RR.
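The generic sum of ranking differences calculation can be sketched in Python as follows. This is a minimal illustration, not the paper's procedure: the reference vector is supplied by the caller, ties receive average ranks, and the merit values and tuning-parameter labels in the example are hypothetical.

```python
def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def srd(column, reference):
    """Sum of absolute differences between the ranks of a column and the
    ranks of the reference; smaller means closer to the reference."""
    col_ranks = average_ranks(column)
    ref_ranks = average_ranks(reference)
    return sum(abs(c - r) for c, r in zip(col_ranks, ref_ranks))

# Hypothetical example: three candidate tuning parameters, each described by
# four merit values (already scaled so that smaller is better).
candidates = {
    "lambda=0.1": [0.30, 0.10, 0.50, 0.20],
    "lambda=1.0": [0.20, 0.15, 0.40, 0.25],
    "lambda=10":  [0.50, 0.40, 0.10, 0.05],
}
reference = [0.20, 0.10, 0.40, 0.30]   # assumed reference values per merit
scores = {name: srd(col, reference) for name, col in candidates.items()}
print(sorted(scores.items(), key=lambda item: item[1]))
```

The candidate with the smallest SRD value is the one whose ordering of merits agrees best with the reference, which is how a consensus ranking across several merits can be read off without choosing explicit weights.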
19.
A new variable selection algorithm is described, based on ant colony optimization (ACO). The aim of the algorithm is to choose, from a large number of available spectral wavelengths, those relevant to the estimation of analyte concentrations or sample properties when spectroscopic analysis is combined with multivariate calibration techniques such as partial least-squares (PLS) regression. The new algorithm employs the concept of cooperative pheromone accumulation, which is typical of ACO selection methods, and optimizes PLS models using a pre-defined number of variables, employing a Monte Carlo approach to discard irrelevant sensors. The performance has been tested on a simulated system, where it shows a significant superiority over other commonly employed selection methods, such as genetic algorithms. Several near infrared spectroscopic experimental data sets have been subjected to the present ACO algorithm, with PLS leading to improved analytical figures of merit upon wavelength selection. The method could be helpful in other chemometric activities such as classification or quantitative structure-activity relationship (QSAR) problems.
20.
Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling (Cited by: 8; self-citations: 0; citations by others: 8)
Hector C. Keun Timothy M. D. Ebbels Henrik Antti Mary E. Bollard Olaf Beckonert Elaine Holmes John C. Lindon Jeremy K. Nicholson 《Analytica chimica acta》2003,490(1-2):265-276
Variable scaling alters the covariance structure of data, affecting the outcome of multivariate analysis and calibration. Here we present a new method, variable stability (VAST) scaling, which weights each variable according to a metric of its stability. The beneficial effect of VAST scaling is demonstrated for a data set of ¹H NMR spectra of urine acquired as part of a metabonomic study into the effects of unilateral nephrectomy in an animal model. The application of VAST scaling improved the class distinction and predictive power of partial least squares discriminant analysis (PLS-DA) models. The effects of other data scaling and pre-processing methods, such as orthogonal signal correction (OSC), were also tested. VAST scaling produced the most robust models in terms of class prediction, outperforming OSC in this aspect. As a result the subtle, but consistent, metabolic perturbation caused by unilateral nephrectomy could be accurately characterised despite the presence of much greater biological differences caused by normal physiological variation. VAST scaling presents itself as an interpretable, robust and easily implemented data treatment for the enhancement of multivariate data analysis.
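A stability-weighted scaling of this kind can be sketched in Python. The abstract does not state the stability metric, so the sketch assumes it is the inverse coefficient of variation (mean divided by standard deviation) applied after autoscaling; the function name and the example data are hypothetical.

```python
import statistics

def vast_scale(columns):
    """columns: one list of sample values per variable.
    Each variable is autoscaled, then multiplied by mean / std
    (assumed stability metric). Constant variables are not handled."""
    scaled = []
    for col in columns:
        mean = statistics.mean(col)
        sd = statistics.stdev(col)
        stability = mean / sd        # large for stable, low-variation signals
        scaled.append([(x - mean) / sd * stability for x in col])
    return scaled

# Two hypothetical spectral variables measured on three samples:
stable = [10.0, 10.5, 11.0]      # small variation relative to its mean
unstable = [1.0, 10.0, 19.0]     # large variation relative to its mean
for name, col in zip(("stable", "unstable"), vast_scale([stable, unstable])):
    print(name, [round(v, 2) for v in col])
```

Under this assumption, plain autoscaling would give both variables unit variance, whereas the extra weight lets the stable variable contribute more to a subsequent PLS-DA model than the noisy one.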