Similar literature
1.
Multivariate calibration problems often involve identifying a meaningful subset of variables from a vast number of candidates to better predict the output variables. A new graph-theoretic method based on partial correlations (variable interaction network, VIN) is proposed. Several well-studied, representative calibration datasets spanning different application domains are selected to investigate its performance. Partial least squares (PLS) regression models combined with variable selection techniques are employed for benchmarking. After VIN selection, subsets with different numbers of variables are retained for the final analysis, and progressive prediction accuracies are used for comparison. VIN-PLS results show significant improvement in prediction efficiency and variable subset optimization. An improvement of up to 45% over existing methods, with significantly fewer variables, is achieved using the new method. The advantages of VIN-based variable selection are highlighted.
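
The abstract above builds its network on partial correlations but does not give the construction itself. As a minimal sketch of the underlying quantity only, the following computes a first-order partial correlation with the standard textbook formula. The function names and the data are hypothetical, and this is not the authors' VIN algorithm.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical data: x and y both follow z, but deviate from it in
# opposite directions, so their direct relationship is negative.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
y = [0.9, 2.1, 2.8, 4.2, 4.9, 6.1]

marginal = pearson(x, y)            # high, because both track z
conditional = partial_corr(x, y, z)  # strongly negative once z is removed
```

The contrast between the two values is the reason a partial-correlation network can differ sharply from a plain correlation network on collinear spectral data.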

2.
Variable (wavelength or feature) selection has become a critical step in the analysis of datasets with a large number of variables and relatively few samples. In this study, a novel variable selection strategy, variable combination population analysis (VCPA), is proposed. The strategy consists of two crucial procedures. First, the exponentially decreasing function (EDF), a simple and effective embodiment of the ‘survival of the fittest’ principle from Darwin’s theory of natural evolution, is employed to determine the number of variables to keep and to continuously shrink the variable space. Second, in each EDF run, a binary matrix sampling (BMS) strategy, which gives each variable the same chance of being selected and generates different variable combinations, is used to produce a population of subsets from which a population of sub-models is constructed. Model population analysis (MPA) is then employed to find the variable subsets with the lower root mean square error of cross-validation (RMSECV). The frequency with which each variable appears in the best 10% of sub-models is computed; the higher the frequency, the more important the variable. The performance of the proposed procedure was investigated using three real NIR datasets. The results indicate that VCPA is a good variable selection strategy when compared with four high-performing variable selection methods: genetic algorithm-partial least squares (GA-PLS), Monte Carlo uninformative variable elimination by PLS (MC-UVE-PLS), competitive adaptive reweighted sampling (CARS) and iteratively retaining informative variables (IRIV). The MATLAB source code of VCPA is available for academic research on the website: http://www.mathworks.com/matlabcentral/fileexchange/authors/498750.
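
The two steps described above (binary matrix sampling, then counting how often each variable appears in the best 10% of sub-models) can be sketched as follows. This is an illustration under stated assumptions, not the released MATLAB code: the function names are hypothetical, and `toy_error` is a stand-in for the RMSECV of a real PLS sub-model.

```python
import random

def binary_matrix_sampling(n_models, n_vars, ratio=0.5, seed=0):
    """Return n_models rows of 0/1 flags; every variable is picked equally often."""
    rng = random.Random(seed)
    ones = round(n_models * ratio)
    columns = []
    for _ in range(n_vars):
        col = [1] * ones + [0] * (n_models - ones)
        rng.shuffle(col)  # permute each column independently
        columns.append(col)
    # Transpose so that each row describes one sub-model.
    return [[columns[j][i] for j in range(n_vars)] for i in range(n_models)]

def selection_frequency(matrix, scores, top_fraction=0.1):
    """Frequency of each variable among the lowest-error rows."""
    n_best = max(1, int(len(matrix) * top_fraction))
    best = sorted(range(len(matrix)), key=lambda i: scores[i])[:n_best]
    n_vars = len(matrix[0])
    return [sum(matrix[i][j] for i in best) / n_best for j in range(n_vars)]

def toy_error(row):
    # Hypothetical stand-in for RMSECV: variables 0 and 1 are informative,
    # every other included variable adds a small penalty.
    return 2.0 - row[0] - row[1] + 0.1 * sum(row[2:])

matrix = binary_matrix_sampling(n_models=100, n_vars=6)
scores = [toy_error(row) for row in matrix]
freq = selection_frequency(matrix, scores)
```

Because each column holds the same number of ones before shuffling, no variable is favoured by the sampling itself; any difference in `freq` comes from the sub-model scores.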

3.
This paper presents a Bayesian approach to the development of spectroscopic calibration models. By formulating linear regression in a probabilistic framework, a Bayesian linear regression model is derived, and a specific optimization method, Bayesian evidence approximation, is used to estimate the model “hyper-parameters”. The relation of the proposed approach to calibration models in the literature, including ridge regression and Gaussian process models, is discussed. The Bayesian model may be modified for the calibration of multivariate response variables. Furthermore, a variable selection strategy is implemented within the Bayesian framework, the motivation being that predictive performance may be improved by selecting a subset of the most informative spectral variables. The Bayesian calibration models are applied to two spectroscopic data sets and give improved prediction results in comparison with the benchmark method of partial least squares.

4.
2D gel electrophoresis is a tool for measuring protein regulation that involves image analysis by dedicated software (PDQuest, Melanie, etc.). Here, partial least squares discriminant analysis was applied to improve the results obtained by classic image analysis and to identify the significant spots responsible for the differences between two datasets. A human colon cancer HCT116 cell line was analyzed, both treated and not treated with a new histone deacetylase inhibitor, RC307. The proteins regulated by RC307 were detected by analyzing the total lysates and nuclear proteome profiles. Some of the regulated spots were identified by tandem mass spectrometry. The preliminary data are encouraging, and the protein modulation reported is consistent with the antitumoral effect of RC307 on the HCT116 cell line. Partial least squares discriminant analysis coupled with backward elimination variable selection allowed the identification of a larger number of spots than classic PDQuest analysis. Moreover, it yields the best model performance in terms of prediction and therefore provides more robust and reliable results. From this point of view, the multivariate procedure applied can be considered a good alternative to standard differential analysis, one that also takes into account the interdependencies among the variables.

5.
A critical step in multivariate calibration is wavelength selection, which is used to build models with better prediction performance when applied to spectral data. Many feature selection techniques have been developed to date. Among them, those based on swarm intelligence optimization are of particular interest, since they usually simulate the behavior of animals and insects, for example in finding the shortest path between a food source and the nest. The decision is made by a crowd, leading to a more robust model that is less prone to falling into local minima during the optimization cycle.

6.
In multivariate calibration with spectral datasets, variable selection is often applied to identify a relevant subset of variables, leading to improved prediction accuracy and easier interpretation of the selected fingerprint regions. Numerous variable selection methods have been proposed, but choosing among them is not trivial. Furthermore, in many cases the set of variables found by these methods may not be robust because of irreproducibility and uncertainty, which poses a great challenge to improving the reliability of variable selection. In this study, the reproducibility of five variable selection methods was investigated quantitatively to evaluate their performance. The reproducibility of variable selection was quantified using Monte-Carlo sub-sampling (MCS) together with a quantitative similarity measure designed for highly collinear spectral datasets. The investigation of the reproducibility and prediction accuracy of several variable selection algorithms on two different near-infrared (NIR) datasets showed that the methods varied widely in performance, especially in their ability to identify a consistent subset of variables from the spectral datasets. A thorough assessment of reproducibility together with the predictive accuracy of the identified variables thus improved the statistical validity of, and confidence in, the selection outcome, which conventional evaluation schemes cannot address.
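
One way to picture the reproducibility assessment above is to run a selector on many random sub-samples and average the pairwise similarity of the selected subsets. The sketch below makes two loud assumptions: it swaps in the plain Jaccard index (the study uses a similarity measure designed for collinear spectra, which is not reproduced here), and it uses a hypothetical top-k correlation selector in place of the five methods studied.

```python
import itertools
import math
import random

def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def select_top_k(X, y, k):
    """Hypothetical selector: the k columns most correlated with y."""
    n_vars = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(n_vars)]
    return set(sorted(range(n_vars), key=lambda j: -scores[j])[:k])

def jaccard(a, b):
    """Jaccard index of two non-empty sets (assumed similarity measure)."""
    return len(a & b) / len(a | b)

def reproducibility(X, y, k, n_runs=20, fraction=0.8, seed=0):
    """Mean pairwise Jaccard similarity of subsets selected on sub-samples."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_runs):
        idx = rng.sample(range(n), int(n * fraction))
        subsets.append(select_top_k([X[i] for i in idx], [y[i] for i in idx], k))
    pairs = list(itertools.combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value near 1 means the selector returns nearly the same variables on every sub-sample; a value near 0 means the selection is driven mostly by which samples happened to be drawn.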

7.
Recently we proposed a new variable selection algorithm for classification problems, based on the concept of clustering of variables (CLoVA). Here the same concept is applied to regression problems, and the results are compared with conventional variable selection strategies for PLS. The basic idea behind clustering of variables is that the instrument channels are grouped into clusters by clustering algorithms; the spectral data of each cluster are then subjected to PLS regression. Different real data sets (Cargill corn, Biscuit dough, ACE QSAR, Soy, and Tablet) were used to evaluate the influence of the clustering of variables on the prediction performance of PLS. In almost all cases, the statistical parameters, especially the prediction error, show the superiority of CLoVA-PLS over other variable selection strategies. Finally, synergy clustering of variables (sCLoVA-PLS), which uses combinations of clusters, is proposed as an efficient modification of the CLoVA algorithm. The statistical parameters obtained indicate that variable clustering can separate the useful part of the data from the redundant parts, so that a stable model can be built on the informative clusters.

8.
Han QJ, Wu HL, Cai CB, Xu L, Yu RQ. Analytica Chimica Acta, 2008, 612(2): 121-125
An improved method based on an ensemble of Monte Carlo uninformative variable elimination (EMCUVE) is presented for wavelength selection in multivariate calibration of spectral data. The proposed algorithm introduces a Monte Carlo (MC) strategy into uninformative variable elimination-PLS (UVE-PLS), in place of the leave-one-out strategy, for estimating the contribution of each wavelength variable to the PLS model. In EMCUVE, wavelength variables are evaluated by different Monte Carlo uninformative variable elimination (MCUVE) models. Moreover, fusing MCUVE with the vote rule yields an improvement over the original uninformative variable elimination method. Results obtained from simulated data and real data sets demonstrate that EMCUVE can properly carry out wavelength selection in the course of data analysis and improve the predictive ability of multivariate calibration models.
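
A minimal sketch of the two ingredients named above, under stated assumptions: the stability score is taken to be the usual UVE-style ratio of the mean to the standard deviation of a variable's regression coefficient across Monte Carlo runs, and the vote rule is taken to be a simple vote count with a threshold. The abstract confirms neither detail, the function names are hypothetical, and the coefficients are assumed to come from PLS models fitted elsewhere.

```python
import statistics

def stability(coef_runs):
    """coef_runs[r][j] is the coefficient of variable j in Monte Carlo run r.

    Returns mean/stdev per variable (assumed criterion); a variable whose
    coefficient is identical in every run would raise ZeroDivisionError.
    """
    n_vars = len(coef_runs[0])
    scores = []
    for j in range(n_vars):
        column = [run[j] for run in coef_runs]
        scores.append(statistics.mean(column) / statistics.stdev(column))
    return scores

def vote(selections, n_vars, min_votes):
    """Keep the variables selected by at least min_votes ensemble members."""
    counts = [sum(j in s for s in selections) for j in range(n_vars)]
    return [j for j in range(n_vars) if counts[j] >= min_votes]

# Hypothetical coefficients from three Monte Carlo runs, two variables.
coef_runs = [[1.0, 0.5], [1.2, -0.5], [0.8, 0.1]]
scores = stability(coef_runs)

# Hypothetical selections from three MCUVE models over three variables.
kept = vote([{0, 1}, {0}, {0, 2}], n_vars=3, min_votes=2)
```

Variable 0 has a coefficient that is large and consistent in sign, so its score is far from zero; variable 1 flips sign between runs and scores close to zero, which is the signature of an uninformative variable.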

9.
In this study, a new variable selection method called the bootstrapping soft shrinkage (BOSS) method is developed. It is derived from the ideas of weighted bootstrap sampling (WBS) and model population analysis (MPA). The weights of the variables are determined from the absolute values of the regression coefficients. WBS is applied according to the weights to generate sub-models, and MPA is used to analyze the sub-models and update the variable weights. The optimization procedure follows the rule of soft shrinkage, in which less important variables are not eliminated directly but are assigned smaller weights. The algorithm runs iteratively and terminates when the number of variables reaches one. The variable set with the lowest root mean squared error of cross-validation (RMSECV) is selected as optimal. The method was tested on three groups of near infrared (NIR) spectroscopic datasets: corn, diesel fuel and soy. Three high-performing variable selection methods, Monte Carlo uninformative variable elimination (MCUVE), competitive adaptive reweighted sampling (CARS) and genetic algorithm partial least squares (GA-PLS), are used for comparison. The results show that BOSS is promising, with improved prediction performance. The Matlab codes for implementing BOSS are freely available on the website: http://www.mathworks.com/matlabcentral/fileexchange/52770-boss.
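
The soft-shrinkage loop above rests on two small operations: drawing variables with probability proportional to their weights, and recomputing the weights from absolute regression coefficients. The sketch below illustrates both. The function names are hypothetical, the normalization used in `update_weights` is an assumption rather than the published procedure, and the coefficients are assumed to come from sub-models fitted elsewhere.

```python
import random

def weighted_bootstrap(weights, n_draws, rng):
    """Draw variable indices with replacement, proportional to weights,
    and keep the unique ones as the variable subset of one sub-model."""
    drawn = rng.choices(range(len(weights)), weights=weights, k=n_draws)
    return sorted(set(drawn))

def update_weights(best_coefs):
    """New weights: summed absolute coefficients over the best sub-models,
    normalized to add up to one (assumed normalization)."""
    n_vars = len(best_coefs[0])
    totals = [sum(abs(c[j]) for c in best_coefs) for j in range(n_vars)]
    total = sum(totals)
    return [t / total for t in totals]

rng = random.Random(1)
subset = weighted_bootstrap([0.5, 0.5, 0.0, 0.0], n_draws=4, rng=rng)

# Hypothetical coefficients of two retained sub-models over three variables.
new_weights = update_weights([[1.0, -1.0, 0.0], [3.0, 0.0, 0.0]])
```

Because duplicates are discarded after sampling with replacement, the number of distinct variables shrinks from one iteration to the next, while a low-weight variable still has a chance of being drawn.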

10.
Bio-pharmaceutical manufacturing is a multifaceted and complex process in which hundreds of processing variables and raw materials are monitored during the manufacture of a single batch. In these processes, identifying the candidate variables responsible for any change in process performance can prove extremely challenging. Within this context, partial least squares (PLS) has proven to be an important tool in helping determine the root cause of changes in biological performance, such as cellular growth or viral propagation. In spite of the positive impact PLS has had on the understanding of bio-pharmaceutical process data, the high variability in the measured response (Y) and predictor variables (X), and the weak relationship between X and Y, have at times made root cause determination for process changes difficult. Our goal is to demonstrate how bootstrapping, in conjunction with permutation tests, can improve the selection of variables responsible for manufacturing process changes via the variable importance in the projection (PLS-VIP) statistic. Although applied only to PLS-VIP in this article, the methods are general and can be used to improve other variable selection methods and to increase confidence in other estimates obtained from a PLS model.

11.
An efficient method for detecting malicious and accidental contamination of foods has been developed using a combined 1H nuclear magnetic resonance (NMR) and chemometrics approach. The method has been demonstrated on a commercially available carbonated soft drink and shown to be capable of identifying atypical products and contaminant resonances. Soft-independent modelling of class analogy (SIMCA) was used to compare 1H NMR profiles of genuine products (obtained from the manufacturer) against retail products spiked in the laboratory with impurities. The benefits of using feature selection for extracting contaminant NMR frequencies were also assessed. Using example impurities (paraquat, p-cresol and glyphosate), NMR spectra were analysed using multivariate methods, resulting in detection limits of approximately 0.075, 0.2, and 0.06 mM for p-cresol, paraquat and glyphosate, respectively. These detection limits are shown to be approximately 100-fold lower than the minimum lethal dose for paraquat. The methodology presented here is used to assess the composition of complex matrices for the presence of contaminating molecules without a priori knowledge of the nature of potential contaminants. The ability to detect that a sample does not fit the expected profile, without recourse to multiple targeted analyses, is a valuable tool for incident detection and forensic applications.

12.
13.
With today's high-dimensional datasets, creating effective methods that can select an optimal variable subset is a great challenge. In this study, a strategy called iteratively retaining informative variables (IRIV), which considers possible interaction effects among variables through random combinations, is proposed. The variables are classified into four categories: strongly informative, weakly informative, uninformative and interfering. On this basis, IRIV retains both the strongly and the weakly informative variables in every iterative round until no uninformative or interfering variables remain. Three datasets were employed to investigate the performance of IRIV coupled with partial least squares (PLS). The results show that IRIV is a good alternative variable selection strategy when compared with three outstanding and frequently used variable selection methods: genetic algorithm-PLS, Monte Carlo uninformative variable elimination by PLS (MC-UVE-PLS) and competitive adaptive reweighted sampling (CARS). The MATLAB source code of IRIV can be freely downloaded for academic research at the website: http://code.google.com/p/multivariate-calibration/downloads/list.

14.
In this work we evaluated different variable selection techniques combined with partial least-squares regression (PLS), namely genetic algorithm PLS (GA-PLS), interval PLS (iPLS) and synergy interval PLS (siPLS), in the simultaneous determination of Cd(II), Cu(II), Pb(II) and Zn(II) by anodic stripping voltammetry at a bismuth film. In general, variable selection improved the prediction results compared with full-voltammogram PLS. Interval-selection-based algorithms proved more adequate than the selection of discrete variables by GA. Excellent analytical performance was obtained despite the inherent complexity of the simultaneous determination.

15.
Pierce KM, Schale SP. Talanta, 2011, 83(4): 1254-1259
The percent composition of blends of biodiesel and conventional diesel from a variety of retail sources was modeled and predicted using partial least squares (PLS) analysis applied to gas chromatography-total-ion-current mass spectrometry (GC-TIC), gas chromatography-mass spectrometry (GC-MS), comprehensive two-dimensional gas chromatography-total-ion-current mass spectrometry (GCxGC-TIC) and comprehensive two-dimensional gas chromatography-mass spectrometry (GCxGC-MS) separations of the blends. In all four cases, the PLS predictions for a test set of chromatograms were plotted versus the actual blend percent composition. The GC-TIC plot produced a best-fit line with slope = 0.773 and y-intercept = 2.89, and the average percent error of prediction was 12.0%. The GC-MS plot produced a best-fit line with slope = 0.864 and y-intercept = 1.72, and the average percent error of prediction improved to 6.89%. The GCxGC-TIC plot produced a best-fit line with slope = 0.983 and y-intercept = 0.680, and the average percent error improved slightly to 6.16%. The GCxGC-MS plot produced a best-fit line with slope = 0.980 and y-intercept = 0.620, and the average percent error was 6.12%. The GCxGC models performed best, presumably because higher-dimensional instrumentation provides more chemical selectivity. All the PLS models used 3 latent variables. The chemical components that differentiate the blend percent compositions are reported.
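
The figures of merit quoted above (slope and y-intercept of the predicted-versus-actual line, and the average percent error) can be computed as in the following sketch. It assumes an ordinary least-squares line and defines the average percent error as the mean absolute relative error; the abstract does not state the exact definition used, and the data shown are hypothetical.

```python
def best_fit_line(actual, predicted):
    """Least-squares slope and intercept of predicted plotted against actual."""
    n = len(actual)
    mx, my = sum(actual) / n, sum(predicted) / n
    sxy = sum((a - mx) * (p - my) for a, p in zip(actual, predicted))
    sxx = sum((a - mx) ** 2 for a in actual)
    slope = sxy / sxx
    return slope, my - slope * mx

def average_percent_error(actual, predicted):
    """Mean of |predicted - actual| / actual, in percent (assumed definition)."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical blend percentages and PLS predictions.
actual = [10.0, 20.0, 40.0, 80.0]
predicted = [11.0, 19.0, 42.0, 78.0]
slope, intercept = best_fit_line(actual, predicted)
error = average_percent_error(actual, predicted)
```

A perfect model gives slope 1 and intercept 0, which is why the GCxGC slopes of 0.983 and 0.980 read as better than the GC-TIC slope of 0.773.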

16.
The calibration performance of partial least squares regression for one response (PLS1) can be improved by eliminating uninformative variables. Many variable-reduction methods are based on so-called predictor-variable properties or predictive properties, which are functions of various PLS-model parameters and which may change during the steps of the variable-reduction process. Recently, a new predictive-property-ranked variable reduction method with final complexity adapted models, denoted PPRVR-FCAM or simply FCAM, was introduced. It is a backward variable elimination method applied to the predictive-property-ranked variables. The number of variables is first reduced, at constant PLS1 model complexity A, until A variables remain; the PLS complexity is then decreased further, allowing the final selection of small numbers of variables.

17.
HPLC with acidic potassium permanganate chemiluminescence detection was employed to analyse 17 Cabernet Sauvignon wines across a range of vintages (1971-2003). Partial least squares regression analysis and principal components analysis were used to investigate the relationship between wine composition and vintage. Tartaric acid, vanillic acid, catechin, sinapic acid, ethyl gallate, myricetin, procyanidin B and resveratrol were found to be important components in terms of differences between the vintages.

18.
Variable scaling alters the covariance structure of data, affecting the outcome of multivariate analysis and calibration. Here we present a new method, variable stability (VAST) scaling, which weights each variable according to a metric of its stability. The beneficial effect of VAST scaling is demonstrated for a data set of 1H NMR spectra of urine acquired as part of a metabonomic study into the effects of unilateral nephrectomy in an animal model. The application of VAST scaling improved the class distinction and predictive power of partial least squares discriminant analysis (PLS-DA) models. The effects of other data scaling and pre-processing methods, such as orthogonal signal correction (OSC), were also tested. VAST scaling produced the most robust models in terms of class prediction, outperforming OSC in this respect. As a result, the subtle but consistent metabolic perturbation caused by unilateral nephrectomy could be accurately characterised despite the presence of much greater biological differences caused by normal physiological variation. VAST scaling presents itself as an interpretable, robust and easily implemented data treatment for the enhancement of multivariate data analysis.
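
The abstract above says only that each variable is weighted by a stability metric. The sketch below assumes that metric is the inverse coefficient of variation (mean divided by standard deviation) applied on top of autoscaling, which is one common reading of VAST scaling; treat both the formula and the function name as assumptions.

```python
import statistics

def vast_scale(column):
    """Autoscale one variable, then weight it by mean/stdev (assumed metric)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    weight = mean / sd  # large for stable variables, small for noisy ones
    return [(value - mean) / sd * weight for value in column]

# Hypothetical intensities of two variables with the same mean.
stable = [10.0, 10.1, 9.9]
unstable = [10.0, 15.0, 5.0]

scaled_stable = vast_scale(stable)
scaled_unstable = vast_scale(unstable)
```

After plain autoscaling both variables would have unit variance; the extra weight lets the stable variable carry more influence in a subsequent PLS-DA model than the unstable one.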

19.
The paper presents an approach that applies Partial Least Squares Discriminant Analysis (PLS-DA) to an X-ray powder diffractometry (XRPD) dataset to build a model that recognizes the presence (or absence) of a particular drug substance (acetaminophen) in an unknown mixture (OTC tablet). The dataset consisted of 33 XRPD signals, measured for 12 pure substances and 21 tablets containing them in different quantitative and qualitative ratios, along with unknown excipients. The model was built with an external validation dataset chosen by the Kennard-Stone algorithm. The RMSECV value was 0.3461 (87.8% of explained variance) and the external predictive error (RMSEP) was 0.3123 (86.2% of explained variance). The result suggests that small but properly prepared training datasets make it possible to construct well-working discriminant models on XRPD signals.
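
For readers unfamiliar with the Kennard-Stone algorithm mentioned above, here is a minimal sketch of its usual max-min distance form with Euclidean distances. It is not the authors' code, the function name and the data are hypothetical, and which of the two resulting sets serves as the validation set is left open here.

```python
import math

def kennard_stone(X, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(X)
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select and remaining:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Hypothetical one-dimensional samples.
samples = [[0.0], [1.0], [2.0], [10.0], [5.0]]
chosen = kennard_stone(samples, n_select=3)
```

The rule spreads the chosen samples evenly over the measured space, which matters for small datasets such as the 33 signals described above.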

20.
Using a series of thirteen organic materials that includes novel high-nitrogen energetic materials, conventional organic military explosives, and benign organic materials, we have demonstrated the importance of variable selection for maximizing residue discrimination with partial least squares discriminant analysis (PLS-DA). We built several PLS-DA models using different variable sets based on laser-induced breakdown spectroscopy (LIBS) spectra of the organic residues on an aluminum substrate under an argon atmosphere. The model classification results for each sample are presented, and the influence of the variables on these results is discussed. We found that using the whole spectra as the data input for the PLS-DA model gave the best results. However, variables due to the surrounding atmosphere and the substrate contribute to discrimination when the whole spectra are used, indicating that this may not be the most robust model. Further iterative testing with additional validation data sets is necessary to determine the most robust model.
