Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
The calibration performance of partial least squares regression for one response (PLS1) can be improved by eliminating uninformative variables. Many variable-reduction methods are based on so-called predictor-variable properties or predictive properties, which are functions of various PLS-model parameters, and which may change during the steps of the variable-reduction process. Recently, a new predictive-property-ranked variable reduction method with final complexity adapted models, denoted as PPRVR-FCAM or simply FCAM, was introduced. It is a backward variable elimination method applied to the predictive-property-ranked variables. The variable number is first reduced, with constant PLS1 model complexity A, until A variables remain, followed by a further decrease in PLS complexity, allowing the final selection of small numbers of variables.

2.
3.
The partial least-squares (PLS) algorithm has become popular for explorative multivariate data analysis and for multivariate calibration. The same PLS algorithm can also be used for confirmatory data analysis. The discussion is limited to analysis of a single response variable. A close correspondence of PLS1 regression to classical analysis of variance (ANOVA) is demonstrated. The design of an experiment is described in terms of discrete design variables for main effects and simple interactions (dummy variables). These are used as regressors X = (x1, x2, …) for modelling the response variable of the experiment, y. As in conventional use of PLS1 regression, the algorithm gives a concentrated model or diagram of the most important, y-relevant variability types in the X-data. In the present case, this gives the combination of design variables that models the variations in y. A simple plot of the resulting factor loadings immediately reveals the important design variables. Statistical tests and confidence regions in the PLS solution give additional safeguards against interpretation of spurious effects. The method is applied to two data sets. One concerns assessment of personal preference for blackcurrant juice, studied in a 2^5 factorial experiment; these data are also studied with missing values and as fractional factorials. The other concerns spectrophotometric absorbance-based colour assessments of pigment in strawberry jam in a 3-factor design with 2, 2 and 3 levels in the respective factors.

4.
A new cut-off criterion has been proposed for the selection of uninformative variables prior to chemometric partial least squares (PLS) modelling. After variable elimination, PLS regressions were made and assessed by comparing the results with those obtained by PLS models based on the full spectral range. To assess the prediction capabilities, uninformative variable elimination (UVE)-PLS and PLS were applied to diffuse reflectance near-infrared spectra of heroin samples. The application of the proposed new cut-off criterion, based on Student's t-distribution, gave PLS models with predictive capabilities similar to those obtained using the original criterion based on a quantile value. However, the repeatability of the number of selected variables was improved significantly.
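
The original quantile-based cut-off mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual UVE setup: artificial noise variables are appended to the spectra, the PLS regression coefficients are re-estimated over several resampling runs, and each variable's stability is taken as the mean of its coefficient divided by its standard deviation. The function names, the `quantile` default and the layout of `coef_runs` (real variables first, noise variables last) are assumptions made for illustration. The PLS fitting itself is not shown, and the new t-distribution criterion is not reproduced because the abstract does not give its details.

```python
import statistics


def stability(coef_runs):
    """Return mean/stdev of each variable's coefficient across resampling runs."""
    n_vars = len(coef_runs[0])
    values = []
    for j in range(n_vars):
        column = [run[j] for run in coef_runs]
        values.append(statistics.mean(column) / statistics.stdev(column))
    return values


def uve_select(coef_runs, n_real, quantile=0.99):
    """Keep real variables whose |stability| exceeds a quantile of the noise |stability|.

    coef_runs: list of coefficient vectors, one per resampling run; the first
    n_real entries belong to real variables, the rest to added noise variables.
    """
    c = stability(coef_runs)
    noise = sorted(abs(v) for v in c[n_real:])
    # Index of the requested quantile, clamped to the last noise value.
    idx = min(len(noise) - 1, int(quantile * len(noise)))
    cutoff = noise[idx]
    selected = [j for j in range(n_real) if abs(c[j]) > cutoff]
    return selected, cutoff
```

With a small number of noise variables the cut-off is effectively the largest noise stability, which is one reason the number of retained variables can vary from run to run.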

5.
Analytical Letters, 2012, 45(6): 1227-1251
Abstract

In order to reduce data nonlinearity and overfitting with the multivariate calibration model y=Xb, a modified Tikhonov regularization (TR) algorithm is evaluated for selecting key variables from an X augmented with extra columns that contain the original measured variables (x_ij) as squared terms (x_ij^2) and other orders. The TR approach simultaneously develops the multivariate calibration model. The new generalized pair‐correlation method (GPCM) is also studied for variable selection followed by partial least squares (PLS) for multivariate calibration. Results from synthetic spectral data are compared when using the modified TR approach, GPCM, and PLS without variable selection. The GPCM usually performs slightly better than the TR approach for tabulated bias and variance measures and, in some cases, at a sacrifice to parsimony. The method of PLS without variable selection performs the worst. Using synthetic spectral data sets made it possible to study how the methods work. Thus, results from this study will aid investigators of real spectral data sets exhibiting nonlinear behavior.
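
The idea of augmenting X with squared columns before fitting y=Xb can be sketched as follows. This is a minimal illustration that assumes the standard Tikhonov (ridge) solution b = (X'X + λI)^(-1) X'y, not the modified TR algorithm evaluated in the paper, whose details the abstract does not give. The function names and the small linear solver are hypothetical helpers written for this example.

```python
def augment_with_squares(X):
    """Append the square of every measured variable as extra columns."""
    return [row + [v * v for v in row] for row in X]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def tikhonov_fit(X, y, lam):
    """Standard Tikhonov (ridge) estimate: solve (X'X + lam*I) b = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


# Toy data with a quadratic response: y = 2*x + 0.5*x^2
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.5, 6.0, 10.5, 16.0]
b = tikhonov_fit(augment_with_squares(X), y, lam=0.0)
```

With `lam=0.0` the fit reduces to ordinary least squares on the augmented matrix; increasing `lam` shrinks the coefficient vector, which is the mechanism that limits overfitting.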

6.
Glycerol monolaurate (GML) products contain many impurities, such as lauric acid and glycerol. The GML content is an important quality indicator for GML production. A hybrid variable selection algorithm, which is a combination of wavelet transform (WT) technology and the modified uninformative variable elimination (MUVE) method, was proposed to extract useful information from Fourier transform infrared (FT-IR) transmission spectroscopy for the determination of GML content. FT-IR spectra data were compressed by WT first; the irrelevant variables in the compressed wavelet coefficients were eliminated by MUVE. In the MUVE process, a simulated annealing (SA) algorithm was employed to search for the optimal cutoff threshold. After the WT-MUVE process, variables for the calibration model were reduced from 7366 to 163. Finally, the retained variables were employed as inputs of a partial least squares (PLS) model to build the calibration model. For the prediction set, a correlation coefficient (r) of 0.9910 and a root mean square error of prediction (RMSEP) of 4.8617 were obtained. The prediction result was better than that of the PLS model with full-spectra data. It was indicated that the proposed WT-MUVE method could not only make the prediction more accurate, but also make the calibration model more parsimonious. Furthermore, the reconstructed spectra represented the projection of the selected wavelet coefficients into the original domain, affording the chemical interpretation of the predicted results. It is concluded that the FT-IR transmission spectroscopy technique with the proposed method is promising for the fast detection of GML content.

7.
The selection abilities of the two well‐known techniques of variable selection, synergy interval‐partial least‐squares (SiPLS) and genetic algorithm‐partial least‐squares (GA‐PLS), have been examined and compared. By using different simulated and real (corn and metabolite) datasets, keeping in view the spectral overlapping of the components, the influence of the selection of either intervals of variables or individual variables on the prediction performances was examined. In the simulated datasets, with decrease in the overlapping of the spectra of components and cases with components of narrow bands, GA‐PLS results were better. In contrast, the performance of SiPLS was higher for data of intermediate overlapping. For mixtures of high overlapping analytes, GA‐PLS showed slightly better performance. However, significant differences between the results of the two selection methods were not observed in most of the cases. Although SiPLS resulted in slightly better performance of prediction in the case of corn dataset except for the prediction of the moisture content, the improvement obtained by SiPLS compared with that by GA‐PLS was not significant. For real data of less overlapped components (metabolite dataset), GA‐PLS that tends to select far fewer variables did not give significantly better root mean square error of cross‐validation (RMSECV), cross‐validated R2 (Q2), and root mean square error of prediction (RMSEP) compared with SiPLS. Irrespective of the type of dataset, GA‐PLS resulted in models with fewer latent variables (LVs). When comparing the computational time of the methods, GA‐PLS is considered superior to SiPLS. Copyright © 2010 John Wiley & Sons, Ltd.

8.
This study presents an analytical method for determining interfacial tension and relative density in insulating oils using near infrared spectrometry (NIR). Five different strategies of regression were evaluated: partial least squares (PLS) with significant regression coefficients selected by jack-knife algorithm; interval PLS (iPLS); multiple linear regression (MLR) with variable selection by genetic algorithm (MLR/GA), successive projections algorithm (MLR/SPA) and stepwise strategy (SR/MLR). The overall results point to MLR/SPA as the best modeling strategy. The strategy is simpler and uses fewer spectral variables.

9.
This paper proposes an analytical method for simultaneous near-infrared (NIR) spectrometric determination of α-linolenic and linoleic acid in eight types of edible vegetable oils and their blends. For this purpose, a combination of spectral wavelength selection by wavelet transform (WT) and elimination of uninformative variables (UVE) was proposed to obtain simple partial least squares (PLS) models based on a small subset of wavelengths. WT was first used to compress the full NIR spectra, which contain 1413 redundant variables, and 42 wavelet approximation coefficients were obtained. UVE was then carried out to further select the informative variables. Finally, 27 and 19 wavelet approximation coefficients were selected by UVE for α-linolenic and linoleic acid, respectively. The selected variables were used as inputs of the PLS model. Because the original spectra were compressed and irrelevant variables were eliminated, a more parsimonious and efficient model based on WT-UVE was obtained compared with the conventional PLS model with full spectra data. The coefficient of determination (r^2) and root mean square error of prediction (RMSEP) for the prediction set were 0.9345 and 0.0123 for α-linolenic acid prediction by the WT-UVE-PLS model. The r^2 and RMSEP were 0.9054 and 0.0437 for linoleic acid prediction. The good performance showed a potential application of WT-UVE for selecting effective NIR variables. WT-UVE can both speed up the calculation and improve the predicted results. The results indicated that it was feasible to rapidly determine α-linolenic acid and linoleic acid content in edible oils using NIR spectroscopy.
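
For readers unfamiliar with the two figures of merit quoted above, the sketch below shows one common way to compute them from reference and predicted values of a prediction set. The abstract does not say which definition of r^2 was used; this example assumes 1 − SS_res/SS_tot, whereas some NIR studies report the squared Pearson correlation instead. The function names and the sample values are illustrative only.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over the prediction set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up reference and predicted values, not data from the paper.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
error = rmsep(reference, predicted)
fit = r_squared(reference, predicted)
```

RMSEP is expressed in the units of the measured property, so values from different analytes (or from different papers) are only comparable when the concentration scales match.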

10.
This article describes the applicability of multivariate projection techniques, such as principal-component analysis (PCA) and partial least-squares (PLS) projections to latent structures, to the large-volume high-density data structures obtained within genomics, proteomics, and metabonomics. PCA and PLS, and their extensions, derive their usefulness from their ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. Three examples are used as illustrations: the first example is a genomics data set and involves modeling of microarray data of cell cycle-regulated genes in the microorganism Saccharomyces cerevisiae. The second example contains NMR-metabonomics data, measured on urine samples of male rats treated with either of the drugs chloroquine or amiodarone. The third and last data set describes sequence-function classification studies in a set of G-protein-coupled receptors using hierarchical PCA.

11.
In multivariate data analysis such as principal components analysis (PCA) and projections to latent structures (PLS), it is essential that the training set systems (objects) are selected to provide data with substantial information for model parametrization, and to represent properly any future situations where the multivariate model is used for predictions. In the framework of multivariate projections (PCA, SIMCA and PLS), elementary concepts of statistical design (fractional factorials and composite designs) can be used with the latent variables (PC or PLS scores) as design variables. The plan of action thus becomes: (1) problem formulation (specify aim and model, make a conceptual division of the investigated system into subsystems); (2) collection of multivariate data for each type of subsystems; (3) estimation of the practical dimensionality of the data for each type of subsystems by PC or PLS analysis; (4) use of the PC or PLS scores (t) as design variables in the combination of subsystems to systems in the training set; (5) measurement of responses (Y); (6) analysis of data by PCA or PLS; (7) interpretation of results with possible feedback to steps 1, 2 or 3. The procedures are illustrated by two problems: a structure/activity relationship for a family of peptides, and optimization of an organic synthesis with respect to system variables (solvent, substrate, co-reactant) and process variables (temperature, reactant concentrations).

12.
Recently we proposed a new variable selection algorithm for classification problems, based on the concept of clustering of variables (CLoVA). Here the same concept is applied to a regression problem, and the results are compared with those of conventional variable selection strategies for PLS. The basic idea behind clustering of variables is that the instrument channels are grouped into different clusters by clustering algorithms. The spectral data of each cluster are then subjected to PLS regression. Different real data sets (Cargill corn, Biscuit dough, ACE QSAR, Soy, and Tablet) were used to evaluate the influence of the clustering of variables on the prediction performance of PLS. In almost all cases, the statistical parameters, especially the prediction error, show the superiority of CLoVA-PLS over the other variable selection strategies. Finally, synergy clustering of variables (sCLoVA-PLS), which uses a combination of clusters, is proposed as an efficient modification of the CLoVA algorithm. The statistical parameters obtained indicate that variable clustering can separate the useful part of the data from the redundant ones, so that a stable model can be built on the informative clusters.

13.
Some Mallotus species are used in traditional medicine in Vietnam and China. Some also show interesting activities, such as antioxidant and cytotoxic ones. Combining fingerprint technology with data-handling techniques makes it possible to indicate the peaks potentially responsible for given activities. This study aims to indicate, from chromatographic fingerprints, the peaks potentially responsible for the antioxidant activity of several Mallotus species. Relevant information was extracted using linear multivariate calibration techniques, both before and after alignment of the fingerprints with correlation optimized warping (COW). Of the studied techniques, Stepwise Multiple Linear Regression is least recommended as it made an inadequate variable selection. Principal Component Regression theoretically can take largely varying variables uncorrelated to the antioxidant activity into account. However, in practice, in the actual case study this problem was limited. These problems in principle do not occur using Partial Least Squares (PLS) models. Of the tested PLS methods, Orthogonal Projections to Latent Structures was preferred because of its simplicity, reproducibility, reduced model complexity and improved interpretability of the regression coefficients, yielding a clearer view of the individual contribution of the compounds. Furthermore, reducing analysis times from 60 min to 35 and 22.5 min resulted in the same main compounds being indicated as responsible for the antioxidant activity. Models built after alignment by COW did not result in additional information.

14.
Different strategies for wavelength selection for partial least squares (PLS) calibration models have been proposed. In this article, Kohonen artificial neural networks (K-ANN) are used to select optimal sets of wavelengths for PLS calibration of mixtures with strong overlapping. This kind of variable selection appears simple and very effective due to the well-known high correlation of spectroscopic data; a measure of the multivariate correlation of the different wavelength subsets is also given. This strategy has been applied to the resolution of mixtures of phenol, o-cresol, m-cresol and p-cresol by spectrofluorimetry. The number of samples needed to obtain the calibration matrix is also reduced with respect to the number necessary when the full spectrum is used, and the predictive ability of the PLS method is improved.

15.
The characteristic shifts ΔνOH of ketone-phenol associations have been measured for 40 benzophenones of the Me2, XC6H3COC6H4Y (Me2, XBY) type in which X and Y are variable substituents. A correlation shows that the CO has a greater sensitivity to the effect of Y for the Me2 BY population. This result is attributed to the destabilization arising from the ortho-methylated aromatic ring thereby rendering the carbonyl group more sensitive to the substituents carried by the other. For two populations of compounds where destabilization by torsion is constant, a second homogeneous correlation allows us to measure the rôle played by the molecular torsion on the attenuation of substituent effects. When the molecule is partially deconjugated and the substituent effect decreased, a third and final correlation shows that these parameters, which have opposite effects, balance each other. A comparison of these results with the pKBH+ of the benzophenones indicates that in a state nearer the fundamental state, the influence of the molecular geometry on the transmission of the substituent effect is more marked than in a more perturbed state where the phenomenon is masked by electronic interactions between the substituents X and Y.

16.
Variable (wavelength or feature) selection techniques have become a critical step for the analysis of datasets with a high number of variables and relatively few samples. In this study, a novel variable selection strategy, variable combination population analysis (VCPA), was proposed. This strategy consists of two crucial procedures. First, the exponentially decreasing function (EDF), which is the simple and effective principle of ‘survival of the fittest’ from Darwin’s natural evolution theory, is employed to determine the number of variables to keep and continuously shrink the variable space. Second, in each EDF run, a binary matrix sampling (BMS) strategy that gives each variable the same chance to be selected and generates different variable combinations is used to produce a population of subsets to construct a population of sub-models. Then, model population analysis (MPA) is employed to find the variable subsets with the lower root mean square error of cross validation (RMSECV). The frequency of each variable appearing in the best 10% of sub-models is computed. The higher the frequency is, the more important the variable is. The performance of the proposed procedure was investigated using three real NIR datasets. The results indicate that VCPA is a good variable selection strategy when compared with four high performing variable selection methods: genetic algorithm–partial least squares (GA–PLS), Monte Carlo uninformative variable elimination by PLS (MC-UVE-PLS), competitive adaptive reweighted sampling (CARS) and iteratively retains informative variables (IRIV). The MATLAB source code of VCPA is available for academic research on the website: http://www.mathworks.com/matlabcentral/fileexchange/authors/498750.
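
The sampling and counting steps described in this abstract can be sketched in a few lines. The example assumes that BMS builds a binary matrix in which every column holds the same number of ones, shuffled independently, so that each variable is included in the same number of sub-models. The RMSECV of each sub-model is replaced by a stand-in scoring function (`toy_rmsecv`), since fitting and cross-validating PLS models is outside the scope of the sketch. All names and parameter values are hypothetical, and the authors' MATLAB code may differ in its details.

```python
import random


def binary_matrix_sampling(n_models, n_vars, ratio, rng):
    """Return an n_models x n_vars 0/1 matrix; each column has the same number of ones."""
    n_ones = round(ratio * n_models)
    columns = []
    for _ in range(n_vars):
        col = [1] * n_ones + [0] * (n_models - n_ones)
        rng.shuffle(col)  # every variable gets the same chance of being sampled
        columns.append(col)
    # Transpose so that each row describes one sub-model (one variable subset).
    return [[columns[j][i] for j in range(n_vars)] for i in range(n_models)]


def selection_frequency(matrix, scores, top_fraction=0.10):
    """Frequency of each variable among the best-scoring (lowest-error) sub-models."""
    n_best = max(1, round(len(matrix) * top_fraction))
    order = sorted(range(len(matrix)), key=lambda i: scores[i])
    best = order[:n_best]
    n_vars = len(matrix[0])
    return [sum(matrix[i][j] for i in best) / n_best for j in range(n_vars)]


def toy_rmsecv(row):
    """Stand-in for RMSECV: variables 0 and 1 are pretended to be informative."""
    return 1.0 - 0.4 * row[0] - 0.4 * row[1] + 0.01 * sum(row)


rng = random.Random(0)
subsets = binary_matrix_sampling(n_models=200, n_vars=6, ratio=0.5, rng=rng)
errors = [toy_rmsecv(row) for row in subsets]
frequency = selection_frequency(subsets, errors)
```

In the toy run the two variables that lower the stand-in error dominate the best 10% of sub-models, which mirrors the rule stated in the abstract: the higher the frequency, the more important the variable.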

17.

When X and Y are multivariate, the two-block partial least squares (PLS) method is often used. In this paper, we outline an extension addressing a special case of the three-block (X/Y/Z) problem, where Z sits "under" Y. We have called this approach three-block bi-focal PLS (3BIF-PLS). It views the X/Y relationship as the dominant problem, and seeks to use the additional information in Z in order to improve the interpretation of the Y-part of the X/Y association. Two data sets are used to illustrate 3BIF-PLS. Example I relates to single point mutants of haloalkane dehalogenase from Sphingomonas paucimobilis UT26 and their ability to transform halogenated hydrocarbons, some of which are found as organic pollutants in soil. Example II deals with soil remediation capability of bacteria. Whole bacterial communities are monitored over time using "DNA-fingerprinting" technology to see how pollution affects population composition. Since the data sets are large, hierarchical multivariate modelling is invoked to compress data prior to 3BIF-PLS analysis. It is concluded that the 3BIF-PLS approach works well. The paper contains a discussion of pros and cons of the method, and hints at further developmental opportunities.

18.
19.
Ni Xin, Qinghua Meng, Yizhen Li, Yuzhu Hu. Chinese Journal of Chemistry (《中国化学》), 2011, 29(11): 2533-2540
This paper indicates the possibility to use near infrared (NIR) spectral similarity as a rapid method to estimate the quality of Flos Lonicerae. Variable selection together with modelling techniques is utilized to select representative variables that are used to calculate the similarity. NIR is used to build calibration models to predict the bacteriostatic activity of Flos Lonicerae. For the determination of the bacteriostatic activity, the in vitro experiment is used. Models are built for the Gram‐positive bacteria and also for the Gram‐negative bacteria. A genetic algorithm combined with partial least squares regression (GA‐PLS) is used to perform the calibration. The results of GA‐PLS models are compared to interval partial least squares (iPLS) models, full‐spectrum PLS and full‐spectrum principal component regression (PCR) models. Then, the variables in the two GA‐PLS models are combined and then used to calculate the NIR spectral similarity of samples. The similarity based on the characteristic variables and full spectrum is used for evaluating the fingerprints of Flos Lonicerae, respectively. The results show that the combination of variable selection method, modelling techniques and similarity analysis might be a powerful tool for quality control of traditional Chinese medicine (TCM).

20.