Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Processing plants can produce large amounts of data that process engineers use for analysis, monitoring, or control. Principal component analysis (PCA) is well suited to analyzing large amounts of (possibly) correlated data and to reducing the dimensionality of the variable space. Failing online sensors, lost historical data, or missing experiments can lead to data sets with missing values, for which the current methods for obtaining the PCA model parameters may give questionable results due to the properties of the estimated parameters. This paper proposes a method based on nonlinear programming (NLP) techniques to obtain the parameters of PCA models in the presence of incomplete data sets. We show the relationship that exists between the nonlinear iterative partial least squares (NIPALS) algorithm and the optimality conditions of the squared residuals minimization problem, and how this leads to the modified NIPALS used for the missing value problem. Moreover, we compare the current NIPALS‐based methods with the proposed NLP using a simulation example and an industrial case study, and show how the latter is better suited when there are large amounts of missing values. The solutions obtained with the NLP and the iterative algorithm (IA) are very similar. However, when using the NLP‐based method, the loadings and scores are guaranteed to be orthogonal, and the scores will have zero mean. The latter is emphasized in the industrial case study. Also, with the industrial data used here we are able to show that the models obtained with the NLP were easier to interpret. Moreover, when using the NLP, far fewer iterations were required to obtain them. Copyright © 2010 John Wiley & Sons, Ltd.
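The iterative NIPALS-style treatment of missing values that this abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' code or their NLP formulation: the function name `nipals_pca_missing`, the NaN encoding of missing entries, the starting vector, and the convergence test are all assumptions made for the example. Each regression step simply skips the unobserved entries.

```python
import numpy as np

def nipals_pca_missing(X, n_components=2, max_iter=500, tol=1e-10):
    """Hypothetical helper: NIPALS PCA that skips NaN entries in each regression step."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)                        # True where a value was observed
    M = mask.astype(float)                     # 1.0 for observed, 0.0 for missing
    col_means = np.nanmean(X, axis=0)          # means over observed values only
    E = np.where(mask, X - col_means, 0.0)     # centered data, zeros at the gaps
    n, m = X.shape
    T = np.zeros((n, n_components))            # scores
    P = np.zeros((m, n_components))            # loadings
    for a in range(n_components):
        # assumed starting vector: the column with the largest sum of squares
        t = E[:, np.argmax((E ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            # loading: regress each column of E on t, observed rows only
            p = (E.T @ t) / np.maximum(M.T @ (t ** 2), 1e-300)
            p /= np.linalg.norm(p)
            # score: regress each row of E on p, observed columns only
            t_new = (E @ p) / np.maximum(M @ (p ** 2), 1e-300)
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        E = np.where(mask, E - np.outer(t, p), 0.0)   # deflate observed entries only
    return T, P, col_means
```

With complete data this reduces to ordinary NIPALS PCA. With missing entries, nothing in the iteration forces the loadings or scores to stay orthogonal or the scores to have zero mean, which are the properties the abstract attributes to the NLP-based method.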

2.
The insights and conclusions of this paper motivate efficient and numerically robust ‘new’ variants of algorithms for solving the single response partial least squares regression (PLS1) problem. Prototype MATLAB code for these variants is included in the Appendix. The analysis of and conclusions regarding PLS1 modelling are based on a rich and nontrivial application of numerous key concepts from elementary linear algebra. The investigation starts with a simple analysis of the nonlinear iterative partial least squares (NIPALS) PLS1 algorithm variant computing orthonormal scores and weights. A rigorous interpretation of the squared P ‐loadings as the variable‐wise explained sum of squares is presented. We show that the orthonormal row‐subspace basis of W ‐weights can be found from a recurrence equation. Consequently, the NIPALS deflation steps of the centered predictor matrix can be replaced by a corresponding sequence of Gram–Schmidt steps that compute the orthonormal column‐subspace basis of T ‐scores from the associated non‐orthogonal scores. The transitions between the non‐orthogonal and orthonormal scores and weights (illustrated by an easy‐to‐grasp commutative diagram), respectively, are both given by QR factorizations of the non‐orthogonal matrices. The properties of singular value decomposition combined with the mappings between the alternative representations of the PLS1 ‘truncated’ X data (including P t W ) are taken to justify an invariance principle to distinguish between the PLS1 truncation alternatives. The fundamental orthogonal truncation of PLS1 is illustrated by a Lanczos bidiagonalization type of algorithm where the predictor matrix deflation is required to be different from the standard NIPALS deflation. A mathematical argument concluding the PLS1 inconsistency debate (published in 2009 in this journal) is also presented. Copyright © 2014 John Wiley & Sons, Ltd.
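For orientation, the textbook NIPALS PLS1 loop with deflation of the centered predictor matrix, which this abstract takes as its starting point, might look like the sketch below. It is written in Python rather than the paper's MATLAB, it is the standard variant (orthogonal but not normalized scores), and it does not reproduce the paper's new Gram–Schmidt or bidiagonalization variants. The function name `pls1_nipals` and its return values are assumptions for illustration.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Hypothetical helper: standard NIPALS PLS1 with X- and y-deflation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean              # centered predictors and response
    m = X.shape[1]
    W = np.zeros((m, n_components))            # orthonormal weights
    P = np.zeros((m, n_components))            # X-loadings
    q = np.zeros(n_components)                 # y-loadings
    for a in range(n_components):
        w = E.T @ f                            # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = E @ w                              # score vector
        tt = t @ t
        p = E.T @ t / tt
        q[a] = f @ t / tt
        E = E - np.outer(t, p)                 # NIPALS deflation of the predictors
        f = f - q[a] * t                       # deflation of the response
        W[:, a], P[:, a] = w, p
    b = W @ np.linalg.solve(P.T @ W, q)        # regression vector for centered X
    b0 = y_mean - x_mean @ b                   # intercept for uncentered X
    return b, b0, W, P
```

Predictions for new rows `X_new` would be `X_new @ b + b0`. When the number of components equals the rank of the centered predictor matrix, the regression vector coincides with the ordinary least squares solution, which is a convenient sanity check for any PLS1 implementation.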

3.
An evaluation of computational performance and precision regarding the cross‐validation error of five partial least squares (PLS) algorithms (NIPALS, modified NIPALS, Kernel, SIMPLS and bidiagonal PLS), available and widely used in the literature, is presented. When dealing with large data sets, computational time is an important issue, mainly in cross‐validation and variable selection. In the present paper, the PLS algorithms are compared in terms of the run time and the relative error in the precision obtained when performing leave‐one‐out cross‐validation using simulated and real data sets. The simulated data sets were investigated through factorial and Latin square experimental designs. The evaluations were based on the number of rows, the number of columns and the number of latent variables. With respect to their performance, the results for both simulated and real data sets have shown that the differences in run time are statistically significant. PLS bidiagonal is the fastest algorithm, followed by Kernel and SIMPLS. Regarding cross‐validation error, all algorithms showed similar results. However, in some situations, for example when many latent variables were involved, discrepancies were observed, especially with respect to SIMPLS. Copyright © 2010 John Wiley & Sons, Ltd.

4.
Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non‐measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to its maximum likelihood version. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression‐based methods, with other state‐of‐the‐art methods. Copyright © 2016 John Wiley & Sons, Ltd.

5.
This paper presents a modified version of the NIPALS algorithm for PLS regression with a single response variable. This version, denoted CF‐PLS, provides significant advantages over the standard PLS. First, it strongly reduces the over‐fit of the regression. Second, R2 under the null hypothesis follows a Beta distribution that is a function only of the number of observations, which allows the use of a probabilistic framework to test the validity of a component. Third, the models generated with CF‐PLS have comparable if not better prediction ability than the models fitted with NIPALS. Finally, the scores and loadings of CF‐PLS are directly related to the R2, which makes the model and its interpretation more reliable. Copyright © 2011 John Wiley & Sons, Ltd.

6.
It is well known that the predictions of orthogonal projections to latent structures (OPLS) and partial least squares regression (PLS1) are identical in the single‐response case. The present paper presents an approach to identification of the complete y ‐orthogonal structure by starting from the viewpoint of standard PLS1 regression. Three alternative non‐deflating OPLS algorithms and a modified principal component analysis (PCA)‐driven method (including MATLAB code) are presented. The first algorithm implements a postprocessing routine of the standard PLS1 solution where QR factorization applied to a shifted version of the non‐orthogonal scores is the key to expressing the OPLS solution. The second algorithm finds the OPLS model directly by an iterative procedure. By a rigorous mathematical argument, we explain that orthogonal filtering is a ‘built‐in’ property of the traditional PLS1 regression coefficients. Consequently, the capabilities of OPLS with respect to improving the predictions (also for new samples) compared with PLS1 are nonexistent. The PCA‐driven method is based on the fact that truncating off one dimension from the row subspace of X results in a matrix X orth with y ‐orthogonal columns and a rank of one less than the rank of X . The desired truncation corresponds exactly to the first X deflation step of Martens' non‐orthogonal PLS algorithm. The significant y ‐orthogonal structure of X found by PCA of X orth is split into two fundamental parts: one part that contributes significantly to correcting the first PLS score toward y and one part that does not. The third and final OPLS algorithm presented is a modification of Martens' non‐orthogonal algorithm into an efficient dual PLS1–OPLS algorithm. Copyright © 2014 John Wiley & Sons, Ltd.

7.
The well‐known Martens factorization for PLS1 produces a single y‐related score, with all subsequent scores being y‐unrelated. The X‐explanatory value of these y‐orthogonal scores can be summarized by a simple expression, which is analogous to the ‘P’ loading weights in the orthogonalized NIPALS algorithm. This can be used to rearrange the factorization into entirely y‐related and y‐unrelated parts. Systematic y‐unrelated variation can thus be removed from the X data through a single post hoc calculation following conventional PLS, without any recourse to the orthogonal projections to latent structures (OPLS) algorithm. The work presented is consistent with the development by Ergon (PLS post‐processing by similarity transformation (PLS + ST): a simple alternative to OPLS. J. Chemom. 2005; 19 : 1–4), which shows that conventional PLS and OPLS are equivalent within a similarity transform. Copyright © 2009 John Wiley & Sons, Ltd.

8.
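Cheng Zhong, Zhu Aishi. 《分析化学》 (Chinese Journal of Analytical Chemistry), 2008, 36(6): 788-792
Spectral data typically show broad peaks, pronounced local effects, noise, a large number of variables, and often severe multicollinearity among those variables. To address these problems, a local correction method for spectral data was improved and designed: a piecewise orthogonal signal correction method based on window smoothing. It is combined with partial least squares regression to carry out preprocessing and quantitative analysis of spectral data. The orthogonal components to be filtered out are initialized with the NIPALS algorithm, and orthogonal signal correction is performed wavelength point by wavelength point using neighboring segments. The denoised spectral matrix is then used as the new predictor matrix, and a calibration model between it and the property variable is built by partial least squares regression. Application to near-infrared diffuse reflectance spectra of wheat shows that the method estimates the orthogonal components stably and removes noise effectively; its predictive performance is better than that of other methods, the number of PLS components is reduced, and the model is more parsimonious.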

9.
Traditionally the partial least-squares (PLS) algorithm, commonly used in chemistry for ill-conditioned multivariate linear regression, has been derived (motivated) and presented in terms of data matrices. In this work the PLS algorithm is derived probabilistically in terms of stochastic variables where sample estimates calculated using data matrices are employed at the end. The derivation, which offers a probabilistic motivation to each step of the PLS algorithm, is performed for the general multiresponse case and without reference to any latent variable model of the response variable and also without any so-called "inner relation". On the basis of the derivation, some theoretical issues of the PLS algorithm are briefly considered: the complexity of the original motivation of PLS regression which involves an "inner relation"; the original motivation behind the prediction stage of the PLS algorithm; the relationship between uncorrelated and orthogonal latent variables; the limited possibilities to make natural interpretations of the latent variables extracted.

10.
11.
Inductively Coupled Plasma Atomic Emission Spectroscopy measurements of six trace elements were performed on the scalp hair of 155 donors, 73 of whom had been diagnosed with Hepatitis C and 82 of whom were controls. Principal Components Analysis (PCA) was employed to visualise the separation between groups and show the relationship between the elements and the diseased state. Pattern recognition methods for classification involving Quadratic Discriminant Analysis and Partial Least Squares Discriminant Analysis (PLS-DA) were applied to the data. The number of significant components for both PCA and PLS was determined using the bootstrap. The stability of training set models was determined by repeatedly splitting the data into training and test sets and employing visualisation for two-component models: the percent classification ability (CC), predictive ability (PA) and model stability (MS) were computed for test and training sets.

12.
Kernel partial least squares (KPLS), a nonlinear extension of linear PLS in which training samples are transformed into a feature space via a nonlinear mapping, has become a popular technique for regression and classification of complex data sets. The PLS algorithm can then be carried out in the feature space. In the present study, we attempt to develop a novel tree KPLS (TKPLS) classification algorithm by constructing an informative kernel on the basis of decision tree ensembles. The constructed tree kernel can effectively discover the similarities of samples and select informative features by variable importance ranking in the process of building the kernel. At the same time, TKPLS can handle nonlinear relationships in structure–activity relationship data through such a kernel. Finally, three data sets related to different categorical bioactivities of compounds are used to evaluate the performance of TKPLS. The results show that the TKPLS algorithm can be regarded as an alternative and promising classification technique. Copyright © 2013 John Wiley & Sons, Ltd.

13.
The partial least-squares (PLS) algorithm has become popular for explorative multivariate data analysis and for multivariate calibration. The same PLS algorithm can also be used for confirmatory data analysis. The discussion is limited to analysis of a single response variable. A close correspondence of PLS1 regression to classical analysis of variance (ANOVA) is demonstrated. The design of an experiment is described in terms of discrete design variables for main effects and simple interactions (dummy variables). These are used as regressors X = (x1, x2,…,) for modelling the response variable of the experiment, y. As in conventional use of PLS1 regression, the algorithm gives a concentrated model or diagram of the most important, y-relevant variability types in the X-data. In the present case, this gives the combination of design variables that models the variations in y. A simple plot of the resulting factor loadings immediately reveals the important design variables. Statistical tests and confidence regions in the PLS solution give additional safeguards against interpretation of spurious effects. The method is applied to two data sets. One concerns assessment of personal preference for blackcurrant juice, studied in a 2^5 factorial experiment; these data are also studied with missing values and as fractional factorials. The other concerns spectrophotometric absorbance-based colour assessments of pigment in strawberry jam in a 3-factor design with 2, 2 and 3 levels in the respective factors.

14.
The structure-activity relationship of C-10 substituted artemisinin (QHS) derivatives used as antimalarials was studied with the rough sets (RS) method. An RS process is a concise nonlinear process, and it has broad application prospects in the data mining of nonlinear life processes. In this work, the parameters of the C-10 substituted QHS derivatives were first computed with the quantum chemistry method, and the information table was constructed from the parameters (condition attributes) and biological activity (decision attributes). Based on the analysis of rough set theory, the core and reduction of the attribute sets were obtained. Then the decision rules were extracted and the structure-activity relationship was analyzed. As a nonlinear system, RS theory can extract the special relation in the database. It has the advantage of being nonlinear over multiple linear regression (MLR), principal component analysis (PCA), partial least squares (PLS), etc., and the advantage of obtaining results with unambiguous physical meanings over artificial neural networks (ANNs), etc. The result obtained in this study is instructive for the study of pharmacodynamics, the resistance mechanism of QHS, and the development of QHS derivatives.

15.
The performance of Partial Least Squares regression (PLS) in predicting the output with multivariate cross‐ and autocorrelated data is studied. With many correlated predictors of varying importance, PLS does not always predict well, and we propose a modified algorithm, Partitioned Partial Least Squares (PPLS). In PPLS the predictors are partitioned into smaller subgroups and the important subgroups with high prediction power are identified. Finally, regular PLS analysis using only those subgroups is performed. The proposed PPLS algorithm is used in the analysis of data from a real pharmaceutical batch fermentation process for which the process variables follow certain profiles during a specific fermentation period. We observed that PPLS leads to a more accurate prediction of the yield of the fermentation process and an easier interpretation, since fewer predictors are used in the final PLS prediction. In the application, important issues such as alignment of the profiles from one batch to another and standardization of the predictors are also addressed. For instance, in PPLS noise magnification due to standardization does not seem to create problems as it might in regular PLS. Finally, PPLS is compared to several recently proposed functional PLS and PCR methods and a genetic algorithm for variable selection. More specifically, for a couple of publicly available data sets with near infrared spectra it is shown that overall PPLS has lower cross‐validated error than PLS, PCR and the functional modifications thereof, and is similar in performance to a more complex genetic algorithm. Copyright © 2011 John Wiley & Sons, Ltd.

16.
IR and NIR spectra were correlated to Hildebrand and Hansen solubility parameters through use of multivariate data analysis. PLS‐1 models were developed and used to predict solubility parameters for solvents, crude oils, and SARA fractions. PLS regression showed potential for good correlation of the solubility parameters with IR and NIR spectra. Principal component analysis of IR spectra showed that crude oils are grouped according to their relative contents of heavy components such as asphaltenes. PCA of IR spectra for SARA fractions resulted in obvious groupings of the respective fractions. Prediction of solubility parameters from IR spectra of polymers, crude oils, and SARA fractions gave values that are comparable to literature values. This study indicates that correlation of solubility parameters with IR and NIR spectra is possible. In turn, it may be possible to develop models that can predict the polarities of crude oils and crude oil fractions such as resins and asphaltenes.

17.
We describe a method of performing trilinear analysis on large data sets using a modification of the PARAFAC‐ALS algorithm. Our method iteratively decomposes the data matrix into a core matrix and three loading matrices based on the Tucker1 model. The algorithm is particularly useful for data sets that are too large to upload into a computer's main memory. While the performance advantage in utilizing our algorithm is dependent on the number of data elements and dimensions of the data array, we have seen a significant performance improvement over operating PARAFAC‐ALS on the full data set. In one case of data comprising hyperspectral images from a confocal microscope, our method of analysis was approximately 60 times faster than operating on the full data set, while obtaining essentially equivalent results. Copyright © 2008 by John Wiley & Sons, Ltd.

18.
Microchemical Journal, 2008, 88(2): 119-127
An optimized model of multivariate classification for the monitoring of eighteen spring waters in the land of Serra St. Bruno, Calabria, Italy, has been developed. Thirty analytical parameters for each water source were investigated and reduced to eight by means of Principal Component Analysis (PCA). Water springs were grouped in five distinct classes by cluster techniques (CA) and a model for their classification was built by a Partial Least Squares–Discriminant Analysis (PLS–DA) procedure. The model was optimized and validated and then applied to new data matrices, containing the analytical parameters measured on the same sources during the successive years. This model proved able to detect deviations of the global analytical characteristics, by pointing out over time a different distribution of the samples within the classes. The variation of nitrate concentration was shown to be mainly responsible for the observed class shifts. The shifting sources were localized in areas used as sowable land, and the high variability of nitrate content was ascribed to the practice of crop rotation, involving a varying use of nitrogenous chemical fertilizers.

19.
Nine PLS1 algorithms were evaluated, primarily in terms of their numerical stability, and secondarily their speed. There were six existing algorithms: (a) NIPALS by Wold; (b) the non‐orthogonalized scores algorithm by Martens; (c) Bidiag2 by Golub and Kahan; (d) SIMPLS by de Jong; (e) improved kernel PLS by Dayal; and (f) PLSF by Manne. Three new algorithms were created: (g) direct‐scores PLS1 based on a new recurrent formula for the calculation of basis vectors yielding scores directly from X and y; (h) Krylov PLS1 with its regression vector defined explicitly, using only the original X and y; (i) PLSPLS1 with its regression vector recursively defined from X and the regression vectors of its previous recursions. Data from IR and NIR spectrometers applied to food, agricultural, and pharmaceutical products were used to demonstrate the numerical stability. It was found that three methods (c, f, h) create regression vectors that do not well resemble the corresponding precise PLS1 regression vectors. Because of this, their loading and score vectors were also concluded to be deviating, and their models of X and the corresponding residuals could be shown to be numerically suboptimal in a least squares sense. Methods (a, b, e, g) were the most stable. Two of them (e, g) were not only numerically stable but also much faster than methods (a, b). The fast method (d) and the moderately fast method (i) showed a tendency to become unstable at high numbers of PLS factors. Copyright © 2009 John Wiley & Sons, Ltd.
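One standard way to write a PLS1 regression vector explicitly from X and y alone is as the least squares solution restricted to a Krylov subspace generated by X'X and X'y. The sketch below illustrates that textbook characterization only; the abstract does not give the formula used for method (h), so this should not be read as the authors' algorithm. The function name `krylov_pls1` and the column scaling are assumptions.

```python
import numpy as np

def krylov_pls1(X, y, n_components):
    """Hypothetical helper: PLS1 regression vector via an explicit Krylov basis."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)                    # centered predictors
    yc = y - y.mean()                          # centered response
    s = Xc.T @ yc                              # X'y
    S = Xc.T @ Xc                              # X'X
    K = np.empty((X.shape[1], n_components))   # basis of span{s, Ss, S^2 s, ...}
    v = s.copy()
    for a in range(n_components):
        K[:, a] = v / np.linalg.norm(v)        # rescaling does not change the span
        v = S @ K[:, a]
    # least squares solution constrained to the Krylov subspace
    return K @ np.linalg.solve(K.T @ S @ K, K.T @ s)
```

The Krylov basis becomes nearly collinear as the number of components grows, so an unorthogonalized construction like this one loses accuracy quickly. That is in line with the abstract's finding that the explicit Krylov variant did not reproduce the precise PLS1 regression vectors well.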

20.