Similar Literature
1.
Advances in sensory systems have led to many industrial applications with large amounts of highly correlated data, particularly in chemical and pharmaceutical processes. With these correlated data sets, it becomes important to consider advanced modeling approaches built to deal with correlated inputs in order to understand the underlying sources of variability and how this variability will affect the final quality of the product. Additional to the correlated nature of the data sets, it is also common to find missing elements and noise in these data matrices. Latent variable regression methods such as partial least squares or projection to latent structures (PLS) have gained much attention in industry for their ability to handle ill‐conditioned matrices with missing elements. This feature of the PLS method is accomplished through the nonlinear iterative PLS (NIPALS) algorithm, with a simple modification to consider the missing data. Moreover, in expectation maximization PLS (EM‐PLS), imputed values are provided for missing data elements as initial estimates, conventional PLS is then applied to update these elements, and the process iterates to convergence. This study is the extension of previous work for principal component analysis (PCA), where we introduced nonlinear programming (NLP) as a means to estimate the parameters of the PCA model. Here, we focus on the parameters of a PLS model. As an alternative to modified NIPALS and EM‐PLS, this paper presents an efficient NLP‐based technique to find model parameters for PLS, where the desired properties of the parameters can be explicitly posed as constraints in the optimization problem of the proposed algorithm. We also present a number of simulation studies, where we compare effectiveness of the proposed algorithm with competing algorithms. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   
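A minimal sketch of the missing-data modification of NIPALS mentioned above (not the authors' NLP formulation): every projection is computed over the observed elements only, so missing entries are simply skipped. Python/NumPy, illustrative names; X is assumed centred, NaN marks missing cells, and every row and column is assumed to have at least some measured entries.

    import numpy as np

    def pls1_nipals_missing(X, y, n_comp=2):
        """NIPALS-style PLS1 where NaN entries of X are skipped in every projection."""
        X = np.array(X, dtype=float)
        y = np.array(y, dtype=float)
        obs = (~np.isnan(X)).astype(float)      # 1 where measured, 0 where missing
        X = np.where(np.isnan(X), 0.0, X)       # zeros drop out of the sums below
        T, W, P, q = [], [], [], []
        for _ in range(n_comp):
            w = (X.T @ y) / (obs.T @ (y * y))   # column-wise regression on y, observed rows only
            w /= np.linalg.norm(w)
            t = (X @ w) / (obs @ (w * w))       # row-wise projection, observed columns only
            p = (X.T @ t) / (obs.T @ (t * t))   # loadings from observed elements only
            c = (y @ t) / (t @ t)
            X -= np.outer(t, p) * obs           # deflate the measured positions only
            y = y - c * t
            T.append(t); W.append(w); P.append(p); q.append(c)
        return np.array(T).T, np.array(W).T, np.array(P).T, np.array(q)

With no missing elements this reduces to ordinary PLS1; with missing elements it is the simple "skip the holes" projection strategy that the abstract contrasts with EM-PLS and the NLP formulation.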

2.
Maximum likelihood principal component analysis (MLPCA) was originally proposed to incorporate measurement error variance information in principal component analysis (PCA) models. MLPCA can be used to fit PCA models in the presence of missing data, simply by assigning very large variances to the non‐measured values. An assessment of maximum likelihood missing data imputation is performed in this paper, analysing the algorithm of MLPCA and adapting several methods for PCA model building with missing data to its maximum likelihood version. In this way, known data regression (KDR), KDR with principal component regression (PCR), KDR with partial least squares regression (PLS) and trimmed scores regression (TSR) methods are implemented within the MLPCA method to work as different imputation steps. Six data sets are analysed using several percentages of missing data, comparing the performance of the original algorithm, and its adapted regression‐based methods, with other state‐of‐the‐art methods. Copyright © 2016 John Wiley & Sons, Ltd.  相似文献   
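One way to picture the "very large variance for non-measured values" device is through weighted alternating least squares: each element is weighted by the inverse of its error variance, so cells given a huge variance contribute essentially nothing to the fit. The sketch below (Python/NumPy, illustrative names, not the published MLPCA or TSR code) fits a rank-k bilinear model this way; X is assumed centred.

    import numpy as np

    def weighted_als_pca(X, var, n_comp=2, n_iter=200):
        """Rank-n_comp PCA by alternating weighted least squares.
        var holds elementwise measurement variances; assign e.g. 1e10
        to missing cells so they barely influence the fit."""
        Wt = 1.0 / np.asarray(var, dtype=float)       # elementwise weights
        Xf = np.where(np.isnan(X), 0.0, X)            # placeholder values, weight ~ 0 anyway
        n, m = Xf.shape
        rng = np.random.default_rng(0)
        T = rng.standard_normal((n, n_comp))
        P = rng.standard_normal((m, n_comp))
        for _ in range(n_iter):
            for j in range(m):                        # loadings, one variable at a time
                A = (T * Wt[:, [j]]).T @ T
                P[j] = np.linalg.solve(A, (T * Wt[:, [j]]).T @ Xf[:, j])
            for i in range(n):                        # scores, one sample at a time
                A = (P * Wt[[i]].T).T @ P
                T[i] = np.linalg.solve(A, (P * Wt[[i]].T).T @ Xf[i])
        return T, P                                   # X is approximated by T @ P.T

With equal variances everywhere this reproduces the ordinary PCA subspace; the regression-based imputation variants (KDR, PCR, PLS, TSR) discussed in the abstract replace the simple weighting step with explicit imputation models.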

3.
Principal component analysis (PCA) was used to build and interpret models for the analysis of multicomponent samples. Applied to ultraviolet spectral data, the method enabled the simultaneous determination of four components (vitamins B1, B2 and B6 and nicotinamide), giving reliable results with a simple procedure.

4.
The insight from, and conclusions of, this paper motivate efficient and numerically robust ‘new’ variants of algorithms for solving the single response partial least squares regression (PLS1) problem. Prototype MATLAB code for these variants is included in the Appendix. The analysis of and conclusions regarding PLS1 modelling are based on a rich and nontrivial application of numerous key concepts from elementary linear algebra. The investigation starts with a simple analysis of the nonlinear iterative partial least squares (NIPALS) PLS1 algorithm variant computing orthonormal scores and weights. A rigorous interpretation of the squared P-loadings as the variable-wise explained sum of squares is presented. We show that the orthonormal row-subspace basis of W-weights can be found from a recurrence equation. Consequently, the NIPALS deflation steps of the centered predictor matrix can be replaced by a corresponding sequence of Gram–Schmidt steps that compute the orthonormal column-subspace basis of T-scores from the associated non-orthogonal scores. The transitions between the non-orthogonal and orthonormal scores and weights (illustrated by an easy-to-grasp commutative diagram), respectively, are both given by QR factorizations of the non-orthogonal matrices. The properties of singular value decomposition combined with the mappings between the alternative representations of the PLS1 ‘truncated’ X data (including PᵀW) are taken to justify an invariance principle to distinguish between the PLS1 truncation alternatives. The fundamental orthogonal truncation of PLS1 is illustrated by a Lanczos bidiagonalization type of algorithm where the predictor matrix deflation is required to be different from the standard NIPALS deflation. A mathematical argument concluding the PLS1 inconsistency debate (published in 2009 in this journal) is also presented. Copyright © 2014 John Wiley & Sons, Ltd.
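The deflation-free construction discussed above can be sketched compactly: the orthonormal W-weights come from a Krylov-type recurrence on XᵀXw, the orthonormal T-scores from Gram–Schmidt steps on Xw, and no explicit deflation of X is needed. The Python/NumPy code below is an illustrative reading of that idea, not the MATLAB prototypes from the paper's Appendix; X is assumed centred.

    import numpy as np

    def pls1_deflation_free(X, y, n_comp):
        """PLS1 via a weight recurrence and Gram-Schmidt orthonormalisation,
        instead of NIPALS deflation of the predictor matrix."""
        n, m = X.shape
        W = np.zeros((m, n_comp))                  # orthonormal weights
        T = np.zeros((n, n_comp))                  # orthonormal scores
        for a in range(n_comp):
            w = X.T @ y if a == 0 else X.T @ (X @ W[:, a - 1])   # Krylov recurrence
            w -= W[:, :a] @ (W[:, :a].T @ w)       # orthogonalise against earlier weights
            W[:, a] = w / np.linalg.norm(w)
            t = X @ W[:, a]                        # non-orthogonal score
            t -= T[:, :a] @ (T[:, :a].T @ t)       # Gram-Schmidt against earlier scores
            T[:, a] = t / np.linalg.norm(t)
        # regression coefficients: least squares of y on the columns of X @ W
        b = W @ np.linalg.lstsq(X @ W, y, rcond=None)[0]
        return b, T, W

In exact arithmetic the fitted values agree with NIPALS PLS1 for the same number of components; numerically, the recurrence may require re-orthogonalisation, which is part of what the paper's robust variants address.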

5.
Application of principal component analysis combined with artificial neural networks to quantitative near-infrared spectroscopic analysis
The principal components of the near-infrared spectra were computed by the nonlinear iterative partial least squares (NIPALS) algorithm. After standardization, the principal components were used as the input nodes of a back-propagation (BP) neural network for nonlinear iterative training. The advantages of this approach are that it makes full use of the full-spectrum data, yields optimal noise-reduced principal components, can build nonlinear models, and markedly shortens the training time of the BP network. The method was applied to the quantitative determination of the starch content of barley. The correlation coefficients for calibration and prediction were 0.981 and 0.953, and the relative standard deviations for calibration and prediction were 1.70% and 2.48%, respectively.
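A compact sketch of the scheme described above, using scikit-learn: spectra are compressed to a few principal components, the scores are standardized, and a small back-propagation-type network is trained on them. All names are illustrative, and sklearn's SVD-based PCA and MLPRegressor stand in for the NIPALS principal components and the BP network of the paper.

    from sklearn.decomposition import PCA
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def pca_bp_calibration(X_nir, y_starch, n_pc=6):
        """Compress full-spectrum NIR data to n_pc scores, standardise them,
        and fit a small feed-forward network trained by back-propagation."""
        model = make_pipeline(
            PCA(n_components=n_pc),           # noise-filtered principal components
            StandardScaler(),                 # the standardisation step applied to the scores
            MLPRegressor(hidden_layer_sizes=(5,), max_iter=5000, random_state=0),
        )
        return model.fit(X_nir, y_starch)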

6.
An evaluation of computational performance and precision regarding the cross‐validation error of five partial least squares (PLS) algorithms (NIPALS, modified NIPALS, Kernel, SIMPLS and bidiagonal PLS), available and widely used in the literature, is presented. When dealing with large data sets, computational time is an important issue, mainly in cross‐validation and variable selection. In the present paper, the PLS algorithms are compared in terms of the run time and the relative error in the precision obtained when performing leave‐one‐out cross‐validation using simulated and real data sets. The simulated data sets were investigated through factorial and Latin square experimental designs. The evaluations were based on the number of rows, the number of columns and the number of latent variables. With respect to their performance, the results for both simulated and real data sets have shown that the differences in run time are statistically different. PLS bidiagonal is the fastest algorithm, followed by Kernel and SIMPLS. Regarding cross‐validation error, all algorithms showed similar results. However, in some situations as, for example, when many latent variables were in question, discrepancies were observed, especially with respect to SIMPLS. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

7.
Sârbu C, Pop HF. Talanta, 2005, 65(5): 1215-1220
Principal component analysis (PCA) is a favorite tool in environmetrics for data compression and information extraction. PCA finds linear combinations of the original measurement variables that describe the significant variations in the data. However, it is well-known that PCA, as with any other multivariate statistical method, is sensitive to outliers, missing data, and poor linear correlation between variables due to poorly distributed variables. As a result, data transformations have a large impact upon PCA. In this regard, one of the most powerful approaches to improving PCA appears to be the fuzzification of the matrix data, thus diminishing the influence of outliers. In this paper we discuss and apply a robust fuzzy PCA algorithm (FPCA). The efficiency of the new algorithm is illustrated on a data set concerning the water quality of the Danube River for a period of 11 consecutive years. Considering, for example, a two-component model, FPCA accounts for 91.7% of the total variance whereas PCA accounts for only 39.8%. What is more, PCA showed only a partial separation of the variables and no separation of the scores (samples) on the plane described by the first two principal components, whereas a much sharper differentiation of both variables and scores is observed when FPCA is applied.

8.
The application of a new method to the multivariate analysis of incomplete data sets is described. The new method, called maximum likelihood principal component analysis (MLPCA), is analogous to conventional principal component analysis (PCA), but incorporates measurement error variance information in the decomposition of multivariate data. Missing measurements can be handled in a reliable and simple manner by assigning large measurement uncertainties to them. The problem of missing data is pervasive in chemistry, and MLPCA is applied to three sets of experimental data to illustrate its utility. For exploratory data analysis, a data set from the analysis of archeological artifacts is used to show that the principal components extracted by MLPCA retain much of the original information even when a significant number of measurements are missing. Maximum likelihood projections of censored data can often preserve original clusters among the samples and can, through the propagation of error, indicate which samples are likely to be projected erroneously. To demonstrate its utility in modeling applications, MLPCA is also applied in the development of a model for chromatographic retention based on a data set which is only 80% complete. MLPCA can predict missing values and assign error estimates to these points. Finally, the problem of calibration transfer between instruments can be regarded as a missing data problem in which entire spectra are missing on the ‘slave’ instrument. Using NIR spectra obtained from two instruments, it is shown that spectra on the slave instrument can be predicted from a small subset of calibration transfer samples even if a different wavelength range is employed. Concentration prediction errors obtained by this approach were comparable to cross-validation errors obtained for the slave instrument when all spectra were available.  相似文献   

9.
Stochastic proximity embedding (SPE) was developed as a method for efficiently calculating lower dimensional embeddings of high-dimensional data sets. Rather than using a global minimization scheme, SPE relies upon updating the distances of randomly selected points in an iterative fashion. This was found to generate embeddings of comparable quality to those obtained using classical multidimensional scaling algorithms. However, SPE is able to obtain these results in O(n) rather than O(n²) time and thus is much better suited to large data sets. In an effort both to speed up SPE and utilize it for even larger problems, we have created a multithreaded implementation which takes advantage of the growing general computing power of graphics processing units (GPUs). The use of GPUs allows the embedding of data sets containing millions of data points in interactive time scales.
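The SPE update rule itself can be sketched in a few lines of Python/NumPy (a single-threaded illustration of the basic form, without the neighbourhood cutoff and without the GPU implementation discussed above): randomly chosen pairs of points are repeatedly nudged so that their embedded distance approaches their original-space distance, with a decaying learning rate.

    import numpy as np

    def spe_embed(X, dim=2, n_cycles=100, n_updates=10000, lam=1.0, lam_min=0.01, eps=1e-9):
        """Basic stochastic proximity embedding: pairwise stochastic updates
        pull embedded distances toward original-space distances."""
        rng = np.random.default_rng(0)
        n = X.shape[0]
        Y = rng.uniform(size=(n, dim))                 # random initial embedding
        decay = (lam - lam_min) / n_cycles
        for _ in range(n_cycles):
            for _ in range(n_updates):
                i, j = rng.integers(n), rng.integers(n)
                if i == j:
                    continue
                r = np.linalg.norm(X[i] - X[j])        # target proximity
                d = np.linalg.norm(Y[i] - Y[j])        # current embedded distance
                g = lam * 0.5 * (r - d) / (d + eps)
                Y[i] += g * (Y[i] - Y[j])
                Y[j] += g * (Y[j] - Y[i])
            lam -= decay                               # anneal the learning rate
        return Y

Because each cycle touches only a fixed number of random pairs rather than all n² distances, the cost grows linearly with the number of points, which is the property the abstract exploits.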

10.
Principal component analysis (PCA) is widely used as an exploratory data analysis tool in the field of vibrational spectroscopy, particularly near-infrared (NIR) spectroscopy. PCA represents original spectral data containing a large number of variables by a few feature-containing variables, or scores. Although multiple spectral ranges can be used simultaneously for PCA, only one series of scores, generated by merging the selected spectral ranges, is generally used for qualitative analysis. The alternative, combining independent series of scores generated from separate spectral ranges, has not been exploited. The aim of this study is to evaluate the use of PCA to discriminate between two geographical origins of sesame samples when scores generated independently from separate spectral ranges are optimally combined. An accurate and rapid analytical method to determine the origin is essential for correct value estimation and proper production distribution. Sesame is chosen in this study because its geographical origins are difficult to discriminate visually and its composition is highly complex. For this purpose, we collected diffuse reflectance near-infrared (NIR) spectroscopic data from geographically diverse sesame samples over a period of eight years. The discrimination error obtained by applying linear discriminant analysis (LDA) was lower when separate scores from the two spectral ranges were optimally combined than when scores from the two merged spectral ranges were used.
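The combination strategy described above can be sketched as follows (Python/scikit-learn, illustrative names): PCA scores are computed separately for two spectral ranges, the two independent score sets are concatenated, and LDA is trained on the combined scores. Choosing how many components to take from each range is where the "optimal combination" would be tuned.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def combined_range_lda(X, labels, range1, range2, n_pc=3):
        """Independent PCA scores per spectral range, concatenated and fed to LDA.
        range1 and range2 are slices or index arrays of wavelength channels."""
        scores1 = PCA(n_components=n_pc).fit_transform(X[:, range1])
        scores2 = PCA(n_components=n_pc).fit_transform(X[:, range2])
        combined = np.hstack([scores1, scores2])       # two independent score series
        return LinearDiscriminantAnalysis().fit(combined, labels)

Merging the ranges before a single PCA, by contrast, would yield only one series of scores, which is the conventional approach the study compares against.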

11.
I. Stanimirova. Talanta, 2007, 72(1): 172-178
An efficient methodology for dealing with missing values and outlying observations simultaneously in principal component analysis (PCA) is proposed. The concept described in the paper consists of using a robust technique to obtain robust principal components combined with the expectation maximization approach to process data with missing elements. It is shown that the proposed strategy works well for highly contaminated data containing different amounts of missing elements. The authors come to this conclusion on the basis of the results obtained from a simulation study and from analysis of a real environmental data set.  相似文献   
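A bare-bones sketch of the expectation-maximization idea for missing elements (Python/scikit-learn): missing cells are initialised with column means, a PCA model is fitted, the missing cells are replaced by their model reconstruction, and the loop repeats until the model stabilises. The robust PCA step of the paper is not reproduced here; a robust estimator could be substituted for the classical PCA call to handle outlying observations as well.

    import numpy as np
    from sklearn.decomposition import PCA

    def em_pca_impute(X, n_comp=2, n_iter=50):
        """EM-style imputation loop around a PCA model for data with NaNs."""
        X = np.array(X, dtype=float)
        miss = np.isnan(X)
        Xf = np.where(miss, np.nanmean(X, axis=0), X)   # initial mean imputation
        for _ in range(n_iter):
            pca = PCA(n_components=n_comp)
            scores = pca.fit_transform(Xf)
            recon = pca.inverse_transform(scores)
            Xf[miss] = recon[miss]                      # update the missing cells only
        return Xf, pca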

12.
A fast and objective chemometric classification method is developed and applied to the analysis of gas chromatography (GC) data from five commercial gasoline samples. The gasoline samples serve as model mixtures, whereas the focus is on the development and demonstration of the classification method. The method is based on objective retention time alignment (referred to as piecewise alignment) coupled with analysis of variance (ANOVA) feature selection prior to classification by principal component analysis (PCA) using optimal parameters. The degree-of-class-separation is used as a metric to objectively optimize the alignment and feature selection parameters using a suitable training set thereby reducing user subjectivity, as well as to indicate the success of the PCA clustering and classification. The degree-of-class-separation is calculated using Euclidean distances between the PCA scores of a subset of the replicate runs from two of the five fuel types, i.e., the training set. The unaligned training set that was directly submitted to PCA had a low degree-of-class-separation (0.4), and the PCA scores plot for the raw training set combined with the raw test set failed to correctly cluster the five sample types. After submitting the training set to piecewise alignment, the degree-of-class-separation increased (1.2), but when the same alignment parameters were applied to the training set combined with the test set, the scores plot clustering still did not yield five distinct groups. Applying feature selection to the unaligned training set increased the degree-of-class-separation (4.8), but chemical variations were still obscured by retention time variation and when the same feature selection conditions were used for the training set combined with the test set, only one of the five fuels was clustered correctly. However, piecewise alignment coupled with feature selection yielded a reasonably optimal degree-of-class-separation for the training set (9.2), and when the same alignment and ANOVA parameters were applied to the training set combined with the test set, the PCA scores plot correctly classified the gasoline fingerprints into five distinct clusters.  相似文献   
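A rough sketch of the feature selection and scoring steps (Python with scikit-learn/NumPy, applied after any retention-time alignment): ANOVA F-ratios rank the variables, PCA is run on the retained ones, and a degree-of-class-separation is computed for two classes from their scores. The separation formula below (distance between class-mean scores over the pooled within-class spread) is one plausible reading of the metric; the published definition may differ in detail.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import f_classif

    def degree_of_class_separation(X, labels, n_keep=200, n_pc=2):
        """ANOVA feature selection, PCA, and a two-class separation score."""
        labels = np.asarray(labels)
        F, _ = f_classif(X, labels)                     # ANOVA F-ratio per retention-time point
        keep = np.argsort(F)[-n_keep:]                  # retain the most discriminating features
        scores = PCA(n_components=n_pc).fit_transform(X[:, keep])
        classes = np.unique(labels)
        a, b = (scores[labels == c] for c in classes[:2])
        between = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
        within = np.sqrt(0.5 * (a.var(axis=0).sum() + b.var(axis=0).sum()))
        return between / within

In the workflow above, this score would be computed on the training set to tune the alignment and ANOVA parameters before the full training-plus-test data are projected.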

13.
Because of the importance of the bases in nucleic acids, theoretical calculations on the bases have been reported for many years [1-7]. In this paper, principal component analysis [8], a chemometric method, is applied to the geometric parameters obtained from calculations on five bases: adenine (A), guanine (G), cytosine (C), uracil (U) and thymine (T), with the aim of extracting useful structural information. 1 Methods. The starting geometries of the bases were obtained by three-dimensional optimization in ACD ChemSketch 3.5 [9] (molecular mechanics, CHARMM force field); the atom numbering is shown in Figure 1. All calculations were carried out with the Gaussian 94 program [10] on IBM PC-compatible computers. First, the five bases were treated with six semi-empirical methods (AM1, PM3, MNDO, …

14.
The structure-activity relationship of C-10 substituted artemisinin (QHS) derivatives used as antimalarials was studied with the RS (rough sets) method. An RS analysis is a concise nonlinear procedure with broad prospects for the data mining of nonlinear life processes. In this work, the parameters of the C-10 substituted QHS derivatives were first computed with quantum chemical methods, and an information table was constructed from the parameters (condition attributes) and the biological activity (decision attributes). Based on rough set theory, the core and reducts of the attribute set were obtained. Decision rules were then extracted and the structure-activity relationship was analyzed. As a nonlinear approach, RS theory can extract special relations in the database. It has the advantage of being nonlinear over multiple linear regression (MLR), principal component analysis (PCA), partial least squares (PLS), etc., and the advantage of yielding results with unambiguous physical meaning over artificial neural networks (ANNs), etc. The results of this study are instructive for pharmacodynamics, the resistance mechanism of QHS and the development of QHS derivatives.

15.
A journey into low-dimensional spaces with autoassociative neural networks
Daszykowski M, Walczak B, Massart DL. Talanta, 2003, 59(6): 1095-1105
The compression and visualization of data have always attracted a great deal of interest. Since multidimensional data sets are difficult to interpret and visualize, much attention is devoted to how to compress them efficiently. Usually, compression of dimensionality is considered the first step of exploratory data analysis. Here, we focus our attention on autoassociative neural networks (ANNs), which provide data compression and visualization in a very elegant manner. ANNs can deal with linear and nonlinear correlation among variables, which makes them a very powerful tool in exploratory data analysis. In the literature, ANNs are often referred to as nonlinear principal component analysis (PCA), and due to their specific structure they are also known as bottleneck neural networks. In this paper, ANNs are discussed in detail. Different training modes are described and illustrated on a real example. The usefulness of ANNs for nonlinear data compression and visualization is demonstrated with the aid of the chemical data sets that are the subject of analysis. A comparison of ANNs with the well-known PCA is also presented.
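The bottleneck idea can be illustrated with a small autoassociative network in Python/scikit-learn (a simple stand-in for the networks discussed in the paper): the model is trained to reproduce its own standardized input through a narrow hidden layer, and the activations of that layer serve as nonlinear "scores" for visualization.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    def bottleneck_embedding(X, bottleneck=2):
        """Autoassociative (bottleneck) network: target = input; the activations
        of the narrow middle layer give a nonlinear low-dimensional mapping."""
        Xs = StandardScaler().fit_transform(X)
        net = MLPRegressor(hidden_layer_sizes=(10, bottleneck, 10),
                           activation="tanh", max_iter=20000, random_state=0)
        net.fit(Xs, Xs)                                  # autoassociative training
        # propagate by hand up to the bottleneck layer to read out the coordinates
        a = Xs
        for W, b in list(zip(net.coefs_, net.intercepts_))[:2]:
            a = np.tanh(a @ W + b)
        return a

With linear activations and a single hidden layer, such a network recovers essentially the PCA subspace, which is why bottleneck ANNs are often described as nonlinear PCA.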

16.
Nonlinear least-squares regression is a valuable tool for gaining chemical insights into complex systems. Yet, the success of nonlinear regression as measured by residual sum of squares (RSS), correlation, and reproducibility of fit parameters strongly depends on the availability of a good initial solution. Without one, iterative algorithms quickly become trapped in an unfavorable local RSS-minimum. For determining an initial solution, a high-dimensional parameter space needs to be screened, a process that is very time-consuming but can be parallelized. Another advantage of parallelization is equally important: After determining initial solutions, the processors can each be tasked to optimize an initial guess. Even if several of these optimizations become stuck in a shallow local RSS-minimum, other processors continue and improve the regression outcome. A software package for parallel processing-based constrained nonlinear regression (RegressionLab) has been developed, implemented, and tested on a variety of hardware configurations. As proof-of-principle, microalgae-environment interactions have been studied by infrared attenuated total reflection spectroscopy. Additionally, light microscopy has been used to monitor cell production. It is shown that spectroscopic data sets with 10,000s of data points and >1000 nonlinear model parameters, as well as imaging data with 100,000s of data points and >2000 nonlinear model parameters, may now be investigated by constrained nonlinear regression. Acceleration factors of up to 8.1 have been obtained, which is of high practical relevance when computations take weeks on single-processor machines. Solely using parallel processing, the RSS values may be improved up to a factor of 5.5.
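A generic multi-start sketch of the strategy described above (Python with NumPy/SciPy and multiprocessing; RegressionLab itself is not shown): many random initial guesses are refined in parallel by bound-constrained least squares, and the parameter set with the smallest residual sum of squares is kept. The exponential toy model and all names are illustrative only.

    import numpy as np
    from multiprocessing import Pool
    from scipy.optimize import least_squares

    # toy model; real spectroscopic models may carry thousands of parameters
    def residuals(theta, x, y):
        a, k, c = theta
        return a * np.exp(-k * x) + c - y

    def refine(args):
        theta0, x, y, bounds = args
        fit = least_squares(residuals, theta0, bounds=bounds, args=(x, y))
        return fit.cost, fit.x                           # cost is 0.5 * RSS

    def multistart_fit(x, y, bounds, n_starts=64, n_procs=8):
        """Screen random initial guesses in parallel, refine each by constrained
        least squares, and return the parameters with the lowest RSS."""
        rng = np.random.default_rng(0)
        lo, hi = map(np.asarray, bounds)
        starts = [lo + rng.uniform(size=lo.size) * (hi - lo) for _ in range(n_starts)]
        with Pool(n_procs) as pool:                      # call under __main__ guard on spawn platforms
            results = pool.map(refine, [(s, x, y, bounds) for s in starts])
        return min(results, key=lambda r: r[0])[1]

Even when some workers stall in shallow local minima, the others keep improving the best fit, which is the point the abstract makes about parallel multi-start regression.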

17.
Principal component analysis (PCA) is a favorite tool in chemometrics for data compression and information extraction. PCA finds linear combinations of the original measurement variables that describe the significant variations in the data. However, it is well-known that PCA, as with any other multivariate statistical method, is sensitive to outliers, missing data, and poor linear correlation between variables due to poorly distributed variables. As a result data transformations have a large impact upon PCA. In this regard one of the most powerful approaches to improve PCA appears to be the fuzzification of the matrix data, thus diminishing the influence of outliers. In this paper we discuss a robust fuzzy PCA algorithm (FPCA). The new algorithm is illustrated on a data set concerning interaction of carbon-hydrogen bonds with transition metal-oxo bonds in molybdenum complexes. Considering, for example, a two component model, FPCA accounts for 97.20% of the total variance and PCA accounts only for 69.75%.  相似文献   

18.
19.
Monolayers of colloidal spheres are used as masks in nanosphere lithography (NSL) for the selective deposition of nanostructured layers. Several methods exist for the formation of self-organized particle monolayers, among which spin coating appears to be very promising. However, a spin coating process is defined by several parameters like several ramps, rotation speeds, and durations. All parameters influence the spreading and drying of the droplet containing the particles. Moreover, scientists are confronted with the formation of numerous defects in spin coated layers, limiting well-ordered areas to a few micrometers squared. So far, empiricism has mainly ruled the world of nanoparticle self-organization by spin coating, and much of the literature is experimentally based. Therefore, the development of experimental protocols to control the ordering of particles is a major goal for further progress in NSL. We applied experimental design to spin coating, to evaluate the efficiency of this method to extract and model the relationships between the experimental parameters and the degree of ordering in the particles monolayers. A set of experiments was generated by the MODDE software and applied to the spin coating of latex suspension (diameter 490 nm). We calculated the ordering by a homemade image analysis tool. The results of partial least squares (PLS) modeling show that the proposed mathematical model only fits data from strictly monolayers but is not predictive for new sets of parameters. We submitted the data to principal component analysis (PCA) that was able to explain 91% of the results when based on strictly monolayered samples. PCA shows that the ordering was positively correlated to the ramp time and negatively correlated to the first rotation speed. We obtain large defect-free domains with the best set of parameters tested in this study. This protocol leads to areas of 200 μm(2), which has never been reported so far.  相似文献   

20.
Machine learning models predicting the bioactivity of chemical compounds belong nowadays to the standard tools of cheminformaticians and computational medicinal chemists. Multi-task and federated learning are promising machine learning approaches that allow privacy-preserving usage of large amounts of data from diverse sources, which is crucial for achieving good generalization and high-performance results. Using large, real world data sets from six pharmaceutical companies, here we investigate different strategies for averaging weighted task loss functions to train multi-task bioactivity classification models. The weighting strategies shall be suitable for federated learning and ensure that learning efforts are well distributed even if data are diverse. Comparing several approaches using weights that depend on the number of sub-tasks per assay, task size, and class balance, respectively, we find that a simple sub-task weighting approach leads to robust model performance for all investigated data sets and is especially suited for federated learning.  相似文献   
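The sub-task weighting found to be most robust can be sketched in a few lines (Python/NumPy, illustrative names; the companies' actual federated training code is of course not shown): each assay's classification loss enters the shared objective with a weight proportional to its number of sub-tasks, normalized over all assays.

    import numpy as np

    def combine_task_losses(task_losses, subtasks_per_assay):
        """Weighted average of per-assay losses; weights are proportional to the
        number of sub-tasks per assay (one of the strategies compared above)."""
        w = np.asarray(subtasks_per_assay, dtype=float)
        w /= w.sum()                                    # normalise weights over assays
        return float(np.dot(w, np.asarray(task_losses, dtype=float)))

    # usage sketch: total = combine_task_losses([0.41, 0.63, 0.28], [3, 1, 5])

Alternative schemes mentioned in the abstract would replace subtasks_per_assay with task sizes or class-balance factors.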
