Similar Articles
 20 similar articles found (search time: 187 ms)
1.
Multicollinearity in regression data is common in real-life applications. Instead of applying ordinary regression methods, biased regression techniques such as principal component regression and ridge regression have been developed to cope with such datasets. In this paper, we consider partial least squares (PLS) regression by means of the SIMPLS algorithm. Because the SIMPLS algorithm is based on the empirical variance-covariance matrix of the data and on least squares regression, outliers have a damaging effect on the estimates. To reduce this pernicious effect of outliers, we propose to replace the empirical variance-covariance matrix in SIMPLS by a robust covariance estimator. We derive the influence function of the resulting PLS weight vectors and regression estimates, and conclude that they are bounded if the robust covariance estimator has a bounded influence function. The breakdown value is also inherited from the robust estimator. We illustrate the results using the MCD estimator and the reweighted MCD (RMCD) estimator for low-dimensional datasets, and provide some empirical properties for a high-dimensional dataset.
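The abstract's key observation — that SIMPLS operates entirely on (cross-)covariance matrices, so a robust estimate such as MCD can be substituted for the empirical one without changing the algorithm — can be sketched as follows. This is a minimal single-response SIMPLS weight computation in plain NumPy, not the authors' implementation; the function name and interface are hypothetical.

```python
import numpy as np

def simpls_weights(S_xx, s_xy, n_comp):
    """SIMPLS weight vectors for a single response, computed purely from
    a covariance matrix S_xx and cross-covariance vector s_xy -- so a
    robust estimate (e.g. MCD) can be plugged in unchanged."""
    p = s_xy.shape[0]
    R = np.zeros((p, n_comp))   # weight vectors
    V = np.zeros((p, n_comp))   # orthonormal basis of X-loadings
    s = s_xy.copy()
    for a in range(n_comp):
        r = s / np.sqrt(s @ S_xx @ s)        # unit-variance score direction
        v = S_xx @ r                         # corresponding X-loading
        v -= V[:, :a] @ (V[:, :a].T @ v)     # orthogonalize to earlier loadings
        v /= np.linalg.norm(v)
        s = s - v * (v @ s)                  # deflate the cross-covariance
        R[:, a], V[:, a] = r, v
    return R
```

With this normalization the score vectors are uncorrelated with unit variance in the metric of `S_xx`, i.e. `R.T @ S_xx @ R` is the identity.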

2.
A novel method for underdetermined regression problems, multicomponent self-organizing regression (MCSOR), has recently been introduced. Here, its performance is compared with partial least squares (PLS), perhaps the most widely adopted multivariate method in chemometrics. A potpourri of models is presented, and MCSOR provides highly predictive models that are comparable with or better than the corresponding PLS models in large internal (leave-one-out, LOO) and pseudo-external (leave-many-out, LMO) validation tests. The "blind" external predictive ability of MCSOR and PLS is demonstrated on large melting point, factor Xa, log P, and log S data sets. In a nutshell, MCSOR is fast, conceptually simple (employing multiple linear regression, MLR, as its statistical tool), and applicable to all kinds of multivariate problems with a single Y-variable.

3.

4.
Analytical Letters, 2012, 45(13): 2238–2254
A new variable selection method called ensemble regression coefficient analysis is reported on the basis of model population analysis. To construct ensemble regression coefficients, many subsets of variables are randomly selected to calibrate corresponding partial least squares models. Following ensemble theory, the mean of the regression coefficients across these models is taken as the ensemble regression coefficient. The absolute value of the ensemble regression coefficient can then be used as an informative vector for variable selection. The performance of ensemble regression coefficient analysis was assessed on four near-infrared datasets: two simulated datasets, one wheat dataset, and one tobacco dataset. The results showed that this approach selects important variables and yields lower errors than regression coefficient analysis and Monte Carlo uninformative variable elimination.

5.
An evaluation of the computational performance and precision, with respect to cross-validation error, of five partial least squares (PLS) algorithms (NIPALS, modified NIPALS, Kernel, SIMPLS, and bidiagonal PLS) that are available and widely used in the literature is presented. When dealing with large data sets, computational time is an important issue, mainly in cross-validation and variable selection. In the present paper, the PLS algorithms are compared in terms of run time and the relative error in precision obtained when performing leave-one-out cross-validation on simulated and real data sets. The simulated data sets were investigated through factorial and Latin square experimental designs, and the evaluations were based on the number of rows, the number of columns, and the number of latent variables. For both simulated and real data sets, the differences in run time are statistically significant: bidiagonal PLS is the fastest algorithm, followed by Kernel and SIMPLS. Regarding cross-validation error, all algorithms showed similar results, although discrepancies were observed in some situations, for example when many latent variables were involved, especially with respect to SIMPLS. Copyright © 2010 John Wiley & Sons, Ltd.

6.
Sorbic (SOR) and benzoic (BEN) acids were determined in fruit juice samples using a net analyte signal-based methodology named HLA/GO (a hybrid linear analysis presented by Goicoechea and Olivieri) applied to spectroscopic signals. The calibration set was built with several fruit juices in order to capture natural variability, with concentrations of both analytes covering the range usually present in commercial samples. Relative errors of prediction (REP%) of 3.6% and 5.2% were calculated for SOR and BEN, respectively. Several figures of merit were calculated: sensitivity, selectivity, analytical sensitivity, and limit of detection. The method is quantitative, with reasonably good recoveries and excellent precision (less than 1%). Wavelength selection based on the concept of net analyte signal regression was applied, and it improved the method's performance on samples containing non-modelled interferences, e.g. fruit juices different from those used to build the calibration model.
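The net analyte signal at the heart of methods like HLA/GO is the part of an analyte's spectrum orthogonal to the space spanned by the other contributions. A minimal NumPy sketch (function name hypothetical, not the HLA/GO implementation):

```python
import numpy as np

def net_analyte_signal(s_analyte, S_interf):
    """Project the analyte's spectrum onto the orthogonal complement of
    the space spanned by the interferents' spectra (columns of S_interf)."""
    P = S_interf @ np.linalg.pinv(S_interf)   # projector onto interferent space
    return s_analyte - P @ s_analyte          # what remains is analyte-specific
```

The resulting vector is, by construction, orthogonal to every interferent spectrum, which is what makes figures of merit such as selectivity and sensitivity computable per analyte.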

7.
Ideally, the score vectors numerically computed by an orthogonal scores partial least squares (PLS) algorithm should be orthogonal close to machine precision. However, this is not ensured without taking special precautions. The progressive loss of orthogonality with increasing number of components is illustrated for two widely used PLS algorithms: one that can be considered a standard PLS algorithm, and SIMPLS. It is shown that the original standard PLS algorithm outperforms the original SIMPLS in terms of numerical stability; however, SIMPLS is confirmed to perform much better in terms of speed. We have investigated reorthogonalization as the special precaution to ensure orthogonality close to machine precision. Since the resulting increase in computing time is relatively small for SIMPLS, we recommend SIMPLS with reorthogonalization. Copyright © 2008 John Wiley & Sons, Ltd.
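The precaution the paper recommends can be sketched with a minimal NIPALS-style PLS1 in which each new score vector is optionally re-orthogonalized (Gram–Schmidt) against all earlier scores. This is an illustrative sketch, not any of the paper's exact implementations.

```python
import numpy as np

def pls1_scores(X, y, n_comp, reorthogonalize=False):
    """NIPALS-style PLS1 score vectors, optionally re-orthogonalized
    against all earlier scores to suppress numerical drift."""
    Xd = X - X.mean(axis=0)
    yd = y - y.mean()
    T = np.zeros((X.shape[0], n_comp))
    for a in range(n_comp):
        w = Xd.T @ yd
        w /= np.linalg.norm(w)
        t = Xd @ w
        if reorthogonalize:                     # Gram-Schmidt cleanup step
            for b in range(a):
                t -= T[:, b] * (T[:, b] @ t) / (T[:, b] @ T[:, b])
        p = Xd.T @ t / (t @ t)
        Xd -= np.outer(t, p)                    # deflate X
        yd -= t * (t @ yd) / (t @ t)            # deflate y
        T[:, a] = t
    return T

def max_offdiag(T):
    """Largest off-diagonal entry of the normalized score Gram matrix --
    a direct measure of the loss of orthogonality."""
    G = T.T @ T
    Tn = G / np.sqrt(np.outer(np.diag(G), np.diag(G)))
    return np.max(np.abs(Tn - np.eye(T.shape[1])))
```

On small, well-conditioned data both variants stay orthogonal; the drift the paper measures only becomes visible with many components on large or ill-conditioned matrices.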

8.
Glide SP mode enrichment results for two preparations of the DUD dataset and native ligand docking RMSDs for two preparations of the Astex dataset are presented. Following a best-practices preparation scheme, an average RMSD of 1.140 Å for native ligand docking with Glide SP is computed. Following the same best-practices preparation scheme for the DUD dataset, an average area under the ROC curve (AUC) of 0.80 and an average early enrichment via the ROC (0.1 %) metric of 0.12 were observed. 74 and 56 % of the 39 best-practices prepared targets showed AUC over 0.7 and 0.8, respectively. Average AUC was greater than 0.7 for all best-practices protein families, demonstrating consistent enrichment performance across a broad range of proteins and ligand chemotypes. In both the Astex and DUD datasets, docking performance is significantly improved by employing a best-practices preparation scheme over using minimally prepared structures from the PDB. Enrichment results for WScore, a new scoring function and sampling methodology integrating WaterMap and Glide, are presented for four DUD targets: hivrt, hsp90, cdk2, and fxa. WScore performance in early enrichment is consistently strong, and all systems examined show AUC > 0.9 and superior early enrichment to the DUD best-practices Glide SP results.

9.
Conventional X-ray microcomputed tomography (micro-CT) is not usually sufficient to determine microscopic compositional distributions, as it is limited to measuring the X-ray attenuation of the sample, which can be similar for materials of different composition. In contrast, the present work enables three-dimensional compositional analysis with a data-constrained microstructure (DCM) modeling methodology, which uses two or more CT datasets acquired with different X-ray spectra and incorporates them as model constraints. To provide input data for DCM, we have also developed a micro-CT data collection method that acquires two datasets with different X-ray spectra in parallel. These data are used together with the DCM methodology to predict the distributions of corrosion inhibitor and filler in a polymer matrix. The DCM-predicted compositional microstructures show reasonable agreement with energy-dispersive X-ray images taken on the sample surface.

10.
Untargeted metabolomics based on liquid chromatography coupled with mass spectrometry (LC–MS) can detect thousands of features per sample and produce highly complex datasets. The accurate extraction of meaningful features and the building of discriminant models are two crucial steps in the data analysis pipeline of untargeted metabolomics. In this study, pure ion chromatograms were extracted from a liquor dataset and a left-sided colon cancer (LCC) dataset by the K-means-clustering-based Pure Ion Chromatogram extraction method version 2.0 (KPIC2). Nonlinear low-dimensional embedding by uniform manifold approximation and projection (UMAP) then showed the separation of samples from different groups in reduced dimensions. Discriminant models were established by extreme gradient boosting (XGBoost) using the features extracted by KPIC2, achieving 100% classification accuracy on the test sets of both the liquor and LCC datasets, compared with 92% and 96%, respectively, for features extracted by XCMS. XGBoost also achieved better performance than linear and traditional nonlinear modeling methods on these datasets. UMAP and XGBoost are integrated into the KPIC2 package to extend its capability in complex situations: it not only processes nonlinear datasets effectively but also greatly improves the accuracy of data analysis in untargeted metabolomics.
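The overall shape of the pipeline — extracted feature table, low-dimensional embedding for visual inspection, then a boosted-tree classifier — can be sketched with scikit-learn stand-ins so it runs without extra dependencies. Note the substitutions: PCA stands in for UMAP and `GradientBoostingClassifier` for XGBoost; the synthetic feature table stands in for KPIC2 output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in feature table (rows = samples, columns = extracted ion features)
X, y = make_classification(n_samples=120, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

# Low-dimensional embedding of the training samples for visual inspection
emb = PCA(n_components=2).fit_transform(X_tr)

# Boosted-tree discriminant model evaluated on the held-out set
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

With `umap-learn` and `xgboost` installed, `PCA` and `GradientBoostingClassifier` would be replaced by `umap.UMAP` and `xgboost.XGBClassifier` in the same two slots.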

11.
The use of classification and regression trees (CART) was studied in a quantitative structure-retention relationship (QSRR) context to predict retention in 13 thin-layer chromatographic screening systems on silica gel, for which large datasets of interlaboratory-determined retention are available. The response (dependent variable) was the RM factor, while a set of atomic contributions and functional substituent counts was used as the explanatory dataset. The trees were tuned for optimal complexity (number of leaves) by external validation and internal cross-validation. Their predictive performance is slightly lower than that of the full atomic contribution model, but their main advantage is simplicity: retention prediction with the proposed trees can be done without a computer or even a pocket calculator.
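A toy version of such a tree-based QSRR model can be sketched with scikit-learn. The substituent-count descriptors and the additive retention response below are hypothetical stand-ins; the real study used interlaboratory TLC data and tuned the number of leaves by validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
# Hypothetical descriptor table: counts of six substituent types per compound
X = rng.integers(0, 4, size=(80, 6)).astype(float)
# Hypothetical additive retention (RM) built from per-substituent contributions
rm = X @ np.array([0.30, -0.15, 0.20, 0.05, -0.25, 0.10]) \
     + 0.05 * rng.normal(size=80)

tree = DecisionTreeRegressor(random_state=0).fit(X, rm)
r2 = tree.score(X, rm)   # training fit only; cross-validate in practice
```

The fitted tree reduces prediction to a handful of yes/no questions about substituent counts, which is exactly what makes "no calculator needed" prediction possible.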

12.
13.
With an increasing number of publicly available microarray datasets, it becomes attractive to borrow information from other relevant studies for a more reliable and powerful analysis of a given dataset. We do not assume that subjects in the current study and in other relevant studies are drawn from the same population, as meta-analysis does; in particular, the set of parameters in the current study may differ from that of the other studies. We consider sample classification based on gene expression profiles in this context. We propose two new methods, a weighted partial least squares (WPLS) method and a weighted penalized partial least squares (WPPLS) method, to build a classifier from a combined use of multiple datasets. The methods can weight the individual datasets according to their relevance to the current study. A more standard approach is first to build a classifier on each of the individual datasets, and then to combine the outputs of the multiple classifiers by weighted voting. Using two quite different datasets on human heart failure, we show first that WPLS/WPPLS, by borrowing information from the other dataset, can improve the performance of PLS/PPLS built on only a single dataset; second, that WPLS/WPPLS performs better than the standard approach of combining multiple classifiers; and third, that WPPLS can improve over WPLS, just as PPLS does over PLS for a single dataset.
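The "standard approach" the abstract compares against — one classifier per dataset, combined by a relevance-weighted vote — can be sketched with scikit-learn. Logistic regression stands in for PLS/PPLS classifiers, and the datasets and the relevance weight are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# "Current" study (split into train/test) and a less relevant auxiliary study
Xc, yc = make_classification(n_samples=200, n_features=20, n_informative=5,
                             random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(Xc, yc, test_size=0.3,
                                          random_state=0)
X_aux, y_aux = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=1)

clf_cur = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
clf_aux = LogisticRegression(max_iter=1000).fit(X_aux, y_aux)

w = 0.8   # hypothetical relevance weight for the current-study classifier
proba = w * clf_cur.predict_proba(X_te) + (1 - w) * clf_aux.predict_proba(X_te)
y_pred = proba.argmax(axis=1)
acc = float((y_pred == y_te).mean())
```

WPLS/WPPLS instead push the weighting inside the model fit itself, which is what the paper shows to work better than this post-hoc vote.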

14.
15.
Kernel partial least squares (KPLS), a nonlinear extension of linear PLS, has become a popular technique for chemical and biological modeling. Training samples are transformed into a feature space via a nonlinear mapping, and the PLS algorithm is then carried out in that feature space. However, one of the main limitations of KPLS is that each feature is given the same importance in the kernel matrix, which explains the poor performance of KPLS on data with many irrelevant features. In this study, we provide a new strategy that incorporates variable importance into KPLS, termed the WKPLS approach. By modifying the kernel matrix, WKPLS provides a feasible way to differentiate between true and noise variables. Based on the fact that the regression coefficients of a PLS model reflect the importance of variables, we first obtain normalized regression coefficients by establishing a PLS model with all the variables; variable importance is then incorporated into the primary kernel. The performance of WKPLS is investigated on one simulated dataset and two structure–activity relationship (SAR) datasets. Compared with standard linear-kernel and Gaussian-kernel PLS, the results show that WKPLS yields superior prediction performance. WKPLS can thus be considered a good mechanism for improving KPLS for SAR modeling by introducing extra information.
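The core idea — folding coefficient-derived variable importance into the kernel — can be sketched for a linear kernel; for a Gaussian kernel the same weights would enter the squared distances analogously. The function name is hypothetical and this is not the paper's implementation.

```python
import numpy as np

def coefficient_weighted_kernel(X, Z, beta):
    """Linear kernel with each variable scaled by its normalized PLS
    regression-coefficient magnitude, so variables with near-zero
    coefficients contribute little to the kernel matrix."""
    w = np.abs(beta) / np.abs(beta).max()   # importance weights in [0, 1]
    return (X * w) @ (Z * w).T
```

A variable whose coefficient is exactly zero is removed from the kernel entirely, which is precisely the true-vs-noise differentiation the abstract describes.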

16.
17.
The most straightforward way to analyze a GC–MS dataset is to integrate the peaks that can be identified by their MS profiles and to perform a principal component analysis (PCA). This procedure has some important drawbacks: baseline drifts are scarcely considered, and integration boundaries are not always well defined (long tails, co-eluted peaks, etc.). To improve the chromatographic data analysis, this work proposes modeling the raw dataset with the PARAFAC2 algorithm in selected areas of the GC profile and using the resulting well-resolved chromatographic profiles to develop a further PCA model. With this working method, not only are the problems arising from instrumental artifacts overcome, but new analytes are detected and a better understanding of the studied dataset is obtained; as a positive consequence, analyst time and effort are saved. To exemplify this methodology, the aroma profiles of 36 ripening apples were studied. The benefits of the proposed methodology (PARAFAC2 + PCA) are shown from a practitioner's perspective, and the conclusions obtained here can be extrapolated to other hyphenated chromatographic datasets.

18.
Imbalanced datasets are commonly generated by high-throughput screening (HTS). For such a dataset, without taking the imbalanced nature into account, most classification methods produce high predictive accuracy for the majority class but significantly poor performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and applied to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combinatorial method, the rare samples (active compounds), for which poor results are usually obtained, can be detected with high balanced accuracy (Gmean). As a comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that GLMBoost + SMOTE not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (sensitivity) and Gmean, but also demonstrates greater computational efficiency than RF + SMOTE. We therefore hope that the proposed combinatorial algorithm based on GLMBoost and SMOTE can be widely used to tackle imbalanced classification problems.
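The SMOTE step can be sketched in plain NumPy: each synthetic minority sample is a random interpolation between a real minority sample and one of its k nearest minority neighbours. This is a minimal illustration, not the `imbalanced-learn` implementation the field typically uses.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, seed=0):
    """Minimal SMOTE: each synthetic point interpolates between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    # Pairwise squared distances within the minority class
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]           # k nearest neighbours each
    out = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        a = rng.integers(n)
        b = nn[a, rng.integers(min(k, n - 1))]
        lam = rng.random()                        # interpolation factor
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out
```

The oversampled minority set (`X_min` plus the synthetic points) is then fed, together with the majority class, to the boosted classifier.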

19.
An approach for the analysis of large experimental datasets in electrochemical impedance spectroscopy (EIS) has been developed. The approach uses the idea of successive Bayesian estimation and splits multidimensional EIS datasets into parts of reduced dimensionality. The parameters of the EIS models are then estimated successively, from one part to the next, using the complex nonlinear least squares (CNLS) method, with the results obtained in the previous step used as a priori values (in Bayesian form) for the analysis of the next part. To provide high stability of the sequential CNLS minimisation procedure, a new hybrid algorithm has been developed. This algorithm fits the datasets of reduced dimensionality to the selected EIS models, provides high stability of the fitting, and allows semi-automatic data analysis on a reasonable timescale. It consists of two stages in which different zero-order optimisation strategies are used, reducing both the computational time and the probability of overlooking the global optimum. The performance of the developed approach has been evaluated using (i) a large simulated EIS dataset representing a possible output of scanning electrochemical impedance microscopy experiments, and (ii) an experimental dataset in which EIS spectra were acquired as a function of electrode potential and time. The developed data analysis strategy shows promise and can be extended to other electroanalytical EIS applications that require multidimensional data analysis.
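The core step — CNLS fitting of one data slice with the previous slice's estimate acting as a Gaussian prior — can be sketched with SciPy by appending penalty residuals to the complex-impedance residuals. The Randles-type circuit, parameter values, and prior widths below are illustrative assumptions, not the paper's models.

```python
import numpy as np
from scipy.optimize import least_squares

def z_model(p, omega):
    """Simple Randles-type impedance: Rs + Rct / (1 + j*omega*Rct*Cdl)."""
    Rs, Rct, Cdl = p
    return Rs + Rct / (1 + 1j * omega * Rct * Cdl)

def cnls_with_prior(z_obs, omega, p_prior, sigma_prior):
    """Complex nonlinear least squares with a Gaussian (Bayesian) penalty
    pulling parameters toward the estimate from the previous data slice."""
    def residuals(p):
        dz = z_model(p, omega) - z_obs
        prior = (p - p_prior) / sigma_prior      # Bayesian penalty residuals
        return np.concatenate([dz.real, dz.imag, prior])
    return least_squares(residuals, p_prior, method="lm").x

omega = np.logspace(-1, 4, 40)                   # angular frequencies, rad/s
p_true = np.array([10.0, 100.0, 1e-4])           # Rs, Rct, Cdl (assumed)
z_obs = z_model(p_true, omega)                   # noise-free simulated slice
p_hat = cnls_with_prior(z_obs, omega,
                        p_prior=np.array([12.0, 80.0, 2e-4]),
                        sigma_prior=np.array([5.0, 50.0, 1e-4]))
```

Iterating this over successive slices, with each `p_hat` becoming the next slice's `p_prior`, gives the successive Bayesian estimation scheme the abstract describes.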

20.
The Random Projection (RP) technique has been widely applied in many scenarios because it can reduce high-dimensional features to a low-dimensional space in a short time, meeting the need for real-time analysis of massive data. With the rapid growth of genomics data, there is an urgent need for dimensionality reduction; however, the classification performance of RP alone is usually low. We attempt to improve the classification accuracy of RP by combining it with other dimensionality reduction methods such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Feature Selection (FS). We compared the classification accuracy and running time of the different combinations on three microarray datasets and a simulated dataset. Experimental results show a remarkable improvement of 14.77% in classification accuracy for FS followed by RP, compared with RP alone, on the BC-TCGA dataset. LDA followed by RP also helps RP to yield a more discriminative subspace, with an increase of 13.65% in classification accuracy on the same dataset. FS followed by RP outperforms the other combinations in classification accuracy on most of the datasets.
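The best-performing combination reported (FS followed by RP) can be sketched with scikit-learn: univariate selection prunes the feature set, then a Gaussian random projection compresses it before classification. The dataset, dimensions, and classifier are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.random_projection import GaussianRandomProjection

X, y = make_classification(n_samples=150, n_features=500, n_informative=10,
                           random_state=0)

# Feature selection first, then random projection of the retained features
# (for an unbiased comparison, selection should be nested inside the CV loop)
X_fs = SelectKBest(f_classif, k=50).fit_transform(X, y)
X_rp = GaussianRandomProjection(n_components=20,
                                random_state=0).fit_transform(X_fs)

clf = LogisticRegression(max_iter=1000)
X_rp_only = GaussianRandomProjection(n_components=20,
                                     random_state=0).fit_transform(X)
acc_rp_only = cross_val_score(clf, X_rp_only, y, cv=5).mean()
acc_fs_rp = cross_val_score(clf, X_rp, y, cv=5).mean()
```

Projecting only the pre-selected features keeps the informative signal from being diluted by the hundreds of noise variables, which is the mechanism behind the accuracy gains the abstract reports.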
