Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Outlier detection is crucial in building a highly predictive model. In this study, we proposed an enhanced Monte Carlo outlier detection method by establishing cross‐prediction models based on determinate normal samples and analyzing the distribution of prediction errors individually for dubious samples. One simulated and three real datasets were used to illustrate and validate the performance of our method, and the results indicated that this method outperformed Monte Carlo outlier detection in outlier diagnosis. After these outliers were removed, the value of validation by Kovats retention indices and the root mean square error of prediction decreased from 3.195 to 1.655, and the average cross‐validation prediction error decreased from 2.0341 to 1.2780. This method helps establish a good model by eliminating outliers. © 2015 Wiley Periodicals, Inc.
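The cross-prediction idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: an ordinary least-squares line stands in for the calibration model, and the function names, toy data, and the cutoff of 3.0 are all assumptions made for illustration. Models are trained on random subsets of samples already accepted as normal, and the distribution of prediction errors is then summarised separately for each dubious sample.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b*x (a stand-in for a real calibration model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cross_prediction_errors(normal, dubious, n_models=200, subset_size=10, seed=0):
    """Predict each dubious sample with models trained on random subsets of the
    normal samples only, and return (mean, stdev) of its prediction errors."""
    rng = random.Random(seed)
    errors = [[] for _ in dubious]
    for _ in range(n_models):
        subset = rng.sample(normal, subset_size)
        a, b = fit_line([x for x, _ in subset], [y for _, y in subset])
        for k, (x, y) in enumerate(dubious):
            errors[k].append(y - (a + b * x))
    return [(statistics.fmean(e), statistics.stdev(e)) for e in errors]


# Hypothetical toy data: 16 normal samples close to y = 2x + 1 and two dubious samples.
normal = [(float(i), 2.0 * i + 1.0 + 0.1 * ((i % 3) - 1)) for i in range(16)]
dubious = [(5.5, 12.0), (8.5, 30.0)]  # the second one lies far off the trend
summary = cross_prediction_errors(normal, dubious)
flags = [abs(mean) > 3.0 for mean, _ in summary]  # 3.0 is an arbitrary cutoff
```

Because the dubious samples never enter the training subsets, a true outlier cannot mask itself by pulling the model toward its own value.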

2.
Widely used regression approaches in modeling quantitative structure-property relationships, such as PLS regression, are highly susceptible to outlying observations that will impair the prognostic value of a model. Our aim is to compile homogeneous datasets as the basis for regression modeling by removing outlying compounds and applying variable selection. We investigate different approaches to create robust, outlier-resistant regression models in the field of prediction of drug molecules' permeability. The objective is to combine the strengths of outlier detection and variable elimination to increase the predictive power of prognostic regression models. In conclusion, outlier detection is employed to identify multiple, homogeneous data subsets for regression modeling.

3.
4.
Near-infrared (NIR) spectrometry will present a more promising tool for quantitative measurement if the robustness and predictive ability of the partial least squares (PLS) model are improved. To this end, we present a new algorithm for simultaneous wavelength selection and outlier detection; at the same time, the problems of background and noise in multivariate calibration are also solved. The strategy is a combination of continuous wavelet transform (CWT) and modified iterative predictors and objects weighting PLS (mIPOW-PLS). CWT is performed as a pretreatment tool for eliminating background and noise synchronously; then, mIPOW-PLS is proposed to remove both the useless wavelengths and the multiple outliers in the CWT domain. After pretreatment with CWT-mIPOW-PLS, a final PLS model is built for prediction. The results indicate that the combination of CWT and mIPOW-PLS produces robust and parsimonious regression models with very few wavelengths.

5.
An outlier detection method is proposed for near-infrared spectral analysis. The underlying philosophy of the method is that, in random test (Monte Carlo) cross-validation, the probability of outliers appearing in good models with a smaller prediction residual error sum of squares (PRESS) or in bad models with a larger PRESS should be obviously different from that of normal samples. The method first builds a large number of PLS models by random test cross-validation, then sorts the models by PRESS, and finally recognizes the outliers according to the accumulative probability of each sample in the sorted models. For validation of the proposed method, four data sets, including three published data sets and a large data set of tobacco lamina, were investigated. The proposed method proved to be highly efficient and accurate compared with the conventional leave-one-out (LOO) cross-validation method.
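The build, sort, and count steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: a least-squares line replaces the PLS model, and instead of the full accumulative probability curve the sketch only reports how often each sample sits in the validation subset of the worst 10% of models. The function name, the toy data, and all parameter values are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b*x (a stand-in for a PLS model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def validation_frequency_in_worst_models(xs, ys, n_models=200, n_val=6,
                                         worst_fraction=0.1, seed=0):
    """Build many models on random splits, sort them by PRESS, and return for
    each sample the fraction of the worst models whose validation set holds it."""
    rng = random.Random(seed)
    n = len(xs)
    models = []  # list of (PRESS, validation indices)
    for _ in range(n_models):
        val = set(rng.sample(range(n), n_val))
        cal = [i for i in range(n) if i not in val]
        a, b = fit_line([xs[i] for i in cal], [ys[i] for i in cal])
        press = sum((ys[i] - (a + b * xs[i])) ** 2 for i in val)
        models.append((press, val))
    models.sort(key=lambda m: m[0], reverse=True)  # largest PRESS first
    worst = models[: max(1, int(worst_fraction * n_models))]
    return [sum(i in val for _, val in worst) / len(worst) for i in range(n)]


# Hypothetical toy data: 20 samples near y = 2x + 1, sample 7 shifted by +15.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + 0.1 * ((i % 3) - 1) for i, x in enumerate(xs)]
ys[7] += 15.0
freq = validation_frequency_in_worst_models(xs, ys)
suspect = max(range(len(freq)), key=freq.__getitem__)
```

A normal sample should appear in the validation sets of the worst models at roughly the validation fraction (here 6/20), so a frequency far above that level marks a candidate outlier.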

6.
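Building models with the CARS (competitive adaptive reweighted sampling) variable selection method significantly improved the prediction accuracy of near-infrared models for protein and fat in liquid milk. Outlying samples were first removed by Monte Carlo sampling; the spectra were then mean-centered and denoised with Karl Norris filtering; CARS was used to select the variables closely related to the sample properties; and partial least squares (PLS) calibration models were built to predict protein and fat content and compared with PLS models built without variable selection. The calibration-set correlation coefficient (r2), the root mean square error of cross-validation (RMSECV), and the root mean square error of prediction (RMSEP) were used as criteria to determine the optimal modeling conditions for protein and fat. For the protein and fat calibration models, the correlation coefficients were 0.9750 and 0.9951, the RMSECV values were 0.1948 and 0.1363, and the RMSEP values were 0.1133 and 0.1401, respectively. The predictions were better than those of the PLS model without variable selection and of other variable selection methods, and the model was effectively simplified, making the approach suitable for rapid, non-destructive determination of fat and protein in liquid milk.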

7.
Two-phase flow of liquids in pipelines is a crucial subject in many industries, such as the chemical and petroleum industries. Accurate prediction of the pressure gradient leads to a better design of an energy-efficient transportation system. Although numerous studies on the prediction of two-phase flowing pressure drop have been reported in the literature, the accurate prediction of this parameter has been a topic of debate in many research areas. In this article, a novel model based on the least squares support vector machine (LSSVM) was proposed for calculating two-phase flowing pressure drop in horizontal pipes. The inputs of this model are oil and water superficial velocities, pipe diameter, pipe roughness, and oil viscosity. To develop and test the model, more than 700 experimental data points from the open literature were utilized. The results of the proposed model were compared against well-known empirical correlations. Statistical error analysis showed that the LSSVM model outperforms existing predictive models. Finally, an outlier diagnosis was performed to detect doubtful experimental data.

8.
Robust cross-validation of linear regression QSAR models   Cited by: 1 (self-citations: 0; citations by others: 1)
A quantitative structure-activity relationship (QSAR) model is typically developed to predict the biochemical activity of untested compounds from the compounds' molecular structures. "The gold standard" of model validation is the blindfold prediction when the model's predictive power is assessed from how well the model predicts the activity values of compounds that were not considered in any way during the model development/calibration. However, during the development of a QSAR model, it is necessary to obtain some indication of the model's predictive power. This is often done by some form of cross-validation (CV). In this study, the concepts of the predictive power and fitting ability of a multiple linear regression (MLR) QSAR model were examined in the CV context allowing for the presence of outliers. Commonly used predictive power and fitting ability statistics were assessed via Monte Carlo cross-validation when applied to percent human intestinal absorption, blood-brain partition coefficient, and toxicity values of saxitoxin QSAR data sets, as well as three known benchmark data sets with known outlier contamination. It was found that (1) a robust version of MLR should always be preferred over the ordinary-least-squares MLR, regardless of the degree of outlier contamination and that (2) the model's predictive power should only be assessed via robust statistics. The MATLAB and Java source code used in this study is freely available from the QSAR-BENCH section of www.dmitrykonovalov.org for academic use. The Web site also contains the Java-based QSAR-BENCH program, which could be run online via Java's Web Start technology (supporting Windows, Mac OSX, Linux/Unix) to reproduce most of the reported results or apply the reported procedures to other data sets.

9.
CO2 flooding accounts for a considerable proportion of gas flooding. Using CO2 as a gas displacement agent benefits both enhanced oil recovery (EOR) and the alleviation of the greenhouse effect through the permanent storage of CO2 in the crust. The minimum miscibility pressure (MMP) of CO2‐oil systems is a key factor affecting EOR, which determines the yield and economic benefit of crude oil recovery. Therefore, it is of great importance to use fast, accurate, and cheap prediction methods for MMP estimation. In the present study, to evaluate the reliability of four recently developed prediction models based on machine learning (i.e., neural network analysis (NNA), genetic function approximation (GFA), multiple linear regression (MLR), and partial least squares (PLS)), 136 sets of data were selected for calculation via outlier analysis from 147 sets of data. Afterwards, we compared the four models with existing prediction models from the literature. The analysis of correlation coefficients and multiple error functions shows that the four models can solve the MMP prediction problem well, and that the models using intelligent algorithms have higher prediction accuracy than the simple linear models. In addition, the intelligent methods based on similarity algorithms differ little from one another. Finally, a sensitivity analysis was conducted.

10.
This article studies calibration maintenance and transfer to build a statistical model that is able to predict analyte concentrations by a set of spectra. Noticing that the wavelength atoms are naturally ordered in a meaningful way, we propose a novel robust fused LASSO (RFL) based on high‐dimensional sparsity techniques and a recent Θ‐IPOD technique for robustification. This new approach can attain simultaneous wavelength selection and grouping as well as outlier identification, without any human intervention. An efficient and scalable algorithm is developed on the basis of the alternating direction method of multipliers. The obtained RFL model is sparse and shows improved prediction performance over the LASSO and ridge regression. Our results reveal that wavelengths can be combined into blocks, in a smart manner, to enhance the interpretability and reliability for super‐resolution spectral analysis. Copyright © 2013 John Wiley & Sons, Ltd.
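The abstract does not state its objective function. One generic way to write a fused LASSO with an explicit outlier term is given below purely as an orientation aid; the symbols (response vector y, spectra matrix X with p ordered wavelengths, coefficients β, per-sample outlier shifts γ, tuning parameters λ1 to λ3, and a generic sparsity penalty P) are assumptions of this sketch and are not taken from the paper.

```latex
\min_{\beta,\,\gamma}\;
\frac{1}{2}\,\lVert y - X\beta - \gamma \rVert_2^2
\;+\; \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
\;+\; \lambda_2 \sum_{j=2}^{p} \lvert \beta_j - \beta_{j-1} \rvert
\;+\; \sum_{i=1}^{n} P(\gamma_i;\, \lambda_3)
```

In this form the first penalty selects wavelengths, the second encourages neighbouring wavelengths to share a coefficient (grouping into blocks), and a nonzero γ_i marks sample i as an outlier.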

11.
12.
13.
14.
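Slagging in the furnace is one of the key factors affecting the reliable operation of thermal power units and gasification processes; accurate prediction of the ash fusion temperature allows the furnace outlet temperature to be adjusted in advance to avoid slagging. In this paper, laser-induced breakdown spectroscopy (LIBS) was used to collect the spectra of metallic elements in coal ash samples, and a random forest model, a support vector machine regression model, and a linear regression model relating the spectral line intensities of the metallic elements in coal ash to the ash fusion temperature were built to predict the ash fusion temperature directly. The full-spectrum data of the fly ash samples were preprocessed with an outlier removal algorithm based on the Mahalanobis distance (MD) and a sparse-matrix-based baseline estimation and denoising algorithm (BEADS). The mean relative error (MRE) of the random forest model in predicting the fly ash fusion temperature was 54.74%, that of the support vector machine regression model was 60.08%, and that of the linear regression model was 9.78%. The results show that the linear regression model predicts the ash fusion temperature more accurately.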

15.
Two novel algorithms which employ the idea of stacked generalization or stacked regression, stacked partial least squares (SPLS) and stacked moving‐window partial least squares (SMWPLS), are reported in the present paper. The new algorithms establish parallel, conventional PLS models based on all intervals of a set of spectra to take advantage of the information from the whole spectrum by incorporating parallel models in a way to emphasize intervals highly related to the target property. It is theoretically and experimentally illustrated that the predictive ability of these two stacked methods combining all subsets or intervals of the whole spectrum is never poorer than that of a PLS model based only on the best interval. These two stacking algorithms generate more parsimonious regression models with better predictive power than conventional PLS, and perform best when the spectral information is neither isolated to a single, small region, nor spread uniformly over the response. A simulation data set is employed in this work not only to demonstrate this improvement, but also to demonstrate that stacked regressions have the potential capability of predicting property information from an outlier spectrum in the prediction set. Moisture, oil, protein and starch in Cargill corn samples have been successfully predicted by these new algorithms, as well as hydroxyl number for different instruments of terpolymer samples including and excluding an outlier spectrum. Copyright © 2009 John Wiley & Sons, Ltd.

16.
A novel outlier detection method in partial least squares based on random sample consensus is proposed. The proposed algorithm repeatedly generates partial least squares solutions estimated from random samples and then tests each solution for the support from the complete dataset for consistency. A comparative study of the proposed method and leave-one-out cross-validation in outlier detection on simulated data and near-infrared data of pharmaceutical tablets is presented. In addition, a comparison between the proposed method and PLS, RSIMPLS, and PRM is provided. The obtained results demonstrate that the proposed method is highly efficient.
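The generate-and-test loop of random sample consensus can be sketched as below. This is an illustrative sketch only: a least-squares line replaces the PLS solution, and the function name, sample size, residual threshold, iteration count, and toy data are assumptions rather than values from the paper.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b*x (a stand-in for a PLS solution)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def ransac_outliers(xs, ys, sample_size=4, threshold=1.0, n_iter=100, seed=0):
    """Fit on random samples, score each fit by how many points of the complete
    data set it supports, and report the points outside the largest consensus set."""
    rng = random.Random(seed)
    n = len(xs)
    best = []
    for _ in range(n_iter):
        idx = rng.sample(range(n), sample_size)
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        consensus = [i for i in range(n)
                     if abs(ys[i] - (a + b * xs[i])) <= threshold]
        if len(consensus) > len(best):
            best = consensus
    return [i for i in range(n) if i not in best]


# Hypothetical toy data: an exact line y = 2x + 1 with two corrupted samples.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
ys[5] += 12.0
ys[14] -= 9.0
outliers = ransac_outliers(xs, ys)
```

Unlike leave-one-out cross-validation, each candidate model here is estimated from a small subset that has a good chance of containing no outliers at all, which is why several outliers cannot mask one another as easily.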

17.
Ortiz MC, Sarabia LA, Herrero A. Talanta, 2006, 70(3): 499-512
The validation of an analytical procedure means the evaluation of some performance criteria such as accuracy, sensitivity, linear range, capability of detection, selectivity, calibration curve, etc. This implies the use of different statistical methodologies, some of them related to statistical regression techniques, which may be robust or not. The presence of outlier data has a significant effect on the determination of sensitivity, linear range or capability of detection, amongst others, when these figures of merit are evaluated with non-robust methodologies. In this paper some of the robust methods used for calibration in analytical chemistry are reviewed: the Huber M-estimator; the Andrews, Tukey and Welsh GM-estimators; the fuzzy estimators; the constrained M-estimators, CM; the least trimmed squares, LTS. The paper also shows that the mathematical properties of the least median squares (LMS) regression can be of great interest in the detection of outlier data in chemical analysis. A comparative analysis is made of the results obtained by applying these regression methods to synthetic and real data. There is also a review of some applications where this robust regression works in a suitable and simple way that proves very useful to secure an objective detection of outliers. The use of a robust regression is recommended in ISO 5725-5.
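To make the LMS idea concrete, the sketch below fits a straight-line calibration by minimising the median of the squared residuals. As an assumption of this sketch, the search is restricted to lines passing through pairs of data points, which is a common approximation rather than an exact LMS solver; the function name, the toy calibration data, and the residual cutoff of 1.0 are hypothetical.

```python
from itertools import combinations
from statistics import median


def lms_line(xs, ys):
    """Approximate least-median-of-squares line: among all lines through two
    data points, return the (intercept, slope) with the smallest median
    squared residual over the whole data set."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as y = a + b*x
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        crit = median((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or crit < best[0]:
            best = (crit, a, b)
    return best[1], best[2]


# Hypothetical calibration data on y = 2x + 1 with two grossly wrong readings.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[8], ys[9] = 40.0, 45.0
a, b = lms_line(xs, ys)
outliers = [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - (a + b * x)) > 1.0]  # 1.0 is an arbitrary cutoff
```

Because the median ignores the largest residuals, the fitted line follows the majority of the points, and the outliers then stand out clearly in the residuals instead of being absorbed into the fit.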

18.
19.
The application of two-dimensional electrophoresis (2-DE) to mutation detection requires the capability to monitor each protein in a 2-DE pattern for significant changes in abundance indicative of a mutation event. Previously, mutation searches were done using a univariate outlier detection method in which each protein spot was considered independently in a classical outlier search. An alternative approach to analysis of 2-DE patterns for quantitative changes is a multivariate procedure which takes advantage of the observation that protein spots in a 2-DE pattern often represent correlated rather than independent measurements. We have compared the efficiency of univariate and multivariate procedures for mutation detection using data from the Argonne National Laboratory 2-DE database of mouse liver proteins. Analyses involving a total of over 1500 gels were performed to compare the performance of a multivariate method based on principal components analysis (PCA) with the univariate method. Up to 279 spots from each pattern were used for PCA. First, a simulation was performed to assess the detection efficiency of PCA for single protein spots decreased in abundance by 50%. Then, the ability to detect actual mutations was tested using eight confirmed mutations. Results show that, compared to a univariate approach to analysis of data from the mouse model system, the multivariate method increases the number of protein spots on each 2-DE pattern that can be monitored for quantitative changes indicative of mutations by compensating for variables that contribute to the background quantitative variability of protein spots.

20.
Datasets of molecular compounds often contain outliers, that is, compounds which are different from the rest of the dataset. Outliers, while often interesting, may affect data interpretation, model generation, and decision making, and therefore should be removed from the dataset prior to modeling efforts. Here, we describe a new method for the iterative identification and removal of outliers based on a k‐nearest neighbors optimization algorithm. We demonstrate for three different datasets that the removal of outliers using the new algorithm provides filtered datasets which are better than those provided by four alternative outlier removal procedures as well as by random compound removal in two important aspects: (1) they better maintain the diversity of the parent datasets; (2) they give rise to quantitative structure activity relationship (QSAR) models with much better prediction statistics. The new algorithm is, therefore, suitable for the pretreatment of datasets prior to QSAR modeling. © 2014 Wiley Periodicals, Inc.
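Iterative removal driven by nearest-neighbor distances can be sketched as follows. The abstract does not spell out its optimization criterion, so this sketch assumes a simple one: at each step, drop the compound whose mean distance to its k nearest neighbors is largest. The function names, the two-descriptor toy data, and k = 3 are likewise hypothetical.

```python
import math


def knn_score(points, i, k):
    """Mean Euclidean distance from point i to its k nearest neighbors."""
    dists = sorted(math.dist(points[i], points[j])
                   for j in range(len(points)) if j != i)
    return sum(dists[:k]) / k


def remove_outliers(points, n_remove, k=3):
    """Repeatedly remove the point with the largest k-nearest-neighbor score.
    Returns (kept indices, removed indices in order of removal)."""
    kept = list(range(len(points)))
    removed = []
    for _ in range(n_remove):
        current = [points[i] for i in kept]
        scores = [knn_score(current, pos, k) for pos in range(len(current))]
        worst = max(range(len(current)), key=scores.__getitem__)
        removed.append(kept.pop(worst))
    return kept, removed


# Hypothetical descriptor space: a 4 x 4 grid of compounds plus two distant ones.
points = [(float(x), float(y)) for x in range(4) for y in range(4)]
points += [(10.0, 10.0), (-8.0, 6.0)]
kept, removed = remove_outliers(points, n_remove=2)
```

Scores are recomputed after every removal, so a compound that only looked isolated because of an already removed neighbor is judged again against the remaining dataset.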

