Similar literature
 20 similar articles found (search time: 135 ms)
1.
Modeling quantitative structure–activity relationships (QSAR) is considered with an emphasis on prediction. An abundance of methods are available to develop such models. Using a harmonious approach that balances the bias and variance of predictions, the best calibration models are identified relative to the bias and variance criteria used. Criteria utilized to determine the adequacy of models are the root mean square error of calibration (RMSEC) and validation (RMSEV), the respective $R^2$ values, and the norm of the regression vector. QSAR data from the literature are used to demonstrate concepts. For these data sets and the criteria used, it is suggested that models obtained by ridge regression (RR) are more harmonious and parsimonious than models obtained by partial least squares (PLS) and principal component regression (PCR) when the data are mean-centered. The most harmonious RR models have the best bias/variance tradeoff, reflected by the smallest RMSEC, RMSEV, and regression vector norms and the largest calibration and validation $R^2$ values. The most parsimonious RR models have the smallest effective rank.
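The three kinds of criteria named in this abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so the definitions below are assumptions: RMSE divides by the number of samples (some authors subtract degrees of freedom), $R^2$ is taken as 1 − SS_res/SS_tot, and the norm is the Euclidean norm. The function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error; the divisor n is an assumption."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def vector_norm(b):
    """Euclidean norm of a regression vector (used as a variance indicator)."""
    return math.sqrt(sum(c * c for c in b))


# Illustrative numbers only, not data from the paper.
y_cal = [1.0, 2.0, 3.0]
y_cal_hat = [1.1, 1.9, 3.2]
rmsec = rmse(y_cal, y_cal_hat)          # RMSEC when applied to calibration data
r2_cal = r_squared(y_cal, y_cal_hat)
b_norm = vector_norm([0.3, -0.4])
```

Applying `rmse` to a separate validation set gives the corresponding RMSEV; a smaller `vector_norm` for comparable error is what the abstract describes as a better bias/variance balance.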

2.
In multivariate regression, it is often reported that wavelength selection can improve results. Improvement is often based solely on bias measures such as the root mean square error of calibration (RMSEC), the root mean square error of validation (RMSEV), and $R^2$ for the calibration and validation. In recent studies, it has been shown that when variance measures are included, Pareto optimal models can be determined. However, variance measures used to date do not provide the ability to choose wavelength subset models relative to full wavelength models when wavelength subset models may be the Pareto models. In this paper, simplex optimization is used with a more complete variance measure to generate Pareto optimal models. The standard basis set is used, as well as a basis set that includes the range and null space of the calibration spectra. Results show that it is possible to identify Pareto optimal models and that, if a wavelength subset is best, these are the models found. Regression coefficients for non-essential wavelengths are zero to near zero.
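To make "Pareto optimal" concrete, here is a small sketch that filters candidate models on one bias measure and one variance measure, both to be minimized. The abstract does not specify its "more complete variance measure", so the second value in each tuple is only a stand-in (for example, a regression vector norm); the model names and numbers are hypothetical.

```python
def pareto_front(models):
    """Return names of models not dominated in (bias, variance).

    Each model is a (name, bias_measure, variance_measure) tuple and smaller
    is better for both measures. A model is dominated if another model is no
    worse in both measures and strictly better in at least one.
    """
    front = []
    for name, bias, var in models:
        dominated = any(
            (b2 <= bias and v2 <= var) and (b2 < bias or v2 < var)
            for _, b2, v2 in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidates: (name, RMSEV, variance indicator).
candidates = [("A", 0.10, 5.0), ("B", 0.12, 3.0), ("C", 0.15, 4.0), ("D", 0.20, 2.0)]
front = pareto_front(candidates)
```

In this toy set, model C is dominated by B (worse on both measures), while A, B, and D each represent a different bias/variance tradeoff.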

3.
Partial least-squares (PLS) regression was used to generate various models for the determination of both the protein and the ash contents of wheat flours by using spectroscopic data in the mid-infrared region obtained with a horizontal attenuated total reflectance (HATR) accessory. One hundred samples of wheat flour were used as purchased in the market: 55 for constructing the calibration model and 45 as external samples. The protein content varied between 8.85 and 13.23% and the ash content between 0.330 and 1.287%, as determined by reference methods. Raw spectra, spectra corrected by multiplicative signal correction (MSC), and first and second derivative spectra were used as data for building the models. Different pre-treatments, such as mean centering and/or variance scaling (VS), were tested and compared. Very good models were built, as judged by the correlation coefficients ($R^2$), root mean square error of calibration (RMSEC), root mean square error of validation (RMSEV), and root mean square error of prediction (RMSEP) that were obtained. The best results were achieved with MSC-treated spectra.
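Since MSC gave the best results here, a minimal sketch of the correction may help. It assumes the common textbook form of MSC: each spectrum is regressed on a reference spectrum (here the mean spectrum, which is an assumption), and the fitted offset and slope are removed. The abstract does not state which reference or implementation the authors used.

```python
def msc(spectra):
    """Multiplicative signal correction against the mean spectrum.

    spectra: list of equal-length lists of absorbance values.
    Each spectrum s is modeled as s = intercept + slope * ref and corrected
    to (s - intercept) / slope. Assumes the reference is not constant and
    no fitted slope is zero.
    """
    n_pts = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_pts)]
    ref_mean = sum(ref) / n_pts
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_pts
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / sxx
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


# Two toy "spectra" that differ only by an offset and a scale factor.
base = [1.0, 2.0, 4.0, 3.0]
scaled = [2.0 * v + 1.0 for v in base]
corrected = msc([base, scaled])
```

Because the two toy spectra differ only by an additive and a multiplicative effect, both corrected spectra collapse onto the same curve (the mean spectrum), which is the behavior MSC is designed to produce.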

4.
Most multivariate calibration methods, such as partial least squares (PLS) or the Tikhonov regularization variant ridge regression (RR), require selection of tuning parameters. Tuning parameter values determine the direction and magnitude of the respective model vectors, thereby setting the resultant prediction abilities of the model vectors. Simultaneously, tuning parameter values establish the corresponding bias/variance and the underlying selectivity/sensitivity tradeoffs. Selection of the final tuning parameter is often accomplished through some form of cross-validation, and the resultant root mean square error of cross-validation (RMSECV) values are evaluated. However, selection of a “good” tuning parameter with this one model evaluation merit is almost impossible. Including additional model merits assists tuning parameter selection to provide better balanced models, as well as allowing for a reasonable comparison between calibration methods. Using multiple merits requires decisions to be made on how to combine and weight the merits into an information criterion, and an abundance of options are possible. This paper presents the sum of ranking differences (SRD) as a way to ensemble a collection of model evaluation merits varying across tuning parameters. It is shown that the SRD consensus ranking of model tuning parameters allows automatic selection of the final model, or a collection of models if so desired. Essentially, the user’s preference for the degree of balance between bias and variance ultimately decides the merits used in SRD and hence the tuning parameter values ranked lowest by SRD for automatic selection. The SRD process is also shown to allow simultaneous comparison of different calibration methods for a particular data set in conjunction with tuning parameter selection. Because SRD evaluates consistency across multiple merits, decisions on how to combine and weight merits are avoided. To demonstrate the utility of SRD, a near infrared spectral data set and a quantitative structure activity relationship (QSAR) data set are evaluated using PLS and RR.
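The basic SRD calculation can be sketched as follows. This is a simplified illustration, not the authors' exact procedure: it assumes a matrix with one row per merit and one column per candidate model (tuning parameter value), merits already scaled to comparable ranges, and a row-wise mean as the reference column. All of these choices, and the function names, are assumptions.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srd(matrix, reference=None):
    """Sum of ranking differences for each column of matrix.

    matrix[r][c] is merit r evaluated for model c. If no reference column is
    given, the row-wise mean is used (an assumption of this sketch).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if reference is None:
        reference = [sum(row) / n_cols for row in matrix]
    ref_ranks = average_ranks(reference)
    result = []
    for c in range(n_cols):
        col_ranks = average_ranks([matrix[r][c] for r in range(n_rows)])
        result.append(sum(abs(a - b) for a, b in zip(col_ranks, ref_ranks)))
    return result


# Three merits (rows) for three hypothetical models (columns).
merits = [[1.0, 3.0, 1.0],
          [2.0, 2.0, 2.0],
          [3.0, 1.0, 3.0]]
srd_values = srd(merits)
```

Columns with the smallest SRD rank their merits most like the reference, which is the sense in which the lowest-ranked tuning parameter values are selected.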

5.
A new procedure with high ability to enhance prediction of multivariate calibration models with a small number of interpretable variables is presented. The core of this methodology is to sort the variables from an informative vector, followed by a systematic investigation of PLS regression models with the aim of finding the most relevant set of variables by comparing the cross-validation parameters of the models obtained. In this work, seven main informative vectors, i.e., the regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc), and signal-to-noise ratios vector (StN), and their combinations were automated and tested with the main purpose of feature selection. Six data sets from different sources were employed to validate this methodology. They originated from near-infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure-activity relationships (QSAR), and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, the regression and NAS informative vectors from partial least squares (PLS) regression, both built using more latent variables than were used to build the model, were the best informative vectors for variable selection in most of the tested data sets. In all the applications, the selected variables were quite effective and useful for interpretation. Copyright © 2008 John Wiley & Sons, Ltd.
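The first step of the procedure, sorting variables by an informative vector, can be sketched with the simplest of the seven vectors, the correlation vector. The sketch below assumes the vector is the Pearson correlation of each variable with the response and that variables are ordered by its absolute value; the subsequent systematic PLS modeling and cross-validation comparison are not shown. Function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sort_variables_by_correlation(X, y):
    """Return column indices of X ordered by decreasing |correlation with y|.

    X is a list of samples (rows); each row is a list of variable values.
    """
    n_vars = len(X[0])
    corr = [pearson([row[j] for row in X], y) for j in range(n_vars)]
    return sorted(range(n_vars), key=lambda j: abs(corr[j]), reverse=True)


# Toy data: column 1 tracks y exactly, column 2 is anti-correlated, column 0 is weak.
X = [[2, 1, 5],
     [1, 2, 3],
     [3, 3, 4],
     [2, 4, 1]]
y = [1, 2, 3, 4]
order = sort_variables_by_correlation(X, y)
```

Models would then be built on growing subsets taken in this order (first the top variable, then the top two, and so on) and compared by their cross-validation results.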

6.
The main utility of QSAR models is their ability to predict activities/properties for new chemicals, and this external prediction ability is evaluated by means of various validation criteria. As a measure for such evaluation the OECD guidelines have proposed the predictive squared correlation coefficient $Q^2_{F1}$ (Shi et al.). However, other validation criteria have been proposed by other authors: the Golbraikh-Tropsha method, $r^2_m$ (Roy), $Q^2_{F2}$ (Schüürmann et al.), $Q^2_{F3}$ (Consonni et al.). In QSAR studies these measures are usually in accordance, though this is not always the case, thus doubts can arise when contradictory results are obtained. It is likely that none of the aforementioned criteria is the best in every situation, so a comparative study using simulated data sets is proposed here, using threshold values suggested by the proponents or those widely used in QSAR modeling. In addition, a different and simple external validation measure, the concordance correlation coefficient (CCC), is proposed and compared with other criteria. Huge data sets were used to study the general behavior of validation measures, and the concordance correlation coefficient was shown to be the most restrictive. On using simulated data sets of a more realistic size, it was found that CCC was broadly in agreement, about 96% of the time, with other validation measures in accepting models as predictive, and in almost all the examples it was the most precautionary. The proposed concordance correlation coefficient also works well on real data sets, where it seems to be more stable, and helps in making decisions when the validation measures are in conflict. Since it is conceptually simple, and given its stability and restrictiveness, we propose the concordance correlation coefficient as a complementary, or alternative, more prudent measure of a QSAR model to be externally predictive.
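For readers unfamiliar with these measures, the sketch below computes the concordance correlation coefficient and two of the $Q^2$ variants on an external set. The abstract gives no formulas, so the definitions used are the commonly cited ones and should be treated as assumptions: CCC follows Lin's form, $Q^2_{F1}$ references the training-set mean, and $Q^2_{F2}$ references the external-set mean.

```python
def ccc(y_obs, y_pred):
    """Concordance correlation coefficient between observed and predicted values."""
    n = len(y_obs)
    mo, mp = sum(y_obs) / n, sum(y_pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    sxx = sum((o - mo) ** 2 for o in y_obs)
    syy = sum((p - mp) ** 2 for p in y_pred)
    return 2.0 * sxy / (sxx + syy + n * (mo - mp) ** 2)


def q2_f1(y_ext, y_ext_pred, y_train_mean):
    """Q2_F1: residual sum of squares relative to deviations from the training mean."""
    ss_res = sum((o - p) ** 2 for o, p in zip(y_ext, y_ext_pred))
    ss_tot = sum((o - y_train_mean) ** 2 for o in y_ext)
    return 1.0 - ss_res / ss_tot


def q2_f2(y_ext, y_ext_pred):
    """Q2_F2: same as Q2_F1 but relative to the external-set mean."""
    ext_mean = sum(y_ext) / len(y_ext)
    return q2_f1(y_ext, y_ext_pred, ext_mean)


# Illustrative values only: predictions with a constant bias of +1.
observed = [1.0, 2.0, 3.0]
biased = [2.0, 3.0, 4.0]
ccc_biased = ccc(observed, biased)
```

The biased example shows why CCC is described as restrictive: the predictions are perfectly correlated with the observations, yet the systematic offset pulls CCC well below 1.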

7.
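High-dimensional feature selection and peptide QSAR modeling based on ridge regression and SVM   Total citations: 1 (self-citations: 0, citations by others: 1)
The absolute values of the weights estimated by ridge regression reflect, to some extent, the importance of the corresponding features. On this basis, a high-dimensional feature selection algorithm based on ridge regression (RR) and support vector machines (SVM) was developed. For two peptide systems, bitter-tasting dipeptides (BTT) and cytotoxic T lymphocyte (CTL) epitope nonapeptides, peptide structures were characterized directly by 531 physicochemical property parameters of the amino acids, giving 1062 and 4779 initial features, respectively. For the training set, the initial features were ranked by ridge regression and then introduced sequentially, stopping when the mean square error (MSE) of SVM leave-one-out cross-validation (LOOCV) rose significantly. Several rounds of last-place elimination then refined the selection further, leaving 7 and 18 retained features, respectively, each with a clear physicochemical meaning. Using the retained features and support vector regression (SVR), quantitative structure-activity relationship (QSAR) models were built on the training set and used to predict an independent test set. The fitting accuracy, LOOCV accuracy, and independent prediction accuracy were all better than results previously reported in the literature. The new method runs quickly, selects features with clear physicochemical meaning, and is highly interpretable. It has fairly broad application prospects in regression prediction for high-dimensional data, such as QSAR modeling of peptides and proteins.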

8.
9.
Selecting the most rigorous quantitative structure-activity relationship (QSAR) approaches is of great importance in the development of robust and predictive models of chemical toxicity. To address this issue in a systematic way, we have formed an international virtual collaboratory consisting of six independent groups with shared interests in computational chemical toxicology. We have compiled an aqueous toxicity data set containing 983 unique compounds tested in the same laboratory over a decade against Tetrahymena pyriformis. A modeling set including 644 compounds was selected randomly from the original set and distributed to all groups, which used their own QSAR tools for model development. The remaining 339 compounds in the original set (external set I) as well as 110 additional compounds (external set II) published recently by the same laboratory (after this computational study was already in progress) were used as two independent validation sets to assess the external predictive power of individual models. In total, our virtual collaboratory has developed 15 different types of QSAR models of aquatic toxicity for the training set. The internal prediction accuracy for the modeling set ranged from 0.76 to 0.93 as measured by the leave-one-out cross-validation correlation coefficient ($Q^2_{abs}$). The prediction accuracy for the external validation sets I and II ranged from 0.71 to 0.85 (linear regression coefficient $R^2_{absI}$) and from 0.38 to 0.83 (linear regression coefficient $R^2_{absII}$), respectively. The use of an applicability domain threshold implemented in most models generally improved the external prediction accuracy but at the same time led to a decrease in chemical space coverage. Finally, several consensus models were developed by averaging the predicted aquatic toxicity for every compound using all 15 models, with or without taking into account their respective applicability domains. We find that consensus models afford higher prediction accuracy for the external validation data sets with the highest space coverage as compared to individual constituent models. Our studies prove the power of a collaborative and consensual approach to QSAR model development. The best validated models of aquatic toxicity developed by our collaboratory (both individual and consensus) can be used as reliable computational predictors of aquatic toxicity and are available from any of the participating laboratories.
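The consensus step described above, averaging predictions across models with or without applicability domains, can be sketched as below. The data layout (one `(value, in_domain)` pair per model) and the choice to return `None` when no model covers a compound are assumptions of this sketch, not details from the abstract.

```python
def consensus_prediction(predictions, use_domain=True):
    """Average the predictions of several models for one compound.

    predictions: list of (value, in_domain) pairs, one per model.
    With use_domain=True, only models whose applicability domain covers the
    compound contribute; if none do, None is returned (compound not covered).
    """
    values = [v for v, inside in predictions if inside or not use_domain]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical predictions from three models for a single compound.
preds = [(1.0, True), (4.0, False), (3.0, True)]
with_domain = consensus_prediction(preds, use_domain=True)
without_domain = consensus_prediction(preds, use_domain=False)
```

Ignoring the domain flags keeps every compound covered, while honoring them can change the average or leave a compound without a prediction, which mirrors the accuracy versus coverage tradeoff noted in the abstract.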

10.
In this work, the ability of an electronic tongue based on Fourier-Transform Mid Infrared (FT-MIR) spectroscopy as a gustative sensor is assessed by emulating the responses of a tasting panel for the gustative mouthfeel “tannin amount”. The FT-MIR spectra were modeled against the sensory responses evaluated in 37 red wines by means of partial least squares (PLS) regression models. In order to find the wavenumbers most correlated with the sensory attribute, and thus to provide the best predictive models, six different variable selection techniques were tested. The iterative predictor weighting (IPW-PLS) technique showed the best results, with the smallest RMSEC and RMSECV values (0.07 and 0.13, respectively) using 20 selected wavenumbers. The coincident wavenumbers selected by the six variable selection techniques were interpreted based on the absorption bands of tannin, and then a calibration model using these wavenumbers was built to validate the interpretation made.

11.
12.
13.
14.
15.
Two-dimensional correlation spectroscopy (2DCOS) and near-infrared spectroscopy (NIRS) were used to determine the polyphenol content in oat grain. A partial least squares (PLS) algorithm was used to perform the calibration. A total of 116 representative oat samples from four locations in China were prepared and the corresponding near-infrared spectra were measured. Two-dimensional correlation spectroscopy was employed to select wavelength bands for the PLS regression model for the polyphenol determination. The number of PLS components and intervals was optimized according to the coefficient of determination ($R^2$) and root mean square error of cross validation (RMSECV) in the calibration set. The performance of the final model was evaluated using the correlation coefficient (R) and the root mean square error of validation (RMSEV) in the prediction set. The results showed that the band corresponding to the optimal calibration model was between 1350 and 1848 nm and that the optimal spectral preprocessing combination was second derivative with second smoothing. The optimal regression model was obtained with an $R^2$ of 0.8954 and an RMSECV of 0.06651 in the calibration set and an R of 0.9614 and an RMSEV of 0.04573 in the prediction set. These measurements reveal that the calibration model had qualified predictive accuracy. The results demonstrated that 2DCOS with PLS was a simple and rapid method for the quantitative determination of polyphenols in oats.

16.
A direct and reagent-free procedure for simultaneous determination of sodium lauryl ether sulfate (SLES), coconut diethanol amide (CDEA) and linear alkylbenzene sulfonate (LABS) in undiluted samples of hand dishwashing liquids has been developed. This determination was carried out by using attenuated total reflectance Fourier transform infrared spectrometry (ATR-FTIR) and multivariate analysis. An implementation of the PLS statistical approach to quantitative analysis of one nonionic and two anionic surfactants was applied to a set of mid-infrared spectra (1305-990 cm$^{-1}$) recorded for commercial detergent samples and ternary standard solutions. An orthogonal calibration design for three components and five levels for standards was employed. The number of factors and scans, and also the resolution, were optimized. Statistical parameters such as the root mean square error of calibration (RMSEC), root mean square error of cross-validation (RMSECV), standard error of prediction (SEP) and relative standard deviation (R.S.D.) were evaluated. These parameters were obtained as: RMSEC 0.13, 0.20 and 0.14, RMSEV 0.09, 0.17 and 0.04 and SEP 0.12, 0.39 and 0.18 (g per 100 g) for SLES, CDEA and LABS, respectively. R.S.D. values for five independent analyses were 1.69 for SLES, 3.76 for CDEA and 1.76 for LABS. The component linear correlation coefficients comparing actual and predicted concentrations of SLES, CDEA and LABS in some real samples were 0.9995, 0.9915 and 0.9974, respectively.

17.
18.
Topological indices (TIs) and atom pairs (APs) were used to develop quantitative structure-activity relationship (QSAR) models of a set of 58 dipeptide boronic acids, which are potent inhibitors of the proteasome and have found applications in the treatment of various types of cancer. Of the three linear regression methods used for QSAR development, viz., principal components regression (PCR), partial least squares (PLS), and ridge regression (RR), the last method gave the most satisfactory models, whereas the remaining two methods yielded poor models. RR results obtained in this paper using TIs and APs are comparable to the CoMFA and CoMSIA results reported in the literature for the same set of compounds.

19.
20.
A robust near infrared (NIR) method able to quantify the active content of pilot non-coated pharmaceutical pellets was developed. A calibration protocol was followed, involving two operators, independent pilot batches of non-coated pharmaceutical pellets and two different NIR acquisition temperatures. Prediction models based on Partial Least Squares (PLS) regression were then built. Afterwards, the NIR method was fully validated for an active content ranging from 80 to 120% of the usual active content, using new independent pilot batches to evaluate the adequacy of the method for its final purpose. Conventional criteria such as the $R^2$, the Root Mean Square Error of Calibration (RMSEC), the Root Mean Square Error of Prediction (RMSEP) and the number of PLS factors enabled the selection of models with good predictive potential. However, such criteria sometimes fail to choose the most fit-for-purpose model. Therefore, a novel approach based on accuracy profiles of the validation results was used, providing a visual representation of the actual and future performances of the models. Following this approach, the prediction model using the signal pre-treatment Multiplicative Scatter Correction (MSC) was chosen, as it showed the best ability to quantify the active content accurately over the 80-120% active content range. The reliability of the NIR method was tested with new pilot batches of non-coated pharmaceutical pellets containing 90 and 110% of the usual active content, with blends of validation batches and industrial batches. All those batches were also analyzed by the HPLC reference method and relative errors were calculated: the results showed low relative errors in full accordance with the results obtained during the validation of the method, indicating the reliability of the NIR method and its interchangeability with the HPLC reference method.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号