Similar literature
 20 similar articles found (search time: 494 ms)
1.
2.
3.
In the current work we investigated 3D-QSAR data using coupled leave-several-out (LSO) and leave-one-out (LOO) cross-validation (CV) procedures. We verified this scheme using both simulated data and real 3D QSAR data describing a series of CoMFA steroids, heterocyclic azo dyes and styrylquinoline HIV integrase inhibitors. Unlike standard analyses, this technique does not characterize an individual method by a single performance metric but screens the whole possible modeling space by sampling different molecules into the training and test sets. This allowed us to discuss the information contained in the estimators that validate cross-validation procedures and to compare the efficiency of several 3D QSAR schemes, in particular Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Surface Analysis (CoMSA). Moreover, it provides some general knowledge about the predictive and modeling ability of 3D QSAR methods.
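The sampling idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes a single descriptor fitted by ordinary least squares instead of a CoMFA/CoMSA model, and the names `fit_line`, `q2_loo` and `sample_modeling_space` are hypothetical. Each round leaves several compounds out as a test set (LSO), computes the LOO q2 on the remaining training set, and records the test-set RMSE, so that a method is described by a distribution of outcomes instead of a single number.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one descriptor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def q2_loo(xs, ys):
    """Leave-one-out q2 = 1 - PRESS / SS_total, computed on the given set."""
    n = len(xs)
    my = sum(ys) / n
    press = 0.0
    for i in range(n):
        # Refit the model without compound i, then predict compound i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / sum((y - my) ** 2 for y in ys)


def sample_modeling_space(xs, ys, n_test, n_rounds, seed=0):
    """Repeat random train/test splits; return one (q2, test RMSE) pair per round."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    results = []
    for _ in range(n_rounds):
        test = sorted(rng.sample(indices, n_test))  # leave-several-out set
        held_out = set(test)
        train_x = [xs[i] for i in indices if i not in held_out]
        train_y = [ys[i] for i in indices if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        sq_err = sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test)
        results.append((q2_loo(train_x, train_y), (sq_err / n_test) ** 0.5))
    return results
```

Plotting the collected pairs for two modeling schemes on the same splits would give the kind of side-by-side comparison the abstract describes.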

4.
5.
6.
7.
Selecting the most rigorous quantitative structure-activity relationship (QSAR) approaches is of great importance in the development of robust and predictive models of chemical toxicity. To address this issue in a systematic way, we have formed an international virtual collaboratory consisting of six independent groups with shared interests in computational chemical toxicology. We have compiled an aqueous toxicity data set containing 983 unique compounds tested in the same laboratory over a decade against Tetrahymena pyriformis. A modeling set of 644 compounds was selected randomly from the original set and distributed to all groups, which used their own QSAR tools for model development. The remaining 339 compounds in the original set (external set I) as well as 110 additional compounds (external set II) published recently by the same laboratory (after this computational study was already in progress) were used as two independent validation sets to assess the external predictive power of individual models. In total, our virtual collaboratory has developed 15 different types of QSAR models of aquatic toxicity for the training set. The internal prediction accuracy for the modeling set ranged from 0.76 to 0.93 as measured by the leave-one-out cross-validation correlation coefficient (Q_abs^2). The prediction accuracy for the external validation sets I and II ranged from 0.71 to 0.85 (linear regression coefficient R_absI^2) and from 0.38 to 0.83 (linear regression coefficient R_absII^2), respectively. The use of an applicability domain threshold implemented in most models generally improved the external prediction accuracy but at the same time led to a decrease in chemical space coverage. Finally, several consensus models were developed by averaging the predicted aquatic toxicity for every compound using all 15 models, with or without taking into account their respective applicability domains. We find that consensus models afford higher prediction accuracy for the external validation data sets with the highest space coverage as compared to individual constituent models. Our studies prove the power of a collaborative and consensual approach to QSAR model development. The best validated models of aquatic toxicity developed by our collaboratory (both individual and consensus) can be used as reliable computational predictors of aquatic toxicity and are available from any of the participating laboratories.
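The consensus step described above (averaging per-compound predictions, with or without applicability domains) might look roughly like the sketch below. The function names and the toy numbers are hypothetical, and representing each model's applicability domain as a per-compound boolean is an assumption made for illustration only.

```python
from statistics import mean


def consensus_predict(predictions, in_domain=None):
    """Average the per-model predictions for one compound.

    predictions: list of floats, one per model.
    in_domain:   optional list of bools; when given, only models whose
                 applicability domain covers the compound contribute.
    Returns the mean prediction, or None if no model covers the compound.
    """
    if in_domain is None:
        used = list(predictions)
    else:
        used = [p for p, ok in zip(predictions, in_domain) if ok]
    return mean(used) if used else None


def coverage(consensus_values):
    """Fraction of compounds that received a consensus prediction."""
    return sum(v is not None for v in consensus_values) / len(consensus_values)


# Hypothetical predictions from three models for three compounds.
preds = [[1.0, 1.2, 0.8], [2.0, 2.4, 1.6], [0.5, 0.7, 0.9]]
# Hypothetical applicability-domain flags (model covers compound or not).
domains = [[True, True, False], [True, True, True], [False, False, False]]

plain = [consensus_predict(p) for p in preds]
with_ad = [consensus_predict(p, d) for p, d in zip(preds, domains)]
```

In this toy case the third compound lies outside every model's domain, so the domain-aware consensus leaves it unpredicted and coverage drops, which mirrors the accuracy versus coverage trade-off mentioned in the abstract.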

8.
9.
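Principal component analysis was performed separately on 41 nonzero Randic molecular profile descriptors, 44 nonzero eigenvalue-based index descriptors and 47 nonzero walk and path count descriptors of the 20 natural amino acids, yielding a new amino acid descriptor, SVREW. The descriptor was applied to the structural characterization of angiotensin-converting enzyme (ACE) inhibitory dipeptides and tripeptides, bitter dipeptides and tetrapeptides, oxytocin analogues, and HLA-A*0201-restricted CTL epitope peptides. Quantitative structure-activity relationship models were built by multiple linear regression (MLR), and model stability was checked by both internal and external validation. For the ACE-inhibitory dipeptides, ACE-inhibitory tripeptides, bitter dipeptides, bitter tetrapeptides, oxytocin analogues and HLA-A*0201-restricted CTL epitope peptides, the multiple correlation coefficients (R2cum) of the models were 0.994, 0.797, 0.948, 0.878, 0.686 and 0.720, respectively; the leave-one-out cross-validated multiple correlation coefficients (R2cv) were 0.955, 0.859, 0.879, 0.958, 0.796 and 0.843, respectively; and the external validation correlation coefficients (Q2ext) were 0.990, 0.954, 0.890, 0.950, 0.748 and 0.773, respectively. The results indicate that models built with the SVREW descriptor for peptide structure characterization have good stability and predictive ability. SVREW is expected to become an effective structural characterization method in peptide QSAR studies and may guide the discovery and study of new drugs.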

10.
Quantitative Structure–Activity Relationship (QSAR) models are used increasingly to screen chemical databases and/or virtual chemical libraries for potentially bioactive molecules. These developments emphasize the importance of rigorous model validation to ensure that the models have acceptable predictive power. Using the k-nearest neighbors (kNN) variable selection QSAR method for the analysis of several datasets, we have demonstrated recently that the widely accepted leave-one-out (LOO) cross-validated R2 (q2) is an inadequate characteristic to assess the predictive ability of the models [Golbraikh, A., Tropsha, A. Beware of q2! J. Mol. Graphics Mod. 20, 269-276, (2002)]. Herein, we provide additional evidence that there exists no correlation between the values of q2 for the training set and the accuracy of prediction (R2) for the test set, and argue that this observation is a general property of any QSAR model developed with LOO cross-validation. We suggest that external validation using rationally selected training and test sets provides a means to establish a reliable QSAR model. We propose several approaches to the division of experimental datasets into training and test sets and apply them in QSAR studies of 48 functionalized amino acid anticonvulsants and a series of 157 epipodophyllotoxin derivatives with antitumor activity. We formulate a set of general criteria for the evaluation of the predictive power of QSAR models.
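To make the two statistics compared in this abstract concrete, here is a small sketch that computes a LOO q2 on a training set and an R2 on an external test set. It assumes a one-descriptor least-squares model in place of kNN QSAR, and it uses one common convention for each statistic (q2 = 1 - PRESS/SS about the training mean; external R2 = 1 - SSE/SS about the test mean). The paper's exact definitions may differ, and all function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one descriptor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def q2_loo(xs, ys):
    """Internal statistic: LOO cross-validated q2 on the training set."""
    n = len(xs)
    my = sum(ys) / n
    press = 0.0
    for i in range(n):
        # The model is refitted without compound i before predicting it.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / sum((y - my) ** 2 for y in ys)


def r2_external(train_x, train_y, test_x, test_y):
    """External statistic: fit on the training set, score on unseen compounds."""
    slope, intercept = fit_line(train_x, train_y)
    mean_test = sum(test_y) / len(test_y)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(test_x, test_y))
    sst = sum((y - mean_test) ** 2 for y in test_y)
    return 1.0 - sse / sst
```

Because `q2_loo` never sees the test compounds, a high value from it says nothing by itself about `r2_external`; reporting both side by side is the kind of check the abstract argues for.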

11.
It is becoming increasingly common in quantitative structure/activity relationship (QSAR) analyses to use external test sets to evaluate the likely stability and predictivity of the models obtained. In some cases, such as those involving variable selection, an internal test set (i.e., a cross-validation set) is also used. Care is sometimes taken to ensure that the subsets used exhibit response and/or property distributions similar to those of the data set as a whole, but more often the individual observations are simply assigned 'at random'. In the special case of MLR without variable selection, it can be analytically demonstrated that this strategy is inferior to others. Most particularly, D-optimal design performs better if the form of the regression equation is known and the variables involved are well behaved. This report introduces an alternative, non-parametric approach termed 'boosted leave-many-out' (boosted LMO) cross-validation. In this method, relatively small training sets are chosen by applying optimizable k-dissimilarity selection (OptiSim) using a small subsample size (k = 4, in this case), with the unselected observations being reserved as a test set for the corresponding reduced model. Predictive errors for the full model are then estimated by aggregating results over several such analyses. The countervailing effects of training and test set size, diversity, and representativeness on PLS model statistics are described for CoMFA analysis of a large data set of COX2 inhibitors.

12.
13.
Assessing model fit by cross-validation   Total citations: 8 (self-citations: 0; citations by others: 8)
When QSAR models are fitted, it is important to validate any fitted model, that is, to check that it is plausible that its predictions will carry over to fresh data not used in the model fitting exercise. There are two standard ways of doing this: using a separate hold-out test sample, and the computationally much more burdensome leave-one-out cross-validation, in which the entire pool of available compounds is used both to fit the model and to assess its validity. We show by theoretical argument and empirical study of a large QSAR data set that when the available sample size is small (in the dozens or scores rather than the hundreds), holding a portion of it back for testing is wasteful, and that it is much better to use cross-validation, provided that it is done properly.

14.
15.
A quantitative structure-activity relationship (QSAR) of a series of benzothiazole derivatives showing potent and selective cytotoxicity against a tumorigenic cell line has been studied using density functional theory (DFT), molecular mechanics (MM) and statistical methods, and the QSAR equation was established via correlation analysis and stepwise regression analysis. A new scheme for determining outliers by the "leave-one-out" (LOO) cross-validation coefficient (q_{n-i}^2) was suggested and successfully used. In the established optimal equation (excluding two outliers), the steric parameter (MRR) and the net charge (QFR) of the first atom of the substituent (R), as well as the square of the hydrophobic parameter (lgP)^2 of the whole molecule, are the main independent factors contributing to the anticancer activities of the compounds. The fitting correlation coefficient (R2) and the cross-validation coefficient (q2) are 0.883 and 0.797, respectively. This indicates that the model is statistically significant and has excellent predictive ability. Based on the QSAR studies, 4 new compounds with high predicted anticancer activities have been theoretically designed, and they are expected to be confirmed experimentally.

16.
17.
18.
19.
Traditionally, QSAR and QSPR models have been fitted by splitting the available compounds into separate learning and validation sets. The model is then fitted to the learning set and assessed using the validation set. Cross-validation (CV) uses all available compounds for both purposes, so that the full body of available information is brought to bear on both the learning and the validation portions of the study. The price paid for this additional information is a substantially greater computational load. A common mistake in using CV is to omit some of the repetitive computations. This mistake leads to substantial bias in the assessment. A hydroxyl radical reaction rate dataset is used to illustrate the superiority of CV and the pitfalls from its improper execution when modeling using nearest neighbors, paralleling behavior in the well-studied linear model setting.
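One way the mistake described here can arise is sketched below. The abstract does not say which computations are omitted, so this example assumes it is descriptor selection: the improper variant picks the best descriptor once on all compounds and reuses it in every fold, while the proper variant repeats the selection inside each fold. A one-descriptor linear fit stands in for the nearest-neighbor models of the paper, and all names and the random demo data are hypothetical.

```python
import random


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab / (saa * sbb) ** 0.5


def select_best(X, y):
    """Index of the descriptor column most correlated (in magnitude) with y."""
    return max(range(len(X[0])),
               key=lambda j: abs(pearson([row[j] for row in X], y)))


def fit_line(xs, ys):
    """Ordinary least squares for one descriptor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_q2(X, y, reselect):
    """LOO q2; reselect=True repeats descriptor selection inside every fold."""
    n = len(y)
    my = sum(y) / n
    fixed = None if reselect else select_best(X, y)  # improper: uses all data
    press = 0.0
    for i in range(n):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        j = select_best(X_tr, y_tr) if reselect else fixed
        slope, intercept = fit_line([row[j] for row in X_tr], y_tr)
        press += (y[i] - (slope * X[i][j] + intercept)) ** 2
    return 1.0 - press / sum((v - my) ** 2 for v in y)


# Demo on pure noise: 20 compounds, 50 random descriptors, random response.
rng = random.Random(0)
X_noise = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(20)]
y_noise = [rng.gauss(0, 1) for _ in range(20)]
print("selection inside CV: ", loo_q2(X_noise, y_noise, reselect=True))
print("selection outside CV:", loo_q2(X_noise, y_noise, reselect=False))
```

With noise data like this, selecting the descriptor outside the loop lets information from the left-out compound leak into the model, which is the source of the optimistic bias; the exact numbers depend on the random draw.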

20.
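The amino acid structural descriptor SVHEHS was used to characterize the sequences of competitive angiotensin I-converting enzyme (ACE) inhibitory dipeptides, tripeptides and tetrapeptides, and multiple linear regression (MLR) models relating structure to activity were then built. For the ACE-inhibitory dipeptide model, the correlation coefficient, cross-validated correlation coefficient, root mean square error and external validation correlation coefficient were 0.851, 0.781, 0.327 and 0.792, respectively; for the tripeptide model they were 0.805, 0.717, 0.339 and 0.817; and for the tetrapeptide model they were 0.792, 0.553, 0.393 and 0.630. The results show that MLR models of ACE-inhibitory peptides built with this descriptor have good fitting and predictive ability and explain well the relationship between the activity and structure of ACE-inhibitory peptides.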
