Similar literature
 20 similar articles found (search time: 31 ms)
1.
Modeling quantitative structure-activity relationships (QSAR) is considered with an emphasis on prediction. An abundance of methods is available to develop such models. Using a harmonious approach that balances the bias and variance of predictions, the best calibration models are identified relative to the bias and variance criteria used. The criteria used to determine the adequacy of models are the root mean square error of calibration (RMSEC) and validation (RMSEV), the respective R² values, and the norm of the regression vector. QSAR data from the literature are used to demonstrate concepts. For these data sets and criteria, it is suggested that models obtained by ridge regression (RR) are more harmonious and parsimonious than models obtained by partial least squares (PLS) and principal component regression (PCR) when the data are mean-centered. The most harmonious RR models have the best bias/variance tradeoff, reflected by the smallest RMSEC, RMSEV, and regression vector norms and the largest calibration and validation R² values. The most parsimonious RR models have the smallest effective rank.
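
As a rough illustration of the criteria named in this abstract, the sketch below computes RMSEC, RMSEV, R² and the size of the regression coefficient for a ridge model with a single mean-centered predictor, where the ridge estimate reduces to b = x'y / (x'x + λ). The data values, the function names and the list of λ values are hypothetical and are not taken from the cited study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def ridge_slope(x, y, lam):
    """Ridge estimate for one mean-centered predictor.

    Returns the slope together with the calibration means of x and y,
    which are needed to predict new samples.
    """
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / (sxx + lam), mx, my

def predict(x, slope, mx, my):
    """Predict y for new x values using the calibration means."""
    return [my + slope * (a - mx) for a in x]

# Hypothetical calibration and validation data (illustration only).
x_cal = [1.0, 2.0, 3.0, 4.0, 5.0]
y_cal = [1.1, 1.9, 3.2, 3.8, 5.1]
x_val = [1.5, 2.5, 3.5]
y_val = [1.4, 2.6, 3.4]

for lam in (0.0, 1.0, 10.0):
    b, mx, my = ridge_slope(x_cal, y_cal, lam)
    rmsec = rmse(y_cal, predict(x_cal, b, mx, my))
    rmsev = rmse(y_val, predict(x_val, b, mx, my))
    # abs(b) is the norm of the (one-element) regression vector.
    print(f"lambda={lam:5.1f}  |b|={abs(b):.3f}  RMSEC={rmsec:.3f}  RMSEV={rmsev:.3f}")
```

Increasing λ shrinks the coefficient (a variance indicator) while RMSEC (a bias indicator) can only grow, which is the tradeoff that a combined bias/variance criterion tries to balance.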

2.
In multivariate regression, it is often reported that wavelength selection can improve results. Improvement is often based solely on bias measures such as the root mean square error of calibration (RMSEC), the root mean square error of validation (RMSEV), and R² for the calibration and validation. Recent studies have shown that when variance measures are included, Pareto optimal models can be determined. However, the variance measures used to date do not make it possible to choose wavelength subset models over full wavelength models when wavelength subset models may be the Pareto models. In this paper, simplex optimization is used with a more complete variance measure to generate Pareto optimal models. The standard basis set is used as well as a basis set that includes the range and null space of the calibration spectra. Results show that it is possible to identify Pareto optimal models and that, if a wavelength subset is best, these are the models found. Regression coefficients for non-essential wavelengths are zero or near zero.

3.
Two-dimensional correlation spectroscopy (2DCOS) and near-infrared spectroscopy (NIRS) were used to determine the polyphenol content in oat grain. A partial least squares (PLS) algorithm was used to perform the calibration. A total of 116 representative oat samples from four locations in China were prepared and the corresponding near-infrared spectra were measured. Two-dimensional correlation spectroscopy was employed to select wavelength bands for the PLS regression model for the polyphenol determination. The number of PLS components and intervals was optimized according to the coefficient of determination (R²) and root mean square error of cross validation (RMSECV) in the calibration set. The performance of the final model was evaluated using the correlation coefficient (R) and the root mean square error of validation (RMSEV) in the prediction set. The results showed that the band corresponding to the optimal calibration model was between 1350 and 1848 nm and that the optimal spectral preprocessing combination was second derivative with second smoothing. The optimal regression model was obtained with an R² of 0.8954 and an RMSECV of 0.06651 in the calibration set and an R of 0.9614 and an RMSEV of 0.04573 in the prediction set. These measurements reveal that the calibration model had qualified predictive accuracy. The results demonstrated that 2DCOS with PLS was a simple and rapid method for the quantitative determination of polyphenols in oats.

4.
Partial least-squares (PLS) regression was used to generate various models for the determination of both the protein and the ash contents of wheat flours by using spectroscopic data in the mid-infrared region obtained with a horizontal attenuated total reflectance (HATR) accessory. One hundred samples of wheat flour were used as purchased in the market: 55 for constructing the calibration model and 45 as external samples. The protein content varied between 8.85 and 13.23% and the ash content between 0.330 and 1.287%, as determined by reference methods. Raw spectra, spectra corrected by multiplicative signal correction (MSC), and first and second derivative spectra were used as data for building the models. Different pre-treatments, such as mean-centering and/or variance scaling (VS), were tested and compared. Very good models were built, as judged by the correlation coefficients (R²), root mean square error of calibration (RMSEC), root mean square error of validation (RMSEV) and root mean square error of prediction (RMSEP) that were obtained. The best results were achieved with MSC-treated spectra.

5.
6.
A QSAR study on a series of pyrimidinyl and triazinyl amines was performed to explore the physico-chemical parameters responsible for their anti-HIV activity and cytotoxicity. Physico-chemical parameters were calculated using WIN CAChe 6.1. Stepwise multiple linear regression analysis was carried out to derive QSAR models, which were further evaluated for statistical significance and predictive power by internal and external validation. The selected best QSAR models showed correlation coefficients R of 0.914 and 0.901, and cross-validated squared correlation coefficients Q² of 0.685 and 0.691, for anti-HIV activity and cytotoxicity, respectively. The developed QSAR model indicates that the hydrophobicity of the whole molecule plays an important role in the anti-HIV activity and cytotoxicity of pyrimidinyl and triazinyl amine derivatives. When hydrophobicity is increased, the anti-HIV activity of the present series of compounds decreases, leading to high cytotoxicity.

7.
8.
9.
The development of robust QSAR models to predict the activity of β-secretase inhibitors is an area of interest due to the increase of Alzheimer’s disease in the global population. In this paper, we present a proposal based on the use of relative distance matrices as input data to the QSAR algorithms. These matrices store measurements of distances between the structural characteristics of pairs of molecules and between the molecules and a structural pattern extracted from the whole data set, thus efficiently representing a correlation between structural changes and activity. For the building of the classification and regression models, support vector machine, tree complex and Gaussian process algorithms have been used; for the validation of the models, cross-validation, bootstrapping and y-randomizing techniques have been applied. The results obtained are close to 100% in accuracy and area under the receiver operating characteristic curve in classification, and close to 1.0 for r² and 0.1 for root mean square error in regression, in training and in external validation, proving the ‘goodness’ of the proposal.

10.
11.
Quantitative Structure–Activity Relationship (QSAR) models are used increasingly to screen chemical databases and/or virtual chemical libraries for potentially bioactive molecules. These developments emphasize the importance of rigorous model validation to ensure that the models have acceptable predictive power. Using the k nearest neighbors (kNN) variable selection QSAR method for the analysis of several datasets, we have demonstrated recently that the widely accepted leave-one-out (LOO) cross-validated R² (q²) is an inadequate characteristic to assess the predictive ability of the models [Golbraikh, A., Tropsha, A. Beware of q2! J. Mol. Graphics Mod. 20, 269-276, (2002)]. Herein, we provide additional evidence that there exists no correlation between the values of q² for the training set and the accuracy of prediction (R²) for the test set, and argue that this observation is a general property of any QSAR model developed with LOO cross-validation. We suggest that external validation using rationally selected training and test sets provides a means to establish a reliable QSAR model. We propose several approaches to the division of experimental datasets into training and test sets and apply them in QSAR studies of 48 functionalized amino acid anticonvulsants and a series of 157 epipodophyllotoxin derivatives with antitumor activity. We formulate a set of general criteria for the evaluation of the predictive power of QSAR models.
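
To make the two quantities contrasted in this abstract concrete, the sketch below computes a leave-one-out q² on a training set and an R² on a separate test set for a one-descriptor linear model. It assumes the common definitions q² = 1 - PRESS / SS_tot and R² = 1 - SS_res / SS_tot; the cited authors may define their test-set statistic differently, and all names and data here are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept (x values must vary)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def loo_q2(x, y):
    """Leave-one-out cross-validated q2 on the training set."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        # Refit the model without sample i, then predict sample i.
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_rest, y_rest)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / ss_tot

def external_r2(x_train, y_train, x_test, y_test):
    """R2 of test-set predictions from a model fitted on the training set only."""
    slope, intercept = fit_line(x_train, y_train)
    mean_y = sum(y_test) / len(y_test)
    ss_res = sum((t - (slope * a + intercept)) ** 2 for a, t in zip(x_test, y_test))
    ss_tot = sum((t - mean_y) ** 2 for t in y_test)
    return 1.0 - ss_res / ss_tot

# Hypothetical descriptor/activity values (illustration only).
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
x_test = [1.5, 3.5, 5.5]
y_test = [1.6, 3.4, 5.7]

print("q2 (LOO, training set):", round(loo_q2(x_train, y_train), 3))
print("R2 (external test set):", round(external_r2(x_train, y_train, x_test, y_test), 3))
```

The two numbers are computed from different samples, so a high q² on its own says nothing definite about the test-set R²; that is the reason the abstract argues for external validation.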

12.
Most multivariate calibration methods require selection of tuning parameters, such as partial least squares (PLS) or the Tikhonov regularization variant ridge regression (RR). Tuning parameter values determine the direction and magnitude of the respective model vectors, thereby setting the resultant prediction abilities of the model vectors. Simultaneously, tuning parameter values establish the corresponding bias/variance and the underlying selectivity/sensitivity tradeoffs. Selection of the final tuning parameter is often accomplished through some form of cross-validation, and the resultant root mean square error of cross-validation (RMSECV) values are evaluated. However, selection of a “good” tuning parameter with this single model evaluation merit is almost impossible. Including additional model merits assists tuning parameter selection to provide better balanced models and allows for a reasonable comparison between calibration methods. Using multiple merits requires decisions to be made on how to combine and weight the merits into an information criterion. An abundance of options is possible. Presented in this paper is the sum of ranking differences (SRD) to ensemble a collection of model evaluation merits varying across tuning parameters. It is shown that the SRD consensus ranking of model tuning parameters allows automatic selection of the final model, or a collection of models if so desired. Essentially, the user’s preference for the degree of balance between bias and variance ultimately decides the merits used in SRD and, hence, the tuning parameter values ranked lowest by SRD for automatic selection. The SRD process is also shown to allow simultaneous comparison of different calibration methods for a particular data set in conjunction with tuning parameter selection. Because SRD evaluates consistency across multiple merits, decisions on how to combine and weight merits are avoided.
To demonstrate the utility of SRD, a near infrared spectral data set and a quantitative structure activity relationship (QSAR) data set are evaluated using PLS and RR.
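
The ranking step behind SRD can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited paper: rows are model merits, columns are candidate tuning parameter values, the reference ranking is taken from the row-wise mean, and the merits are assumed to have been scaled already so that rows are comparable and smaller means better. The function names and the example matrix are hypothetical.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def srd(matrix):
    """Sum of ranking differences for each column of matrix[row][column].

    Each column is ranked over the rows and compared with the ranking of a
    reference column (here the row-wise mean, an assumption of this sketch).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    reference = [sum(row) / n_cols for row in matrix]
    ref_ranks = average_ranks(reference)
    scores = []
    for c in range(n_cols):
        col_ranks = average_ranks([matrix[r][c] for r in range(n_rows)])
        scores.append(sum(abs(a - b) for a, b in zip(col_ranks, ref_ranks)))
    return scores

# Hypothetical scaled merits: 3 merits (rows) x 3 candidate models (columns).
merits = [
    [1.0, 1.0, 3.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 1.0],
]
print(srd(merits))  # columns 1 and 2 agree with the consensus, column 3 does not
```

A column whose ranking matches the reference gets an SRD of zero, so candidates with the lowest SRD are the ones most consistent with the consensus across merits.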

13.
Journal of the Chinese Chemical Society, 2018, 65(5): 567-577
Calpeptin analogs show anticancer properties through inhibition of calpain. In this work, we applied a quantitative structure–activity relationship (QSAR) model to 34 calpeptin derivatives to select the most appropriate compound. QSAR was employed to generate the models and predict the more significant compounds in a series of calpeptin derivatives. The HyperChem, Gaussian 09, and Dragon software programs were used for geometry optimization of the molecules. The 2D and 3D molecular structures were drawn with ChemDraw (Ultra 16.0) and Chem3D (Pro 16.0) software. The Unscrambler program was used for the analysis of data. Multiple linear regression (MLR-MLR), partial least-squares (MLR-PLS1), principal component regression (MLR-PCR), a genetic algorithm-artificial neural network (GA-ANN), and a novel similarity analysis-artificial neural network (SA-ANN) method were used to create QSAR models. Among the three MLR models, MLR-MLR provided better statistical parameters. The R² and RMSE of the prediction were estimated as 0.8248 and 0.26, respectively. Nevertheless, the model constructed using GA-ANN revealed the best statistical parameters among the studied methods (R² test = 0.9643, RMSE test = 0.0155, R² train = 0.9644, RMSE train = 0.0139). The GA-ANN model is found to be the most favorable among the statistical methods and can be employed for designing new calpeptin analogs as potent calpain inhibitors in cancer treatment.

14.
Using 84 structurally diverse and experimentally validated LSD1/KDM1A inhibitors, quantitative structure–activity relationship (QSAR) models were built according to OECD requirements. In the QSAR analysis, certainly significant and understated pharmacophoric features were identified as critical for LSD1 inhibition, such as a ring carbon atom exactly six bonds from a nitrogen atom, partial charges of lipophilic atoms within eight bonds from a ring sulphur atom, and a non-ring oxygen atom exactly nine bonds from the amide nitrogen. The genetic algorithm–multi-linear regression (GA-MLR) and double cross-validation criteria were used to create robust QSAR models with high predictability. In this study, two QSAR models were developed, with fitting parameters such as R² = 0.83–0.81 and F = 61.22–67.96, internal validation parameters such as Q²LOO = 0.79–0.77, Q²LMO = 0.78–0.76 and CCCcv = 0.89–0.88, and external validation parameters such as R²ext = 0.82 and CCCex = 0.90. In terms of mechanistic interpretation and statistical analysis, both QSAR models are well-balanced. Furthermore, molecular docking experiments on the binding of the most active compound to the LSD1 receptor corroborated the pharmacophoric features revealed by QSAR modelling. The docking results were then refined using molecular dynamics simulation and MMGBSA analysis. As a consequence, the findings of the study can be used to produce LSD1/KDM1A inhibitors as anticancer leads.

15.
Near infrared (NIR) spectroscopy based on effective wavelengths (EWs) and chemometrics was proposed to discriminate the varieties of fruit vinegars, including aloe, apple, lemon and peach vinegars. One hundred eighty samples (45 for each variety) were selected randomly for the calibration set, 60 samples (15 for each variety) for the validation set, and 24 samples (6 for each variety) for the independent set. Partial least squares discriminant analysis (PLS-DA) and least squares-support vector machine (LS-SVM) were implemented for calibration models. Different input data matrices of LS-SVM were determined by latent variables (LVs) selected by explained variance, and by EWs selected by x-loading weights, regression coefficients, modeling power and independent component analysis (ICA). The LS-SVM models were then developed with a grid search technique and an RBF kernel function. All LS-SVM models outperformed the PLS-DA model, and the optimal LS-SVM model was achieved with EWs (4021, 4058, 4264, 4400, 4853, 5070 and 5273 cm⁻¹) selected by regression coefficients. The determination coefficient (R²), RMSEP and total recognition ratio with cutoff value ±0.1 in the validation set were 1.000, 0.025 and 100%, respectively. The overall results indicated that the use of regression coefficients was an effective way to select effective wavelengths. NIR spectroscopy combined with LS-SVM models had the capability to discriminate the varieties of fruit vinegars with high accuracy.

16.
17.
18.
19.
Bioethanol can be obtained from wood by a simultaneous enzymatic saccharification and fermentation (SSF) step. However, for the enzymatic process to be effective, a pretreatment is needed to break the wood structure and to remove lignin to expose the carbohydrate components. Evaluation of these processes requires characterization of the materials generated in the different stages. The traditional analytical methods for wood, pretreated materials (pulps), monosaccharides in the hydrolyzed pulps, and ethanol involve laborious and destructive methodologies. This, together with the high cost of enzymes and the possibility of obtaining low ethanol yields from some pulps, makes it desirable to have rapid, nondestructive, less expensive, and quantitative methods to monitor the processes for obtaining ethanol from wood. In this work, infrared spectroscopy (IR) accompanied by multivariate analysis is used to chemically characterize organosolv-pretreated Eucalyptus globulus pulps (glucans, lignin, and hemicellulosic sugars), as well as to predict the ethanol yield after an SSF process. Mid-infrared (4,000–400 cm⁻¹) and near-infrared (12,500–4,000 cm⁻¹) spectra of pulps were used to obtain calibration models through partial least squares regression (PLS). The obtained multivariate models were validated by cross validation and by external validation. Mid-infrared (mid-IR)/NIR PLS models to quantify ethanol concentration were also compared with a mathematical approach to predict ethanol yield estimated from the chemical composition of the pulps determined by wet chemical methods (discrete chemical data).
Results show the high ability of the infrared spectra in both regions, mid-IR and NIR, to calibrate and predict the ethanol yield and the chemical components of pulps, with low values of standard calibration and validation errors (root mean square error of calibration, root mean square error of validation (RMSEV), and root mean square error of prediction), high correlation between the predicted values and those measured by the reference methods (R² between 0.789 and 0.997), and adequate values of the relative performance determinant (RPD), the ratio between the standard deviation of the reference methods and the standard errors of the infrared PLS models (greater than 3 for the majority of the models). Use of IR for ethanol quantification showed similar or even better results than those obtained with the discrete chemical data, especially in the case of mid-IR models, where ethanol concentration can be estimated with an RMSEV of 1.9 g L⁻¹. These results could facilitate the analysis of the high number of samples required in the evaluation and optimization of the processes.

20.
In quantitative structure-activity relationship (QSAR) studies, local lazy regression (LLR) can predict the activity of a query molecule by using the information of its local neighborhood, without the need to produce QSAR models a priori. When a prediction is required for a query compound, a set of local models including different numbers of nearest neighbors is identified. The leave-one-out cross-validation (LOO-CV) procedure is usually used to assess the prediction ability of each model, and the model giving the lowest LOO-CV error or the highest LOO-CV correlation coefficient is chosen as the best model. However, it has been shown that a good statistical value from LOO cross-validation is a necessary but not a sufficient condition for the model to have high predictive power. In this work, a new strategy is proposed to improve the predictive ability of LLR models and to assess the accuracy of a query prediction. The bandwidth of the k neighbor value for LLR is optimized by considering the predictive ability of local models using an external validation set. This approach was applied to the QSAR study of a series of thienopyrimidinone antagonists of melanin-concentrating hormone receptor 1. The results obtained with the new strategy show evident improvement compared with the commonly used LOO-CV LLR methods and the traditional global linear model. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2010
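
The idea of choosing the number of neighbors with an external validation set can be sketched as below. This is a simplified illustration with one descriptor and a local straight-line fit; the function names, the distance measure and the data are hypothetical, and the cited work's actual descriptors and local models are not reproduced here.

```python
import math

def fit_line(x, y):
    """Least squares line through the points; falls back to the mean if x is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 0.0, my
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def llr_predict(x_train, y_train, query, k):
    """Fit a local model on the k training samples nearest to the query, then predict."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - query))[:k]
    slope, intercept = fit_line([x_train[i] for i in nearest],
                                [y_train[i] for i in nearest])
    return slope * query + intercept

def choose_k(x_train, y_train, x_val, y_val, candidates):
    """Pick the neighbor count with the lowest RMSE on an external validation set."""
    def validation_rmse(k):
        squared = [(llr_predict(x_train, y_train, q, k) - t) ** 2
                   for q, t in zip(x_val, y_val)]
        return math.sqrt(sum(squared) / len(squared))
    return min(candidates, key=validation_rmse)

# Hypothetical nonlinear structure-activity trend (illustration only).
x_train = [float(v) for v in range(11)]
y_train = [v ** 2 for v in x_train]
x_val = [2.5, 6.5]
y_val = [v ** 2 for v in x_val]

best_k = choose_k(x_train, y_train, x_val, y_val, candidates=[2, 11])
print("selected k:", best_k)
print("prediction at 4.5:", llr_predict(x_train, y_train, 4.5, best_k))
```

With all 11 training samples the local model degenerates into a single global line, which fits this curved trend poorly, so the validation set favors the small neighborhood.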
