Similar literature
 Found 20 similar articles; search took 31 ms
1.
A new in silico model is developed to predict cytochrome P450 2D6 inhibition from 2D chemical structure. Using a diverse training set of 100 compounds with published inhibition constants, an ensemble approach to recursive partitioning is applied to create a large number of classification trees, each of which yields a yes/no prediction about inhibition for a given compound. These binary classifications are combined to provide an overall prediction, which answers the yes/no question about inhibition and provides a measure of confidence about that prediction. Compared to single-tree models, the ensemble approach is less sensitive to noise in the experimental data as well as to changes in the training set. Internal validation tests indicated an overall classification accuracy of 75%, whereas predictions applied to an external set of 51 compounds yielded 80% accuracy, with all inhibitors correctly identified. The speed and 2D nature of this model make it appropriate for high-throughput processing of large chemical libraries, and the confidence level provides a continuous scale on which to prioritize compounds.
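The way many yes/no tree predictions can be merged into one call with a confidence value can be sketched as follows. This is a minimal illustration, not the published model: the function name `ensemble_predict`, the tie-breaking rule, and the definition of confidence as the fraction of trees agreeing with the majority are all assumptions.

```python
def ensemble_predict(votes):
    """Combine per-tree yes/no votes into an overall call and a confidence.

    votes: iterable of booleans, one per tree (True = predicted inhibitor).
    Confidence is the fraction of trees that agree with the majority call
    (an assumed definition, not taken from the abstract).
    """
    votes = list(votes)
    if not votes:
        raise ValueError("need at least one vote")
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    # Ties are resolved as "inhibitor" (an assumed, conservative choice).
    is_inhibitor = yes >= no
    confidence = max(yes, no) / len(votes)
    return is_inhibitor, confidence
```

Because the confidence lies on a continuous scale between 0.5 and 1.0, compounds can be ranked by it rather than only split into two groups.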

2.
Near-infrared (NIR) spectra in the region of 5000–4000 cm⁻¹, together with a chemometric method called searching combination moving window partial least squares (SCMWPLS), were employed to determine the concentrations of human serum albumin (HSA), γ-globulin, and glucose in control serum IIB (CS IIB) solutions of various concentrations. SCMWPLS is proposed to search for the optimized combinations of informative regions, that is, spectral intervals considered to contain useful information for building partial least squares (PLS) models. The informative regions can easily be found by the moving window partial least squares regression (MWPLSR) method. PLS calibration models using the regions obtained by SCMWPLS were developed for HSA, γ-globulin, and glucose. These models showed good prediction, with the smallest root mean square error of prediction (RMSEP), a relatively small number of PLS factors, and the highest correlation coefficients among the results achieved with the whole region and with MWPLSR. The RMSEP values of HSA, γ-globulin, and glucose yielded by SCMWPLS were 0.0303, 0.0327, and 0.0195 g/dl, respectively. These results show that SCMWPLS can be successfully applied to determine simultaneously the concentrations of HSA, γ-globulin, and glucose in complicated biological fluids such as CS IIB solutions by using NIR spectroscopy.

3.
Predicting potentially dangerous chemical reactions is a critical task for laboratory safety. However, traditional experimental investigation of reaction conditions for possible hazardous or explosive byproducts entails substantial time and cost; machine learning prediction could accelerate this process and support detailed experimental investigations. Several machine learning models have been developed that predict major chemical reaction products with reasonable accuracy. However, these methods may not be accurate enough for predicting hazardous products, a task that particularly requires a low false negative rate so that no dangerous reactions are missed. In this work, we propose an explainable artificial intelligence model that can predict the formation of hazardous reaction products in a binary classification fashion. The reactant molecules are transformed into substructure-encoded fingerprints and then fed into a convolutional neural network to make the binary decision for the chemical reaction. The proposed model shows a false negative rate of 0.09, which can be compared with 0.47–0.66 using the existing main product prediction models. To explain which substructures of the given reactant molecules are important for a decision about target hazardous product formation, we apply an input attribution method, layer-wise relevance propagation, which computes the contributions of individual inputs for each input sample. The computed attributions indeed match some of the existing chemical intuitions and mechanisms, and also offer a way to analyze possible data-imbalance issues of the current predictions based on relatively small positive datasets. We expect that the proposed hazardous product prediction model will be complementary to existing main product prediction models and experimental investigations.

An explainable neural network model is developed to predict the formation of hazardous products for chemical reactions. An input attribution method, layer-wise relevance propagation, is used to explain the decision-making process.

4.
5.
The rivality index (RI) is a normalized distance measurement between a molecule and its first nearest neighbours, providing a robust prediction of the activity of a molecule based on the known activity of its nearest neighbours. Negative values of the RI describe molecules that would be correctly classified by a statistical algorithm; conversely, positive values of this index describe molecules detected as outliers by the classification algorithms. In this paper, we describe a classification algorithm based on the RI and propose four weighting schemes (kernels) for its calculation, based on measuring different characteristics of the neighbourhood of each molecule in the dataset at established values of the neighbour threshold. The results obtained demonstrate that the proposed RI-based classification algorithm generates more reliable and robust classification models than many of the most widely used and well-known machine learning algorithms. These results have been validated and corroborated using 20 balanced and unbalanced benchmark datasets of different sizes and modelability. The classification models generated provide valuable information about the molecules of the dataset, the applicability domain of the models and the reliability of the predictions.

6.
7.
8.
Many complex natural or synthetic products are analysed by either the GC–MS (gas chromatography–mass spectrometry) or the HPLC–DAD (high performance liquid chromatography–diode-array detector) technique, each of which produces a one-dimensional fingerprint for a given sample. This may be used for classification of different batches of a product. GC–MS and HPLC–DAD data for complex, similar substances, represented by the three common types of the TCM (traditional Chinese medicine) Rhizoma Curcumae, were first analysed in the form of one- and two-dimensional matrices with PCA (principal component analysis), which showed a reasonable separation of the samples for each technique. However, the separation patterns were rather different for each analytical method, and PCA of the combined data matrix showed improved discrimination of the three types of object; close associations between the GC–MS and HPLC–DAD variables were observed. LDA (linear discriminant analysis), BP-ANN (back propagation-artificial neural networks) and LS-SVM (least squares-support vector machine) chemometrics methods were then applied to classify the training and prediction sets. For one-dimensional matrices, all training models indicated that several samples would be misclassified; the same was observed for each prediction set. By comparison, in the analysis of the combined matrix, all models gave 100% classification with the training set, and the LS-SVM calibration also produced a 100% result for prediction, with the BP-ANN calibration close behind. This has important implications for comparing complex substances such as the TCMs, because the one-dimensional data matrices alone clearly produce inferior results for training and prediction compared with the combined data matrix models. Thus, product samples may be misclassified with the use of the one-dimensional data because of insufficient information.

9.
Drug-likeness prediction is important for the virtual screening of drug candidates. It is challenging because drug-likeness is presumably associated with the whole set of properties necessary to pass through clinical trials, and thus no definite data for regression are available. Recently, binary classification models based on graph neural networks have been proposed, but their performance depends strongly on the choice of the negative set for training. Here we propose a novel unsupervised learning model that requires only known drugs for training. We adopted a language model based on a recurrent neural network for unsupervised learning. It showed relatively consistent performance across different datasets, unlike such classification models. In addition, the unsupervised learning model yields drug-likeness score distributions that are well separated, with mean values increasing as the datasets comprise molecules at later stages of the drug development process, whereas the classification model predicted a polarized distribution with two extreme values for all datasets, presumably due to overconfident prediction for unseen data. Thus, this new concept offers a pragmatic tool for drug-likeness scoring and can further be applied to other biochemical applications.

A new quantification method of drug-likeness based on unsupervised learning. The method only uses drug molecules as the training set, without any non-drug-like molecules.

10.
11.
Public domain and commercial in silico tools were compared for their performance in predicting the skin sensitization potential of chemicals. The packages were either statistics based (Vega, CASE Ultra) or rule based (OECD Toolbox, Toxtree, Derek Nexus). In practice, several of these in silico tools are used in gap filling and read-across, but here their use was limited to making predictions based on the presence or absence of structural features associated with sensitization. The top 400 ranking substances of the ATSDR 2011 Priority List of Hazardous Substances were selected as a starting point. Experimental information was identified for 160 chemically diverse substances (82 positive and 78 negative). The prediction for skin sensitization potential was compared with the experimental data. Rule-based tools performed slightly better, with accuracies ranging from 0.6 (OECD Toolbox) to 0.78 (Derek Nexus), compared with statistical tools, which had accuracies ranging from 0.48 (Vega) to 0.73 (CASE Ultra – LLNA weak model). Combining models increased the performance, with positive and negative predictive values up to 80% and 84%, respectively. However, the number of substances that were predicted positive or negative for skin sensitization in both models was low. Adding more substances to the dataset will increase the confidence in the conclusions reached. The insights obtained in this evaluation are incorporated in a web database, www.asopus.weebly.com, that provides a potential end user context for the scope and performance of different in silico tools with respect to a common dataset of curated skin sensitization data.
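The accuracy and the positive and negative predictive values quoted above follow from the four cells of a confusion matrix. The sketch below shows the arithmetic; the function name `screening_metrics` and the example counts are hypothetical and are not results from the study (only the 82 positive / 78 negative split is taken from the abstract).

```python
def screening_metrics(tp, fp, tn, fn):
    """Return accuracy, PPV, and NPV from confusion-matrix counts.

    tp/fn: experimentally positive substances predicted positive/negative.
    tn/fp: experimentally negative substances predicted negative/positive.
    A metric with a zero denominator is reported as None.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ratio(tp, tp + fp),  # positive predictive value
        "npv": ratio(tn, tn + fn),  # negative predictive value
    }


# Hypothetical counts for a 160-substance set (82 positive, 78 negative).
example = screening_metrics(tp=60, fp=23, tn=55, fn=22)
```

Note that PPV and NPV depend on the proportion of positives in the dataset, so values obtained on a nearly balanced set such as this one may not carry over to a set with a different balance.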

12.
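To achieve rapid, accurate, and non-destructive identification of heavy mineral oil evidence in forensic science, this paper proposes a spectral-analysis method that combines multi-order derivative spectral data. Eighty heavy mineral oil samples of different models and from different manufacturers were collected. Fourier transform Raman spectroscopy was used to acquire the raw and derivative spectra of the samples, and classification models were built with chemometric methods. In the classification model combining principal component analysis (PCA) with a radial basis function (RBF) neural network, the training-set accuracies for the raw spectra, first-derivative spectra, and second-derivative spectra alone were 80.0%, 86.7%, and 86.2%, respectively, and the test-set accuracies were 73.3%, 80.0%, and 72.7%, respectively. For the combined raw + first-derivative, raw + second-derivative, and first-derivative + second-derivative data, the training-set accuracies were 97.0%, 96.7%, and 100%, and the test-set accuracies were 85.7%, 90.0%, and 100%, respectively. The results show that classification models built on combinations of derivative and raw spectra achieve higher accuracy. The PCA-RBF model built on the first-derivative + second-derivative data gave the best result, with an accuracy of 100%. The K-nearest-neighbour model, affected by the uneven sample distribution, showed low overall classification accuracy. Building classification models from combined derivative and raw spectral data enables rapid, accurate, and non-destructive discrimination of heavy mineral oil samples and may serve as a reference for applying combined-spectra techniques in forensic science and other analytical testing fields.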

13.
《Analytical letters》2012,45(6):1209-1226
Abstract

A sensitive method for the simultaneous spectrophotometric determination of Fe(II), Cu(II), Zn(II), and Mn(II) in mixtures has been developed with the aid of multivariate calibration methods, such as classical least squares (CLS), principal component regression (PCR) and partial least squares (PLS). The method is based on the spectral differences of the analytes in their complexation reaction with 4-(2-pyridylazo)-resorcinol (PAR) and the use of full spectra with wavelengths in the range of 300–600 nm. It was found that both the positive and negative spectral bands obtained against the PAR blank are proportional to the concentration of each metal complex. The obtained linear calibration concentration ranges are 0.025–0.6, 0.05–0.8, 0.025–0.8, and 0.05–0.8 µg ml⁻¹ for Fe(II), Cu(II), Zn(II), and Mn(II), respectively, and the LODs for the four metal ions were found to be approximately 1–3×10⁻² µg ml⁻¹. The proposed method was applied to a verification set of synthetic mixtures of these four metal ions, with models built in three different wavelength ranges, i.e., 300–450, 450–600, and 300–600 nm, corresponding to the positive bands, the negative bands, and their combination, respectively. It was shown that the PLS model for the 300–600 nm range gave the best results (RPE_T = 6.9% and average recovery ~100%; cf. PCR: RPE_T = 9.5% and average recovery ~110%). This method was also successfully applied for the determination of the four metal ions in pharmaceutical preparations, chicken feedstuff, and water samples.

14.
M. Carsky  D.D. Do 《Adsorption》1999,5(3):183-192
Three neural network models were used for prediction of adsorption equilibria of binary vapour mixtures on an activated carbon. The predictions were compared both with published experimental data and calculated values from the Ideal Adsorption Solution (IAS) model. The neural network was trained using both binary and single component experimental adsorption data. Even for a limited number of data points (about 60) the network models were capable of approximating experimental data very precisely.

15.
The Ames mutagenicity test in Salmonella typhimurium is a bacterial short-term in vitro assay aimed at detecting the mutagenicity caused by chemicals. Mutagenicity is considered as an early alert for carcinogenicity. After a number of decades, several (Q)SAR studies on this endpoint yielded enough evidence to make feasible the construction of reliable computational models for prediction of mutagenicity from the molecular structure of chemicals. In this study, we propose a combination of a fragment-based SAR model and an inductive database. The hybrid system was developed using a collection of 4337 chemicals (2401 mutagens and 1936 nonmutagens) and tested using 753 independent compounds (437 mutagens and 316 nonmutagens). The overall error of this system on the external test set compounds is 15% (sensitivity = 15%, specificity = 15%), which is quantitatively similar to the experimental error of Ames test data (average interlaboratory reproducibility determined by the National Toxicology Program). Moreover, each single prediction is provided with a specific confidence level. The results obtained give confidence that this system can be applied to support early and rapid evaluation of the level of mutagenicity concern.

16.
《Fluid Phase Equilibria》2004,217(2):157-164
Experimental isothermal P–x data at T = 313.15 K for the binary systems 1,1-dimethylethyl methyl ether (MTBE) + n-hexane and methanol + n-hexane, and the ternary system MTBE + methanol + n-hexane are reported. Data reduction by Barker's method provides correlations for G^E using the Margules equation for the binary systems and the Wohl expansion for the ternary system. Wilson, NRTL and UNIQUAC models have been applied successfully to both the binary and the ternary systems. Moreover, we compare the experimental results for these binary mixtures to the prediction of the UNIFAC (Dortmund) model. Experimental results have been compared to predictions for the ternary system obtained from the Wilson, NRTL, UNIQUAC and UNIFAC models; for the ternary system, the UNIFAC predictions seem poor. The presence of azeotropes in the binary systems has been studied.

17.
This work describes multi-classification based on binary probabilistic discriminant partial least squares (p-DPLS) models, developed with the one-against-one strategy and the winner-takes-all principle. The multi-classification problem is split into binary classification problems with p-DPLS models. The results of these models are combined to obtain the final classification result. The classification criterion uses the specific characteristics of an object (position in the multivariate space and prediction uncertainty) to estimate the reliability of the classification, so that the object is assigned to the class with the highest reliability. This new methodology is tested with the well-known Iris data set and a data set of Italian olive oils. When compared with CART and SIMCA, the proposed method has better average classification performance, and it also gives a statistic that evaluates the reliability of classification. For the olive oil set, the average percentage of correct classification for the training set was close to 84% with p-DPLS as against 75% with CART and 100% with SIMCA, while for the test set the average was close to 94% with p-DPLS as against 50% with CART and 62% with SIMCA.
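The one-against-one, winner-takes-all scheme can be sketched as below. This is a simplified illustration under stated assumptions: the pairwise models are hypothetical threshold stubs rather than p-DPLS models, each is assumed to return a winning class with a reliability in [0, 1], and summing reliabilities per class is an assumed aggregation rule, not necessarily the criterion used in the paper.

```python
from itertools import combinations


def one_vs_one_predict(classes, pairwise, x):
    """Assign x to the class with the highest accumulated reliability.

    pairwise maps each class pair (a, b) to a callable that takes x and
    returns (winning_class, reliability).
    """
    scores = {c: 0.0 for c in classes}
    for a, b in combinations(classes, 2):
        winner, reliability = pairwise[(a, b)](x)
        scores[winner] += reliability
    # Winner takes all: the class with the largest total score is chosen.
    return max(scores, key=scores.get), scores


# Hypothetical stand-ins for three binary models on a single feature.
CLASSES = ["setosa", "versicolor", "virginica"]
PAIRWISE = {
    ("setosa", "versicolor"):
        lambda x: ("setosa", 0.9) if x < 2.5 else ("versicolor", 0.9),
    ("setosa", "virginica"):
        lambda x: ("setosa", 0.95) if x < 2.5 else ("virginica", 0.95),
    ("versicolor", "virginica"):
        lambda x: ("versicolor", 0.7) if x < 4.8 else ("virginica", 0.7),
}

label, scores = one_vs_one_predict(CLASSES, PAIRWISE, 4.0)
```

With K classes, this strategy needs K(K-1)/2 binary models, three in the example above.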

18.
Accurate in silico models for the quantitative prediction of the activity of G protein-coupled receptor (GPCR) ligands would greatly facilitate the process of drug discovery and development. Several methodologies have been developed based on the properties of the ligands, the direct study of the receptor-ligand interactions, or a combination of both approaches. Ligand-based three-dimensional quantitative structure-activity relationship (3D-QSAR) techniques, which do not require knowledge of the receptor structure, were historically the first to be applied to the prediction of the activity of GPCR ligands. They are generally endowed with robustness and good ranking ability; however, they are highly dependent on training sets. Structure-based techniques generally do not provide the level of accuracy necessary to yield meaningful rankings when applied to GPCR homology models. However, they are essentially independent of training sets and have a sufficient level of accuracy to allow an effective discrimination between binders and nonbinders, thus qualifying as viable lead discovery tools. The combination of ligand and structure-based methodologies in the form of receptor-based 3D-QSAR and ligand and structure-based consensus models results in robust and accurate quantitative predictions. The contribution of the structure-based component to these combined approaches is expected to become more substantial and effective in the future, as more sophisticated scoring functions are developed and more detailed structural information on GPCRs is gathered.

19.
In this paper, a genetic algorithm-support vector regression (GA-SVR) coupled approach was proposed for investigating the relationship between fingerprints and properties of herbal medicines. GA was used to select variables so as to improve the predictive ability of the models. Two other widely used approaches, Random Forests (RF) and partial least squares regression (PLSR) combined with GA (namely GA-RF and GA-PLSR, respectively), were also employed and compared with the GA-SVR method. The models were evaluated in terms of the correlation coefficient between the measured and predicted values (Rp), root mean square error of prediction, and root mean square error of leave-one-out cross-validation. The performance has been tested on a simulated system, a chromatographic data set, and a near-infrared spectroscopic data set. The obtained results indicate that the GA-SVR model provides a more accurate answer, with higher Rp and lower root mean square error. The proposed method is suitable for the quantitative analysis and quality control of herbal medicines. Copyright © 2012 John Wiley & Sons, Ltd.
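The two headline figures of merit named above, the correlation coefficient between measured and predicted values (Rp) and the root mean square error of prediction, can be computed as in this minimal sketch. The function names are illustrative, and Rp is assumed here to be the Pearson correlation coefficient.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length inputs with at least 2 points")
    mean_m = sum(measured) / len(measured)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_m * var_p)
```

The same `rmse` function gives the leave-one-out cross-validation error when `predicted` holds, for each sample, the prediction of a model fitted without that sample.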

20.
The fluorescence spectrum and its first and second derivative spectra in the region of 220–900 nm were used to determine the concentration of triglyceride in human serum. Nonlinear partial least squares regression with a nonlinear transformation based on cubic B-spline functions was employed as the chemometric method. Window genetic algorithms partial least squares (WGAPLS) was proposed as a new wavelength selection method to find the optimized combination of spectral wavelengths. The study shows that when WGAPLS is applied within the optimized regions ascertained by changeable size moving window partial least squares (CSMWPLS) or searching combination moving window partial least squares (SCMWPLS), the calibration and prediction performance of the model can be further improved at a reasonable number of latent variables. SCMWPLS should start from the sub-region found by CSMWPLS with the smallest root mean square error of calibration (RMSEC). In addition, WGAPLS should be used within the region of smallest RMSEC, whether it is the sub-region found by CSMWPLS or the region combination found by SCMWPLS. Moreover, the prediction ability of the nonlinear models was significantly better than that of the linear models. The prediction performance of the three spectra was in the following order: second derivative spectrum < original spectrum < first derivative spectrum. Wavelengths within the regions of 300–367 nm and 386–392 nm in the first derivative of the original fluorescence spectrum were the optimized wavelength combination for the prediction model. Copyright © 2012 John Wiley & Sons, Ltd.
