Similar articles
 Found 20 similar articles; search took 15 ms
1.
With the emergence of combinatorial chemistry, whether based on parallel, mixture, solution, or solid-phase chemistry, it is now possible to generate large numbers of diverse or focused compound libraries. In this paper we aim to demonstrate that it is possible to design targeted libraries by applying nonparametric statistical methods, recursive partitioning in particular, to large data sets containing thousands of compounds and their associated biological data. Moreover, when applied to an experimental high-throughput screening (HTS) data set, our data strongly suggest that this method can improve the hit rate of our primary screens (about 4- to 5-fold) while increasing screening efficiency: less than one-fifth of the complete selection needs to be screened in order to identify about 75% of all actives present.
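The relation between the two figures quoted in this abstract can be checked with a little arithmetic. The sketch below is not from the paper; it assumes that "hit rate improvement" means the hit rate in the screened subset divided by the hit rate of the complete selection, which reduces to (fraction of actives recovered) / (fraction of compounds screened). The function name `fold_enrichment` is hypothetical.

```python
def fold_enrichment(fraction_actives_found: float, fraction_screened: float) -> float:
    """Hit-rate improvement of a screened subset over the complete selection.

    Assumed definition (not stated in the abstract):
        (actives found / compounds screened) / (all actives / all compounds)
    which simplifies to fraction_actives_found / fraction_screened.
    """
    if not 0.0 < fraction_screened <= 1.0:
        raise ValueError("fraction_screened must be in (0, 1]")
    if not 0.0 <= fraction_actives_found <= 1.0:
        raise ValueError("fraction_actives_found must be in [0, 1]")
    return fraction_actives_found / fraction_screened


# Screening exactly one-fifth of the selection and recovering 75% of actives:
at_one_fifth = fold_enrichment(0.75, 0.20)

# Screening somewhat less than one-fifth (illustrative fractions only):
at_15_percent = fold_enrichment(0.75, 0.15)
```

Under this assumed definition, screening exactly one-fifth gives a 3.75-fold improvement, and screening about 15% to 19% of the selection gives roughly 4- to 5-fold, which is consistent with the abstract's "less than one-fifth".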

2.
Various in vitro and in silico methods have been used for drug genotoxicity tests, which show limited genotoxicity (GT+) and non-genotoxicity (GT−) identification rates. New methods and combinatorial approaches have been explored for enhanced collective identification capability. The rates of in silico methods may be further improved by significantly diversified training data enriched by the large number of recently reported GT+ and GT− compounds, but a major concern is the increased noise levels arising from the high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machine (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs of different diversity/noise levels were developed and tested. H-SVM, trained on higher-diversity, higher-noise data (GT+ in any in vivo or in vitro test), outperforms L-SVM, trained on lower-noise, lower-diversity data (GT+ in in vivo or Ames test only). H-SVM, trained on 4,763 GT+ compounds reported before 2008 and 8,232 GT− compounds excluding clinical trial drugs, correctly identified 81.6% of the 38 GT+ compounds reported since 2008, predicted 83.1% of the 2,008 clinical trial drugs as GT−, and predicted 23.96% of the 168K MDDR and 27.23% of the 17.86M PubChem compounds as GT+. These are comparable to the 43.1–51.9% GT+ and 75–93% GT− rates of existing in silico methods, the 58.8% GT+ and 79% GT− rates of the Ames method, and the estimated percentages of 23% in vivo and 31–33% in vitro GT+ compounds in the “universe of chemicals”. There is a substantial level of agreement between the H-SVM and L-SVM predicted GT+ and GT− MDDR compounds and the predictions from TOPKAT. SVM showed good potential in identifying GT+ compounds from large compound libraries based on higher-diversity and higher-noise training data.

3.
Aqueous solubility is recognized as a critical parameter in both early- and late-stage drug discovery. Therefore, in silico modeling of solubility has attracted extensive interest in recent years. Most previous studies have been limited to relatively small data sets with limited diversity, which in turn limits the predictability of the derived models. In this work, we present a support vector machine model for the binary classification of solubility by taking advantage of the largest known public data set, which contains over 46,000 compounds with experimental solubility. Our model was optimized in combination with a reduction and recombination feature selection strategy. The best model demonstrated robust performance in both cross-validation and prediction of two independent test sets, indicating it could be a practical tool to select soluble compounds for screening, purchasing, and synthesizing. Moreover, because it uses entirely public resources, our work may be used for comparative evaluation of solubility classification studies.

4.
The need for rapid and accurate detection systems is expanding, and the use of cross-reactive sensor arrays to detect chemical warfare agents in conjunction with novel computational techniques may prove to be a potential solution to this challenge. We have investigated the detection, prediction, and classification of various organophosphate (OP) nerve agent simulants using sensor arrays with a novel learning scheme known as support vector machines (SVMs). The OPs tested include parathion, malathion, dichlorvos, trichlorfon, paraoxon, and diazinon. A new data reduction software program was written in MATLAB V. 6.1 to extract steady-state and kinetic data from the sensor arrays. The program also creates training sets by mixing and randomly sorting any combination of data categories into both positive and negative cases. The resulting signals were fed into SVM software for "pairwise" and "one vs. all" classification. Experimental results for this new paradigm show a significant increase in classification accuracy when compared to artificial neural networks (ANNs). Three kernels, the S2000, the polynomial, and the Gaussian radial basis function (RBF), were tested and compared to the ANN. The following measures of performance were considered in the pairwise classification: receiver operating characteristic (ROC) Az indices, specificities, and positive predictive values (PPVs). The increases in ROC Az values, specificities, and PPVs ranged from 5% to 25%, 108% to 204%, and 13% to 54%, respectively, in all OP pairs studied when compared to the ANN baseline. Dichlorvos, trichlorfon, and paraoxon were perfectly predicted. Positive prediction for malathion was 95%.
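Two of the performance measures named in this abstract, specificity and positive predictive value (PPV), follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from the paper.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of actual negatives that were classified as negative."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no negative cases")
    return true_negatives / total


def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of positive predictions that were actually positive."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no positive predictions")
    return true_positives / total


# Hypothetical counts for one pairwise classifier (illustration only).
example_specificity = specificity(true_negatives=90, false_positives=10)
example_ppv = positive_predictive_value(true_positives=19, false_positives=1)
```

With these made-up counts, 19 correct positive calls out of 20 gives a PPV of 0.95, the same form as the 95% positive prediction quoted for malathion.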

5.
It is known that in the three-dimensional structure of a protein, certain amino acids can interact with each other in order to provide structural integrity or aid in its catalytic function. If these positions are mutated, the loss of this interaction usually leads to a non-functional protein. Directed evolution experiments, which probe the sequence space of a protein through mutations in search of an improved variant, frequently result in such inactive sequences. In this work, we address the use of machine learning algorithms, Boolean learning and support vector machines (SVMs), to find such pairs of amino acid positions. The recombination method of imparting mutations was simulated to create in silico sequences that were used as training data for the algorithms. The two algorithms were combined to develop an approach that weighs the structural risk as well as the empirical risk to solve the problem. This strategy was adapted to a multi-round framework of experiments in which the data generated in the present round are used to design experiments for the next round, improving both the generated library and the estimation of the interacting positions. It is observed that this strategy can greatly improve the number of functional variants that are generated as well as the average number of mutations that can be made in the library.

6.
Compared to the current knowledge on cancer chemotherapeutic agents, only limited information is available on the ability of organic compounds, such as drugs and/or natural products, to prevent or delay the onset of cancer. In order to evaluate chemical chemopreventive potentials and design novel chemopreventive agents with low to no toxicity, we developed predictive computational models for chemopreventive agents in this study. First, we curated a database containing over 400 organic compounds with known chemoprevention activities. Based on this database, various random forest and support vector machine binary classifiers were developed. All of the resulting models were validated by cross-validation procedures. Then, the validated models were applied to virtually screen a chemical library containing around 23,000 natural products and derivatives. We selected a list of 148 novel chemopreventive compounds based on the consensus prediction of all validated models. We further analyzed the predicted active compounds by their ease of organic synthesis. Finally, 18 compounds were synthesized and experimentally validated for their chemopreventive activity. The experimental validation results paralleled the cross-validation results, demonstrating the utility of the developed models. The predictive models developed in this study can be applied to virtually screen other chemical libraries to identify novel lead compounds for the chemoprevention of cancers.

7.
8.
9.
Statistical learning methods have been used in developing filters for predicting inhibitors of two P450 isoenzymes, CYP3A4 and CYP2D6. This work explores the use of different statistical learning methods for predicting inhibitors of these enzymes and an additional P450 enzyme, CYP2C9, and the substrates of the three P450 isoenzymes. Two consensus support vector machine (CSVM) methods, "positive majority" (PM-CSVM) and "positive probability" (PP-CSVM), were used in this work. These methods were first tested for the prediction of inhibitors of CYP3A4 and CYP2D6 by using a significantly higher number of inhibitors and noninhibitors than that used in earlier studies. They were then applied to the prediction of inhibitors of CYP2C9 and substrates of the three enzymes. Both methods predict inhibitors of CYP3A4 and CYP2D6 at a level of accuracy similar to that of earlier studies. For classification of inhibitors of CYP2C9, the best CSVM method gives an accuracy of 88.9% for inhibitors and 96.3% for noninhibitors. The accuracies for classification of substrates and nonsubstrates of CYP3A4, CYP2D6, and CYP2C9 are 98.2 and 90.9%, 96.6 and 94.4%, and 85.7 and 98.8%, respectively. Both CSVM methods are potentially useful as filters for predicting inhibitors and substrates of P450 isoenzymes. These methods generally give better accuracies than single SVM classification systems, and the performance of the PP-CSVM method is slightly better than that of the PM-CSVM method.
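The abstract names a "positive majority" consensus scheme but does not define it. One plausible reading, shown here purely as an assumption, is that a compound is labeled positive when more than half of the individual SVM models vote positive. The function name `positive_majority` and the tie rule (ties count as negative) are hypothetical and may differ from the paper's PM-CSVM.

```python
from typing import Sequence


def positive_majority(votes: Sequence[int]) -> int:
    """Combine per-model labels (+1 or -1) by strict majority vote.

    Assumption: a strict majority of +1 votes is required for a positive
    consensus; ties and empty input fall back to the negative class.
    """
    positives = sum(1 for vote in votes if vote == 1)
    return 1 if positives > len(votes) / 2 else -1


# Hypothetical votes from three individual SVM models for two compounds.
consensus_a = positive_majority([1, 1, -1])   # two of three vote positive
consensus_b = positive_majority([1, -1, -1])  # one of three votes positive
```

A probability-based consensus such as the one the abstract calls PP-CSVM would need per-model probability estimates instead of hard votes; it is not sketched here because the abstract gives no details.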

10.
11.
12.
Ren S, Gao L. The Analyst, 2011, 136(6): 1252-1261
This paper proposes a novel method named DF-LS-SVM, which is based on least squares support vector machines (LS-SVM) regression combined with data fusion (DF) to enhance the ability to extract characteristic information and improve the quality of the regression. Simultaneous multicomponent determination of Fe(III), Co(II) and Cu(II) was conducted for the first time by using the proposed method. Data fusion is a technique that integrates information from disparate sources to produce a single model or decision. The LS-SVM technique allows for learning in a high-dimensional feature space with fewer training data, and reduces the computational complexity by only requiring the solution of a set of linear equations instead of a quadratic programming problem. Experimental results showed that the DF-LS-SVM method was successful for simultaneous multicomponent determination even when severe overlap of spectra existed. The DF-LS-SVM method is an attractive and promising hybrid approach that combines the best properties of the two techniques. The results obtained from an additional test case, simultaneous differential pulse voltammetric determination of o-nitrophenol, m-nitrophenol and p-nitrophenol, also demonstrated that the DF-LS-SVM method performed somewhat better than LS-SVM and PLS methods.
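The remark that LS-SVM needs only a set of linear equations can be made concrete. The sketch below is a minimal LS-SVM regression in the standard textbook formulation, not the paper's DF-LS-SVM: it solves the bordered system [[0, 1ᵀ], [1, K + I/γ]] · [b; α] = [0; y] and predicts with f(x) = b + Σ αᵢ K(x, xᵢ). The RBF kernel, the values of `gamma` and `sigma`, the helper names, and the toy data are all assumptions chosen for illustration.

```python
import math


def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel between two feature vectors."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lssvm_fit(X, y, gamma=100.0, sigma=1.0):
    """Return (b, alpha) from the LS-SVM linear system; no QP is needed."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    solution = solve(A, [0.0] + list(y))
    return solution[0], solution[1:]


def lssvm_predict(x, X, b, alpha, sigma=1.0):
    """Evaluate f(x) = b + sum_i alpha_i * K(x, x_i)."""
    return b + sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X))


# Toy one-dimensional data (hypothetical, for illustration only).
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 4.0, 9.0]
b, alpha = lssvm_fit(X_train, y_train)
```

In this formulation every training point receives a coefficient αᵢ, so the model is not sparse the way a standard SVM is; that is the usual trade-off for replacing the quadratic program with a single linear solve.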

13.
14.
Least squares support vector machines (LS-SVM) were used to model infrared spectral data for TSH hormone secreted by thyroid, which regulates the basal metabolic rate. This model was used for direct estimation of the content of TSH in blood serum samples, and the results were comparable with those obtained with the conventional analytical method based on chemiluminescence methodology. Excellent agreement was observed between the conventional method and the newly developed calibration model based on analysis of spectral data with LS-SVM. The latter has clear advantages: it is fast and requires no reagents, since the measurements are made directly in the serum by using a simple mid-infrared spectrometer in the ATR mode. An important advantage observed in this calibration method based on LS-SVM is the remarkable capacity to avoid overfitting in the model-building step; that is, the developed method is highly robust.

15.
Support vector machines (SVMs) were used as a novel learning machine in the authentication of the origin of salmon. SVMs have the advantage of relying on a well-developed theory and have already proved to be successful in a number of practical applications. This paper provides a new and effective method for the discrimination between wild and farmed salmon and eliminates the possibility of fraud through misrepresentation of the country of origin of salmon. The method requires a very simple sample preparation of the fish oils extracted from the white muscle of salmon samples. ¹H NMR spectroscopic analysis provides data that are very informative for analysing the fatty acid constituents of the fish oils. The SVM was able to distinguish correctly between the wild and farmed salmon; however, ca. 5% of the countries of origin were misclassified.

16.
Ternary mixtures of thiamin, riboflavin and pyridoxal have been simultaneously determined in synthetic and real samples by applying spectrophotometry and least-squares support vector machines (LS-SVM). The calibration graphs were linear in the ranges of 1.0 - 20.0, 1.0 - 10.0 and 1.0 - 20.0 µg mL⁻¹, with detection limits of 0.6, 0.5 and 0.7 µg mL⁻¹ for thiamin, riboflavin and pyridoxal, respectively. The experimental calibration matrix was designed with 21 mixtures of these chemicals. The concentrations were varied within the linear calibration ranges of the vitamins. The simultaneous determination of these vitamin mixtures by using spectrophotometric methods is a difficult problem, due to spectral interferences. Partial least squares (PLS) modeling and LS-SVM were used for the multivariate calibration of the spectrophotometric data. An excellent model was built using LS-SVM, with low prediction errors and superior performance in relation to PLS. The root mean square errors of prediction (RMSEP) for thiamin, riboflavin and pyridoxal were 0.6926, 0.3755 and 0.4322 with PLS and 0.0421, 0.0318 and 0.0457 with LS-SVM, respectively. The proposed method was satisfactorily applied to the rapid simultaneous determination of thiamin, riboflavin and pyridoxal in commercial pharmaceutical preparations and human plasma samples.
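The RMSEP figures quoted in this abstract follow the usual definition, the square root of the mean squared difference between predicted and reference concentrations over the prediction set. A minimal sketch is given below; the function name `rmsep` and the sample values are hypothetical, and the paper's own prediction data are not reproduced here.

```python
import math
from typing import Sequence


def rmsep(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error of prediction over a prediction set."""
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared_error / len(reference))


# Hypothetical predicted vs. reference concentrations (illustration only).
example_rmsep = rmsep([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])

# Ratio of the PLS to LS-SVM RMSEP values quoted in the abstract, per vitamin.
pls = {"thiamin": 0.6926, "riboflavin": 0.3755, "pyridoxal": 0.4322}
ls_svm = {"thiamin": 0.0421, "riboflavin": 0.0318, "pyridoxal": 0.0457}
improvement = {name: pls[name] / ls_svm[name] for name in pls}
```

From the quoted values, the LS-SVM errors are roughly 9 to 16 times lower than the PLS errors, depending on the vitamin.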

17.
18.
In this paper, the performance of new clustering methods such as Neural Gas (NG) and Growing Neural Gas (GNG) is compared with the K-means method for real and simulated data sets. Moreover, a new algorithm called growing K-means (GK) is introduced as an alternative to Neural Gas and Growing Neural Gas. It has small input requirements and is conceptually very simple. GK leads to nearly optimal values of the cost function and, contrary to K-means, is independent of the initial data set partition. The incremental property of GK additionally helps to estimate the number of "natural" clusters in data, i.e., the well-separated groups of objects in the data space.

19.
Wang G, Sun YA, Ding Q, Dong C, Fu D, Li C. Analytica chimica acta, 2007, 594(1): 101-106
A method that uses kernel independent component analysis (KICA) and support vector regression (SVR) was proposed for estimation of source ultraviolet (UV) spectra profiles and simultaneous determination of polycomponents in mixtures. In the KICA-SVR procedure, the UV source spectra profiles were estimated using KICA, then the mixing matrix of the components was calculated using the estimated sources, and the calibration model was built using SVR based on the calculated mixing matrix. A simulated UV dataset of three-component mixtures was used to test the ability of KICA to estimate source spectra profiles from spectra data of mixtures. It was found that KICA has the potential to estimate pure UV spectra profiles, and that the correlation coefficients between the estimated sources and the real ones are better than those obtained by FastICA and Infomax ICA. A UV dataset of polycomponent vitamin B was processed using the proposed KICA-SVR method. The results show that the estimated source spectra profiles are correlated with the real UV spectra of the components and chemically interpretable, and accurate results were obtained.

20.
In order to understand the molecular mechanism underlying any disease, knowledge about the interacting proteins in the disease pathway is essential. The number of revealed protein–protein interactions (PPI) is still very limited compared to the available protein sequences of different organisms. Although experiment-based high-throughput technologies provide some data about these interactions, the data are often fairly noisy. Computational techniques for predicting protein–protein interactions therefore assume significance. A total of 1296 binary fingerprints that encode a combination of structural and geometric properties were developed using the crystallographic data of 15,000 protein complexes in the PDB server. In a case study, these fingerprints were created for proteins implicated in Type 2 diabetes mellitus. The fingerprints were input into an SVM-based model for discriminating disease proteins from non-disease proteins, yielding a classification accuracy of 78.2% (AUC value of 0.78) on an external data set composed of proteins retrieved via text mining of diabetes-related literature. A PPI network was constructed and analysed to explore new disease targets. The integrated approach exemplified here has potential for identifying disease-related proteins, functional annotation and other proteomics studies.
