
1.

Ensemble methods for classification in cheminformatics

Merkwirth C Mauser H Schulz-Gasch T Roche O Stahl M Lengauer T 《Journal of chemical information and computer sciences》2004,44(6):1971-1978

We describe the application of ensemble methods to binary classification problems on two pharmaceutical compound data sets. Several variants of single and ensemble models of k-nearest neighbors classifiers, support vector machines (SVMs), and single ridge regression models are compared. All methods exhibit robust classification even when more features are given than observations. On two data sets dealing with specific properties of drug-like substances (cytochrome P450 inhibition and "Frequent Hitters", i.e., unspecific protein inhibition), we achieve classification rates above 90%. We are able to reduce the cross-validated misclassification rate for the Frequent Hitters problem by a factor of 2 compared to previous results obtained for the same data set with different modeling techniques.

2.

Modeling and benchmark data set for the inhibition of c-Jun N-terminal kinase-3

Schattel V Hinselmann G Jahn A Zell A Laufer S 《Journal of chemical information and modeling》2011,51(3):670-679

The goal of this paper is to present and describe a novel 2D- and 3D-QSAR (quantitative structure-activity relationship) binary classification data set for the inhibition of c-Jun N-terminal kinase-3 (JNK3) with previously unpublished activities for a diverse set of compounds. JNK3 is an important pharmaceutical target because it is involved in many neurological disorders. Accordingly, the development of JNK3 inhibitors has gained increasing interest. 2D and 3D versions of the data set were used, consisting of 313 (70 actives) and 249 (60 actives) compounds, respectively. All compounds for which activity was determined only for the racemate were removed from the 3D data set. We investigated the diversity of the data sets by agglomerative clustering with feature trees and show that the data set contains several different scaffolds. Furthermore, we show that the benchmarks can be tackled with standard supervised learning algorithms with convincing performance. For the 2D problem, a random decision forest classifier achieves a Matthews correlation coefficient of 0.744; the 3D problem could be modeled with a Matthews correlation coefficient of 0.524 using 3D pharmacophores and a support vector machine. The performance on both data sets was evaluated within a nested 10-fold cross-validation. We therefore suggest that the data set is a reasonable basis for generating QSAR models for JNK3 because of its diverse composition and the performance of the classifiers presented in this study.
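The Matthews correlation coefficient quoted in the abstract above is a single-number summary of a binary confusion matrix that stays informative when actives are a minority, as they are here (70 of 313 compounds). A minimal sketch of the standard formula follows, in plain Python; the function name and the toy labels are made up for illustration and are not taken from the paper.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example (hypothetical labels): TP=2, FN=1, FP=1, TN=4
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)  # (2*4 - 1*1) / sqrt(3*3*5*5) = 7/15
```

A value of 1 means perfect agreement, 0 is no better than chance, and -1 is total disagreement.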

3.

Comparison of combinatorial clustering methods on pharmacological data sets represented by machine learning-selected real molecular descriptors

Rivera-Borroto OM Marrero-Ponce Y García-de la Vega JM Grau-Ábalo Rdel C 《Journal of chemical information and modeling》2011,51(12):3036-3049


4.

Predictive activity profiling of drugs by topological-fragment-spectra-based support vector machines

Kawai K Fujishima S Takahashi Y 《Journal of chemical information and modeling》2008,48(6):1152-1160

Aiming at the prediction of pleiotropic effects of drugs, we have investigated the multilabel classification of drugs that have one or more of 100 different kinds of activity labels. Structural feature representation of each drug molecule was based on the topological fragment spectra method, which was proposed in our previous work. A support vector machine (SVM) was used for the classification and prediction of activity classes. Multilabel classification was carried out by a set of SVM classifiers. The collective SVM classifiers were trained with a training set of 59,180 compounds and validated by another set (validation set) of 29,590 compounds. For a test set that consists of 9,864 compounds, the classifiers correctly classified 80.8% of the drugs into their own active classes. The SVM classifiers also successfully performed predictions of the activity spectra for multilabel compounds.

5.

Predicting drug-target interaction network using deep learning model

《Computational Biology and Chemistry》2019


6.

Predicting experimental properties of proteins from sequence by machine learning techniques

Smialowski P Martin-Galiano AJ Cox J Frishman D 《Current protein & peptide science》2007,8(2):121-133

Efficient target selection methods are an important prerequisite for increasing the success rate and reducing the cost of high-throughput structural genomics efforts. There is a high demand for sequence-based methods capable of predicting experimentally tractable proteins and filtering out potentially difficult targets at different stages of the structural genomics pipeline. Simple empirical rules based on anecdotal evidence are being increasingly superseded by rigorous machine-learning algorithms. Although the simplicity of less advanced methods makes them easier for humans to understand, more sophisticated formalized algorithms possess superior classification power. The quickly growing corpus of experimental success and failure data gathered by structural genomics consortia creates a unique opportunity for retrospective data mining using machine learning techniques and results in increased quality of classifiers. For example, current solubility prediction methods are reaching accuracies of over 70%. Furthermore, automated feature selection leads to better insight into the nature of the correlation between amino acid sequence and experimental outcome. In this review we summarize methods for predicting experimental success in cloning, expression, soluble expression, purification and crystallization of proteins with a special focus on publicly available resources. We also describe experimental data repositories and machine learning techniques used for classification and feature selection.

7.

Ligand-based models for the isoform specificity of cytochrome P450 3A4, 2D6, and 2C9 substrates

Terfloth L Bienfait B Gasteiger J 《Journal of chemical information and modeling》2007,47(4):1688-1701


8.

A support vector machine approach to classify human cytochrome P450 3A4 inhibitors

Kriegl JM Arnhold T Beck B Fox T 《Journal of computer-aided molecular design》2005,19(3):189-201


9.

Diagnostic pattern recognition on gene-expression profile data by using one-class classification

Xu Y Brereton RG 《Journal of chemical information and modeling》2005,45(5):1392-1401


10.

Information-theoretic approaches to SVM feature selection for metagenome read classification

Garbarine E DePasquale J Gadia V Polikar R Rosen G 《Computational Biology and Chemistry》2011,35(3):199-209

Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost performance of taxonomic classifiers. This work proposes three different filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler, Mutual Information, and distance information, (2) a text mining technique, TF-IDF, and (3) minimum-redundancy-maximum-relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6mer mRMR method performs well, especially at the phylum level. If the number of total features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. Therefore, we conclude that there is a trade-off between feature set size and feature selection method to optimize classification performance. For larger feature set sizes, TF-IDF works better at finer resolutions, while mRMR performs best of all methods for N=6 at all taxonomic levels.
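Composition-based read classification of the kind described above typically starts from k-mer frequency vectors; for the 6mers mentioned in the abstract that gives 4^6 = 4096 candidate features per read, which is what the feature selection step then prunes. The sketch below shows one plausible way to build such a vector in plain Python. The function name and the handling of ambiguous bases are assumptions for illustration, not the authors' pipeline.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(read, k):
    """Return relative frequencies of all 4**k DNA k-mers in a read.

    The vector is ordered lexicographically over the alphabet ACGT.
    K-mers containing other symbols (e.g. N) are ignored, which is an
    assumption made for this sketch.
    """
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Hypothetical read: the 2-mers are AC, CG, GT, so three entries equal 1/3
features = kmer_frequencies("ACGT", 2)
```

A filter method such as those compared in the abstract would then score each of these columns against the taxon labels and keep only the top-ranked ones before training the classifier.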

11.

Merits of random forests emerge in evaluation of chemometric classifiers by external validation

I.M. Scott W. Lin M. Liakata J.E. Wood C.P. Vermeer D. Allaway J.L. Ward J. Draper M.H. Beale D.I. Corol J.M. Baker R.D. King 《Analytica chimica acta》2013

Real-world applications will inevitably entail divergence between samples on which chemometric classifiers are trained and the unknowns requiring classification. This has long been recognized, but there is a shortage of empirical studies on which classifiers perform best in ‘external validation’ (EV), where the unknown samples are subject to sources of variation relative to the population used to train the classifier. A survey of 286 classification studies in analytical chemistry found only 6.6% that stated elements of variance between training and test samples. Instead, most tested classifiers using hold-outs or resampling (usually cross-validation) from the same population used in training. The present study evaluated a wide range of classifiers on NMR and mass spectra of plant and food materials, from four projects with different data properties (e.g., different numbers and prevalence of classes) and classification objectives. Use of cross-validation was found to be optimistic relative to EV on samples of different provenance to the training set (e.g., different genotypes, different growth conditions, different seasons of crop harvest). For classifier evaluations across the diverse tasks, we used rank-based non-parametric comparisons and permutation-based significance tests. Although latent variable methods (e.g., PLSDA) were used in 64% of the surveyed papers, they were among the less successful classifiers in EV, and orthogonal signal correction was counterproductive. Instead, the best EV performances were obtained with machine learning schemes that coped with the high dimensionality (914–1898 features). Random forests confirmed their resilience to high dimensionality, as best overall performers on the full data, despite being used in only 4.5% of the surveyed papers. Most other machine learning classifiers were improved by a feature selection filter (ReliefF), but still did not out-perform random forests.

12.

Filter feature selectors in the development of binary QSAR models

G. Cerruela García J. Pérez-Parras Toledano A. de Haro García N. García-Pedrajas 《SAR and QSAR in environmental research》2019,30(5):313-345

The application of machine learning methods to the construction of quantitative structure–activity relationship models is a complex computational problem in which dimensionality reduction of the representation of the molecular structure plays a fundamental role in predicting a target activity. The feature selection pre-processing approach has been indicated to be effective in dimensionality reduction for building simpler and more understandable models. In this paper, a comparative performance study of 13 state-of-the-art feature selection filter methods is conducted. Structure–activity relationship models are constructed using three widely used classifiers and a diverse collection of datasets. The comparative study utilizes robust statistical tests to compare the algorithms. According to the experimental results, there are substantial differences in performance among the evaluated feature selection methods. The methods that exhibit the best performance are correlation-based feature selection, fast clustering-based feature selection and the set cover method.

13.

Large-scale learning of structure-activity relationships using a linear support vector machine and problem-specific metrics

Hinselmann G Rosenbaum L Jahn A Fechner N Ostermann C Zell A 《Journal of chemical information and modeling》2011,51(2):203-213

The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. This formulation of the linear support vector machine performs excellently when applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity in the number of non-zero features of a prediction. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems up to a size of 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. These reference approaches were outperformed in a direct comparison by LIBLINEAR. A comparison to literature results showed that the LIBLINEAR performance is competitive but without achieving results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
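The leave-cluster-out cross-validation mentioned in the abstract above keeps structurally related compounds together, so that a model is always tested on a chemotype it has not seen during training. A minimal sketch of the splitting logic is given below, assuming each compound already carries a cluster label; holding out exactly one cluster per fold is a simplification chosen for this illustration, and the function name and labels are hypothetical.

```python
from collections import defaultdict


def leave_cluster_out_splits(cluster_ids):
    """Yield (train_indices, test_indices), holding out one cluster per fold."""
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    for held_out in sorted(members):
        test_indices = members[held_out]
        train_indices = [
            index for index, cluster in enumerate(cluster_ids)
            if cluster != held_out
        ]
        yield train_indices, test_indices


# Hypothetical chemotype labels for seven compounds
clusters = ["a", "a", "b", "c", "b", "c", "c"]
splits = list(leave_cluster_out_splits(clusters))
```

Compared with a random split, this tends to give lower but more realistic performance estimates for virtual screening, because near-duplicates of a test compound cannot leak into the training set.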

14.

Considerations and recent advances in QSAR models for cytochrome P450-mediated drug metabolism prediction

Li H Sun J Fan X Sui X Zhang L Wang Y He Z 《Journal of computer-aided molecular design》2008,22(11):843-855

Quantitative structure–activity relationship (QSAR) methods are urgently needed for predicting ADME/T (absorption, distribution, metabolism, excretion and toxicity) properties to select lead compounds for optimization at the early stage of drug discovery, and to screen drug candidates for clinical trials. Use of suitable QSAR models ultimately results in a lower time cost and a lower attrition rate during drug discovery and development. In the case of ADME/T parameters, drug metabolism is a key determinant of metabolic stability, drug–drug interactions, and drug toxicity. QSAR models for predicting drug metabolism have undergone significant advances recently. However, most of the models used lack sufficient interpretability and offer poor predictability for novel drugs. In this review, we describe some considerations to be taken into account by QSAR for modeling drug metabolism, such as the accuracy/consistency of the entire data set, representation and diversity of the training and test sets, and variable selection. We also describe some novel statistical techniques (ensemble methods, multivariate adaptive regression splines and graph machines), which are not yet used frequently to develop QSAR models for drug metabolism. Subsequently, rational recommendations for developing predictive and interpretable QSAR models are made. Finally, the recent advances in QSAR models for cytochrome P450-mediated drug metabolism prediction, including in vivo hepatic clearance, in vitro metabolic stability, inhibitors and substrates of cytochrome P450 families, are briefly summarized.

15.

SVM-based feature selection for characterization of focused compound collections

Byvatov E Schneider G 《Journal of chemical information and computer sciences》2004,44(3):993-999


16.

A maximum common subgraph kernel method for predicting the chromosome aberration test

Mohr J Jain B Sutter A Laak AT Steger-Hartmann T Heinrich N Obermayer K 《Journal of chemical information and modeling》2010,50(10):1821-1838

The chromosome aberration test is frequently used for the assessment of the potential of chemicals and drugs to elicit genetic damage in mammalian cells in vitro. Due to the limitations of experimental genotoxicity testing in early drug discovery phases, a model to predict the chromosome aberration test yielding high accuracy and providing guidance for structure optimization is urgently needed. In this paper, we describe a machine learning approach for predicting the outcome of this assay based on the structure of the investigated compound. The novelty of the proposed method consists in combining a maximum common subgraph kernel for measuring the similarity of two chemical graphs with the potential support vector machine for classification. In contrast to standard support vector machine classifiers, the proposed approach does not provide a black box model but rather allows visualization of structural elements with high positive or negative contribution to the class decision. In order to compare the performance of different methods for predicting the outcome of the chromosome aberration test, we compiled a large data set exhibiting high quality, reliability, and consistency from public sources and configured a fixed cross-validation protocol, which we make publicly available. In a comparison to standard methods currently used in the pharmaceutical industry as well as to other graph kernel approaches, the proposed method achieved significantly better performance.

17.

Objective Supervised Machine Learning-Based Classification and Inference of Biological Neuronal Networks

Michael Taynnan Barros Harun Siljak Peter Mullen Constantinos Papadias Jari Hyttinen Nicola Marchetti 《Molecules (Basel, Switzerland)》2022,27(19)

The classification of biological neuron types and networks poses challenges to the full understanding of the human brain’s organisation and functioning. In this paper, we develop a novel objective classification model of biological neuronal morphology and electrical types and their networks, based on the attributes of neuronal communication using supervised machine learning solutions. This presents advantages compared to the existing approaches in neuroinformatics since the data related to mutual information or delay between neurons obtained from spike trains are more abundant than conventional morphological data. We constructed two open-access computational platforms of various neuronal circuits from the Blue Brain Project realistic models, named Neurpy and Neurgen. Then, we investigated how we could perform network tomography with cortical neuronal circuits for the morphological, topological and electrical classification of neurons. We extracted the simulated data of 10,000 network topology combinations with five layers, 25 morphological type (m-type) cells, and 14 electrical type (e-type) cells. We applied the data to several different classifiers (including Support Vector Machine (SVM), Decision Trees, Random Forest, and Artificial Neural Networks). We achieved accuracies of up to 70%, and the inference of biological network structures using network tomography reached up to 65% accuracy. Objective classification of biological networks can be achieved with cascaded machine learning methods using neuron communication data. SVM methods seem to perform best among the techniques used. Our research not only contributes to existing classification efforts but sets the road-map for future usage of brain–machine interfaces towards an in vivo objective classification of neurons as a sensing mechanism of the brain’s structure.

18.

Predictive models for cytochrome p450 isozymes based on quantitative high throughput screening data

Sun H Veith H Xia M Austin CP Huang R 《Journal of chemical information and modeling》2011,51(10):2474-2481

The human cytochrome P450 (CYP450) isozymes are the most important enzymes in the body for metabolizing many endogenous and exogenous substances, including environmental toxins and therapeutic drugs. Unwanted interactions between a small molecule and CYP450 isozymes can compromise this protective function. Accurately predicting the potential interactions between a small molecule and CYP450 isozymes is highly desirable for assessing the metabolic stability and toxicity of the molecule. The National Institutes of Health Chemical Genomics Center (NCGC) has screened a collection of over 17,000 compounds against the five major isozymes of CYP450 (1A2, 2C9, 2C19, 2D6, and 3A4) in a quantitative high throughput screening (qHTS) format. In this study, we developed support vector classification (SVC) models for these five isozymes using a set of customized generic atom types. The CYP450 data sets were randomly split into equal-sized training and test sets. The optimized SVC models exhibited high predictive power against the test sets for all five CYP450 isozymes, with areas under the receiver operating characteristic (ROC) curve of 0.93, 0.89, 0.89, 0.85, and 0.87 for 1A2, 2C9, 2C19, 2D6, and 3A4, respectively. The important atom types and features extracted from the five models are consistent with the structural preferences for different CYP450 substrates reported in the literature. We also identified novel features with significant discerning power to separate CYP450 actives from inactives. These models can be useful in prioritizing compounds in a drug discovery pipeline or recognizing the toxic potential of environmental chemicals.
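The figures reported in the abstract above are areas under the ROC curve. As a rough sketch of what that number measures, the function below computes it in plain Python as the fraction of active/inactive pairs in which the active compound receives the higher score, with ties counted as half. The function name, labels, and scores are made up for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of actives and inactives."""
    actives = [s for label, s in zip(labels, scores) if label == 1]
    inactives = [s for label, s in zip(labels, scores) if label == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0   # active ranked above inactive
            elif a == i:
                wins += 0.5   # tie counts as half
    return wins / (len(actives) * len(inactives))


# Hypothetical classifier scores: 8 of the 9 active/inactive pairs are ordered correctly
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of compounds, so it is meant to convey the definition rather than to scale to screening sets of the size described here.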

19.

Predicting cis/trans structure of alkene based on infrared spectra.

Yuxi Zhang Qing Xiong Gang Yang Menglong Li Jing Zhang 《Analytical sciences》2007,23(7):911-915

The application of chemometrics to analyze the information on the cis/trans structure of alkenes in infrared (IR) spectra is introduced. For data from the OMNIC IR spectral database, two feature selection methods, Fisher ratios and genetic algorithm-partial least squares (GA-PLS), and two classification methods, support vector machine (SVM) and probabilistic neural network (PNN), have been used to obtain optimized classifiers. Finally, some spectra from other IR databases are used to evaluate the optimized classifiers. It has been demonstrated that both the SVM and PNN optimized classifiers could give good predictive results for the cis and trans structures of alkenes.

20.

Gene selection from microarray data for cancer classification--a machine learning approach

Wang Y Tetko IV Hall MA Frank E Facius A Mayer KF Mewes HW 《Computational Biology and Chemistry》2005,29(1):37-46

A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples which have a large number of gene expression levels as features. Selecting relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naive Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.