Similar literature
 Found 20 similar articles; search took 593 ms
1.
2.
To explore the pathogenic mechanisms of microRNA (miRNA) in diverse diseases, many researchers have concentrated on discovering potential associations between miRNAs and diseases using machine learning methods. However, the prediction accuracy of supervised machine learning methods is limited by the lack of experimentally validated uncorrelated miRNA-disease pairs. Without these negative samples, training a highly accurate model is much more difficult. Unlike traditional miRNA-disease prediction models, which use randomly selected unknown samples as negative training samples, we propose an ensemble learning framework to solve this positive-unlabeled (PU) learning problem. The framework incorporates two steps: a novel semi-supervised Kmeans (SS-Kmeans) method to extract reliable negative samples from unknown miRNA-disease pairs, and a subagging method to generate diverse training sample sets that make full use of those reliable negative samples for ensemble learning. Combined with an effective random vector functional link (RVFL) network as the prediction model, the proposed framework showed superior prediction accuracy compared with other popular approaches. A case study on lung and gastric neoplasms further confirms the framework’s efficacy at identifying miRNA-disease associations.
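
The subagging step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the reliable negatives have already been extracted (the SS-Kmeans step is not shown), that each training set pairs all known positives with an equally sized subsample of negatives drawn without replacement, and that the function name `subagging_sets` and its parameters are hypothetical.

```python
import random


def subagging_sets(positives, reliable_negatives, n_sets=5, seed=0):
    """Build several balanced training sets by subsampling negatives.

    Assumption (not from the abstract): every set keeps all positives and
    draws len(positives) negatives without replacement, so different sets
    see different parts of the reliable-negative pool.
    """
    if len(reliable_negatives) < len(positives):
        raise ValueError("need at least as many reliable negatives as positives")
    rng = random.Random(seed)
    training_sets = []
    for _ in range(n_sets):
        # Subsample without replacement (subagging), unlike bagging's bootstrap.
        negatives = rng.sample(reliable_negatives, len(positives))
        samples = [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]
        rng.shuffle(samples)
        training_sets.append(samples)
    return training_sets


# Hypothetical miRNA-disease pairs, used only to show the data shape.
positives = [("miR-21", "lung neoplasms"), ("miR-155", "gastric neoplasms")]
reliable_negatives = [("miR-%d" % i, "disease-%d" % i) for i in range(10)]
sets = subagging_sets(positives, reliable_negatives, n_sets=3)
print(len(sets), "training sets of size", len(sets[0]))
```

One base learner would then be trained per set and the predictions averaged; the abstract names an RVFL network for that role, which this sketch leaves out.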

3.
We introduce the QuanSA method for inducing physically meaningful field-based models of ligand binding pockets based on structure-activity data alone. The method is closely related to the QMOD approach, substituting a learned scoring field for a pocket constructed of molecular fragments. The problem of mutual ligand alignment is addressed in a general way, and optimal model parameters and ligand poses are identified through multiple-instance machine learning. We provide algorithmic details along with performance results on sixteen structure-activity data sets covering many pharmaceutically relevant targets. In particular, we show how models initially induced from small data sets can extrapolatively identify potent new ligands with novel underlying scaffolds with very high specificity. Further, we show that combining predictions from QuanSA models with those from physics-based simulation approaches is synergistic. QuanSA predictions yield binding affinities, explicit estimates of ligand strain, associated ligand pose families, and estimates of structural novelty and confidence. The method is applicable for fine-grained lead optimization as well as potent new lead identification.

4.
Virtual screening, the task of predicting which compounds within a specified compound library bind to a target molecule (typically a protein), is fundamental to drug discovery. Doing virtual screening well provides tangible practical benefits, including reduced drug development costs, faster time to therapeutic viability, and fewer unforeseen side effects. As with most applied computational tasks, the algorithms currently used to perform virtual screening feature inherent tradeoffs between speed and accuracy. Furthermore, even theoretically rigorous, computationally intensive methods may fail to account for important effects relevant to whether a given compound will ultimately be usable as a drug. Here we investigate the virtual screening performance of the recently released Gnina molecular docking software, which uses deep convolutional networks to score protein-ligand structures. We find, on average, that Gnina outperforms conventional empirical scoring. The default scoring in Gnina outperforms the empirical AutoDock Vina scoring function on 89 of the 117 targets of the DUD-E and LIT-PCBA virtual screening benchmarks, with a median 1% early enrichment factor that is more than twice that of Vina. However, we also find that issues of bias linger in these sets, even when not used directly to train models, and this bias obfuscates to what extent machine learning models are achieving their performance through a sophisticated interpretation of molecular interactions versus fitting to non-informative simplistic property distributions.
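
The "1% early enrichment factor" quoted in this abstract is a standard virtual screening metric: the fraction of actives among the top-ranked 1% of compounds, divided by the fraction of actives in the whole library. The sketch below shows the usual calculation. It assumes that higher scores mean better predicted binding (scores where lower is better, such as docking energies, would need to be negated first) and that the size of the top slice is rounded up; the function name is hypothetical and the benchmarks' exact conventions may differ.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor of the top `fraction` of a ranked library.

    scores: predicted scores, higher = more likely active (assumption).
    labels: 1 for known actives, 0 for decoys/inactives.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("enrichment is undefined without any actives")
    # Size of the top slice; rounding up is one common convention.
    n_top = max(1, math.ceil(fraction * n_total))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)


# Toy library: 200 compounds, 4 actives, two of them ranked at the very top.
labels = [1, 1] + [0] * 98 + [1, 1] + [0] * 98
scores = [float(len(labels) - i) for i in range(len(labels))]
print(enrichment_factor(scores, labels, fraction=0.01))
```

A value of 1 means the ranking is no better than random selection; the maximum achievable value is bounded by the ratio of library size to number of actives.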

5.
Very large data sets of molecules screened against a broad range of targets have become available due to the advent of combinatorial chemistry. This information has led to the realization that ADME (absorption, distribution, metabolism, and excretion) and toxicity issues are important to consider prior to library synthesis. Furthermore, these large data sets provide a unique and important source of information regarding what types of molecular shapes may interact with specific receptor or target classes. Thus, the requirement for rapid and accurate data mining tools became paramount. To address these issues, Pharmacopeia, Inc. formed a computational research group, The Center for Informatics and Drug Discovery (CIDD). In this review we cover the work done by this group to address both the in silico ADME modeling and the data mining issues faced by Pharmacopeia because of the availability of a large and diverse collection (over 6 million discrete compounds) of drug-like molecules. In particular, in the data mining arena we discuss rapid docking tools and how we employ them, and we describe a novel data mining tool based on a 1D representation of a molecule followed by a molecular sequence alignment step. For the ADME area we discuss the development and application of absorption, blood-brain barrier (BBB), and solubility models. Finally, we summarize the impact the tools and approaches might have on the drug discovery process.

6.
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples, each with a large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. To extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naive Bayes, and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper that discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.
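
The abstract does not state how its correlation-based feature selector scores gene subsets. As a hedged illustration, the sketch below uses the widely known CFS merit heuristic, which rewards features that correlate with the class and penalizes features that correlate with each other. Two assumptions are made here: plain Pearson correlation is swapped in as the correlation measure for simplicity, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(features, target):
    """CFS-style merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean absolute feature-class correlation and r_ff the mean
    absolute feature-feature correlation over the k features in the subset.
    """
    k = len(features)
    if k == 0:
        raise ValueError("feature subset must not be empty")
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy expression values for four samples; class labels are 0/1.
target = [0, 0, 1, 1]
gene_a = [0.1, 0.2, 0.9, 1.0]   # tracks the class closely
gene_b = [0.5, 0.4, 0.5, 0.4]   # uncorrelated with the class
print(cfs_merit([gene_a], target), cfs_merit([gene_a, gene_b], target))
```

A selector would search over subsets for the highest merit; in this toy case, adding the uninformative `gene_b` lowers the score, and adding a duplicate of `gene_a` leaves it unchanged.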

7.
The extraction of SAR information from structurally diverse compound data sets is a challenging task. One of the focal points of systematic SAR analysis is the search for activity cliffs, that is, structurally similar compounds having large potency differences, from which SAR determinants can be deduced. The assessment of SAR information is usually based on pairwise similarity and potency comparisons of data set compounds. As a consequence, activity cliffs are mostly evaluated at a compound pair level. Here, we present an extension of the activity cliff concept by introducing "activity ridges" that are formed by overlapping "combinatorial" activity cliffs between participating compounds, giving rise to ridge-like structures in activity landscapes. Activity ridges are rich in SAR information. In a systematic analysis of 242 compound data sets, we have identified well-defined activity ridges in 71 different sets. In addition, an information-theoretic approach has been devised to characterize the structural composition of activity ridges. Taken together, our results show that activity ridges frequently occur in sets of active compounds and that different categories of ridges can be distinguished on the basis of their structural content. The computational identification of activity ridges provides access to compound subsets having high priority for SAR analysis.
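
The pairwise definition of an activity cliff given in this abstract (structurally similar compounds with a large potency difference) can be sketched as follows. The abstract's ridge construction is not reproduced. The sketch assumes fingerprints are represented as sets of feature identifiers, similarity is measured with the Tanimoto coefficient, and potency is on a logarithmic scale such as pIC50; both thresholds and all names are illustrative assumptions, not values from the paper.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of feature ids."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def activity_cliffs(compounds, min_similarity=0.8, min_delta=2.0):
    """Return name pairs that are similar in structure but far apart in potency.

    compounds: list of (name, fingerprint_set, potency) tuples.
    min_delta of 2.0 log units corresponds to a 100-fold potency difference.
    """
    cliffs = []
    for (name_a, fp_a, pot_a), (name_b, fp_b, pot_b) in combinations(compounds, 2):
        similar = tanimoto(fp_a, fp_b) >= min_similarity
        large_gap = abs(pot_a - pot_b) >= min_delta
        if similar and large_gap:
            cliffs.append((name_a, name_b))
    return cliffs


# Hypothetical compounds: A and C are potent, B is a weak close analog of both.
compounds = [
    ("A", set(range(1, 11)), 9.0),
    ("B", set(range(1, 10)) | {11}, 6.5),
    ("C", set(range(1, 11)), 8.8),
    ("D", {20, 21, 22}, 5.0),
]
print(activity_cliffs(compounds))
```

In this toy set, compound B takes part in two cliffs at once, which is the kind of overlap between cliffs that the abstract describes as the basis for ridges.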

8.
9.
10.
11.
12.
Methods that can screen large databases to retrieve a structurally diverse set of compounds with desirable bioactivity properties are critical in the drug discovery and development process. This paper presents a set of such methods that are designed to find compounds that are structurally different from a given query compound while retaining its bioactivity properties (scaffold hops). These methods use various indirect ways of measuring the similarity between the query and a compound that take into account additional information beyond their structure-based similarities. The techniques presented capture these indirect similarities by analyzing the similarity network formed by the query and the database compounds. Experimental evaluation shows that most of these methods substantially outperform previously developed approaches in their ability to identify both structurally diverse active compounds and active compounds in general.

13.
14.
Similar to advancements gained from big data in genomics, security, internet of things, and e-commerce, the materials workflow could be made more efficient and prolific through advances in streamlining data sources, autonomous materials synthesis, rapid characterization, big data analytics, and self-learning algorithms. In electrochemical materials science, data sets are large, unstructured/heterogeneous, and difficult to process and analyze from a single data channel or platform. Computer-aided materials design together with advances in data mining, machine learning, and predictive analytics are expected to provide inexpensive and accelerated pathways towards tailor-made functionally optimized energy materials. Fundamental research in the field of electrochemical energy materials focuses primarily on complex interfacial phenomena and kinetic electrocatalytic processes. This perspective article critically assesses AI-driven modeling and computational approaches that are currently applied to those objects. An application-driven materials intelligence platform is introduced, and its functionalities are scrutinized considering the development of electrocatalyst materials for CO2 conversion as a use case.

15.
16.
17.
18.
Protein function prediction is a crucial task in the post-genomics era because of the diverse, irreplaceable roles proteins play in a biological system. Traditional methods relied on cost-intensive and time-consuming molecular biology techniques, but these could not keep up with the explosion of sequencing data brought about by cost-effective and advanced sequencing techniques. To keep the pace of annotation in step with that of data generation, there has been a shift to computational approaches based on homology, sequence and structure-based features, protein-protein interaction networks, phylogenetic profiles, and physicochemical properties. A combination of these features has proven promising for improving the accuracy of protein function prediction. In the present work, we employed a combination of sequence, physicochemical property, subsequence, and annotation features, with a total of 9890 features extracted and/or calculated for 171,212 reviewed prokaryotic proteins of 9 bacterial phyla from UniProtKB, to train a supervised deep learning ensemble model that categorizes the function of a bacterial hypothetical/unreviewed protein into 1739 GO terms as functional classes. The proposed system, being fully dedicated to bacterial organisms, is a novel attempt among the various existing machine learning based protein function prediction systems, which are based on mixed organisms. Experimental results demonstrate the success of the proposed deep learning ensemble model based on the deep neural network method, with an F1 measure of 0.7912 on the prepared Test dataset 1 of reviewed proteins.

19.
The process of drug discovery is a complex and high-risk endeavor that requires focused attention on experimental hypotheses and the application of diverse sets of technologies and data to facilitate high-quality decision-making. All of this is aimed at enhancing the quality of the chemical development candidate(s) through clinical evaluation and into the market. In support of the lead generation and optimization phases of this endeavor, high-throughput technologies such as combinatorial/high-throughput synthesis and high-throughput and ultra-high-throughput screening have allowed the rapid generation and analysis of large numbers of compounds and data. Today, for every analog synthesized, 100 or more data points can be collected and captured in various centralized databases. The analysis of thousands of compounds can very quickly become a daunting task. In this article we present the process we have developed for both analyzing and prioritizing large sets of data, starting from diversity and focused uHTS in support of lead generation and secondary screens supporting lead optimization. We describe how we use informatics and computational chemistry to focus our efforts on asking relevant questions about the desired attributes of a specific library, and subsequently to guide the generation of more information-rich sets of analogs in support of both processes.

20.