1.

ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning

Hayley Weir, Keiran Thompson, Amelia Woodward, Benjamin Choi, Augustin Braun, Todd J. Martínez. Chemical Science, 2021, 12(31): 10622

Inputting molecules into chemistry software, such as quantum chemistry packages, currently requires domain expertise, expensive software and/or cumbersome procedures. Leveraging recent breakthroughs in machine learning, we develop ChemPix: an offline, hand-drawn hydrocarbon structure recognition tool designed to remove these barriers. A neural image captioning approach consisting of a convolutional neural network (CNN) encoder and a long short-term memory (LSTM) decoder learned a mapping from photographs of hand-drawn hydrocarbon structures to machine-readable SMILES representations. We generated a large auxiliary training dataset, based on RDKit molecular images, by combining image augmentation, image degradation and background addition. Additionally, a small dataset of ∼600 hand-drawn hydrocarbon chemical structures was crowd-sourced using a phone web application. These datasets were used to train the image-to-SMILES neural network with the goal of maximizing the hand-drawn hydrocarbon recognition accuracy. By forming a committee of the trained neural networks where each network casts one vote for the predicted molecule, we achieved a nearly 10 percentage point improvement of the molecule recognition accuracy and were able to assign a confidence value for the prediction based on the number of agreeing votes. The ensemble model achieved an accuracy of 76% on hand-drawn hydrocarbons, increasing to 86% if the top 3 predictions were considered.

Offline recognition of hand-drawn hydrocarbon structures is learned using an image-to-SMILES neural network through the application of synthetic data generation and ensemble learning.
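The committee vote described in the abstract above can be illustrated with a minimal sketch. It assumes each trained network has already produced one SMILES string for the same image; the function name `committee_vote` and the example votes are hypothetical and not taken from the ChemPix code. Confidence is taken here as the fraction of agreeing votes, which is one plausible reading of "based on the number of agreeing votes".

```python
from collections import Counter


def committee_vote(predictions, top_k=3):
    """Rank candidate SMILES strings by the number of committee votes.

    predictions: list with one predicted SMILES string per network.
    Returns up to top_k (smiles, confidence) pairs, where confidence is
    the fraction of networks that voted for that string.
    Assumption: strings are compared verbatim. In practice they would
    likely be canonicalized first so that equivalent SMILES share votes.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    total = len(predictions)
    return [(smiles, votes / total) for smiles, votes in counts.most_common(top_k)]


# Hypothetical votes from a committee of five networks.
votes = ["CC=C", "CC=C", "CCC", "CC=C", "CCCC"]
ranked = committee_vote(votes)
best_smiles, best_confidence = ranked[0]
```

Returning the top few candidates rather than only the winner mirrors the abstract's top-3 accuracy figure: a downstream user can inspect the runner-up structures when the winning confidence is low.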

2.

Img2Mol – accurate SMILES recognition from molecular graphical depictions

Djork-Arné Clevert, Tuan Le, Robin Winter, Floriane Montanari. Chemical Science, 2021, 12(42): 14174

The automatic recognition of the molecular content of a molecule's graphical depiction is an extremely challenging problem that remains largely unsolved despite decades of research. Recent advances in neural machine translation enable the auto-encoding of molecular structures in a continuous vector space of fixed size (latent representation) with low reconstruction errors. In this paper, we present a fast and accurate model combining deep convolutional neural network learning from molecule depictions and a pre-trained decoder that translates the latent representation into the SMILES representation of the molecules. This combination allows us to precisely infer a molecular structure from an image. Our rigorous evaluation shows that Img2Mol is able to correctly translate up to 88% of the molecular depictions into their SMILES representation. A pretrained version of Img2Mol is made publicly available on GitHub for non-commercial users.

The automatic recognition of the molecular content of a molecule's graphical depiction is an extremely challenging problem that remains largely unsolved despite decades of research.

3.

Model agnostic generation of counterfactual explanations for molecules

Geemi P. Wellawatte, Aditi Seshadri, Andrew D. White. Chemical Science, 2022, 13(13): 3697

An outstanding challenge in deep learning in chemistry is its lack of interpretability. The inability of explaining why a neural network makes a prediction is a major barrier to deployment of AI models. This not only dissuades chemists from using deep learning predictions, but also has led to neural networks learning spurious correlations that are difficult to notice. Counterfactuals are a category of explanations that provide a rationale behind a model prediction with satisfying properties like providing chemical structure insights. Yet, counterfactuals have been previously limited to specific model architectures or required reinforcement learning as a separate process. In this work, we show a universal model-agnostic approach that can explain any black-box model prediction. We demonstrate this method on random forest models, sequence models, and graph neural networks in both classification and regression.

Generating model agnostic molecular counterfactual explanations to explain model predictions.

4.

A transferable active-learning strategy for reactive molecular force fields

Tom A. Young, Tristan Johnston-Wood, Volker L. Deringer, Fernanda Duarte. Chemical Science, 2021, 12(32): 10944

Predictive molecular simulations require fast, accurate and reactive interatomic potentials. Machine learning offers a promising approach to construct such potentials by fitting energies and forces to high-level quantum-mechanical data, but doing so typically requires considerable human intervention and data volume. Here we show that, by leveraging hierarchical and active learning, accurate Gaussian Approximation Potential (GAP) models can be developed for diverse chemical systems in an autonomous manner, requiring only hundreds to a few thousand energy and gradient evaluations on a reference potential-energy surface. The approach uses separate intra- and inter-molecular fits and employs a prospective error metric to assess the accuracy of the potentials. We demonstrate applications to a range of molecular systems with relevance to computational organic chemistry: ranging from bulk solvents, a solvated metal ion and a metallocage onwards to chemical reactivity, including a bifurcating Diels–Alder reaction in the gas phase and non-equilibrium dynamics (a model S_N2 reaction) in explicit solvent. The method provides a route to routinely generating machine-learned force fields for reactive molecular systems.

An efficient strategy for training Gaussian Approximation Potential (GAP) models to study chemical reactions using hierarchical and active learning.

5.

MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction

Ziduo Yang, Weihe Zhong, Lu Zhao, Calvin Yu-Chian Chen. Chemical Science, 2022, 13(3): 816

Predicting drug–target affinity (DTA) is beneficial for accelerating drug discovery. Graph neural networks (GNNs) have been widely used in DTA prediction. However, existing shallow GNNs are insufficient to capture the global structure of compounds. Besides, the interpretability of graph-based DTA models relies heavily on the graph attention mechanism, which cannot reveal the global relationship between each atom of a molecule. In this study, we proposed a deep multiscale graph neural network based on chemical intuition for DTA prediction (MGraphDTA). We introduced a dense connection into the GNN and built a super-deep GNN with 27 graph convolutional layers to capture the local and global structure of the compound simultaneously. We also developed a novel visual explanation method, gradient-weighted affinity activation mapping (Grad-AAM), to analyze a deep learning model from the chemical perspective. We evaluated our approach using seven benchmark datasets and compared the proposed method to the state-of-the-art deep learning (DL) models. MGraphDTA outperforms other DL-based approaches significantly on various datasets. Moreover, we show that Grad-AAM creates explanations that are consistent with pharmacologists, which may help us gain chemical insights directly from data beyond human perception. These advantages demonstrate that the proposed method improves the generalization and interpretation capability of DTA prediction modeling.

MGraphDTA is designed to capture the local and global structure of a compound simultaneously for drug–target affinity prediction and can provide explanations that are consistent with pharmacologists.

6.

Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks

Alex K. Chew, Shengli Jiang, Weiqi Zhang, Victor M. Zavala, Reid C. Van Lehn. Chemical Science, 2020, 11(46): 12464

The rates of liquid-phase, acid-catalyzed reactions relevant to the upgrading of biomass into high-value chemicals are highly sensitive to solvent composition and identifying suitable solvent mixtures is theoretically and experimentally challenging. We show that the complex atomistic configurations of reactant–solvent environments generated by classical molecular dynamics simulations can be exploited by 3D convolutional neural networks to enable accurate predictions of Brønsted acid-catalyzed reaction rates for model biomass compounds. We develop a 3D convolutional neural network, which we call SolventNet, and train it to predict acid-catalyzed reaction rates using experimental reaction data and corresponding molecular dynamics simulation data for seven biomass-derived oxygenates in water–cosolvent mixtures. We show that SolventNet can predict reaction rates for additional reactants and solvent systems an order of magnitude faster than prior simulation methods. This combination of machine learning with molecular dynamics enables the rapid, high-throughput screening of solvent systems and identification of improved biomass conversion conditions.

Solvent-mediated, acid-catalyzed reaction rates relevant to the upgrading of biomass into high-value chemicals are accurately predicted using a combination of molecular dynamics simulations and 3D convolutional neural networks.

7.

Physically inspired deep learning of molecular excitations and photoemission spectra

Julia Westermayr, Reinhard J. Maurer. Chemical Science, 2021, 12(32): 10755

Modern functional materials consist of large molecular building blocks with significant chemical complexity which limits spectroscopic property prediction with accurate first-principles methods. Consequently, a targeted design of materials with tailored optoelectronic properties by high-throughput screening is bound to fail without efficient methods to predict molecular excited-state properties across chemical space. In this work, we present a deep neural network that predicts charged quasiparticle excitations for large and complex organic molecules with a rich elemental diversity and a size well out of reach of accurate many body perturbation theory calculations. The model exploits the fundamental underlying physics of molecular resonances as eigenvalues of a latent Hamiltonian matrix and is thus able to accurately describe multiple resonances simultaneously. The performance of this model is demonstrated for a range of organic molecules across chemical composition space and configuration space. We further showcase the model capabilities by predicting photoemission spectra at the level of the GW approximation for previously unseen conjugated molecules.

A physically-inspired machine learning model for orbital energies is developed that can be augmented with delta learning to obtain photoemission spectra, ionization potentials, and electron affinities with experimental accuracy.

8.

Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Mingjian Wen, Samuel M. Blau, Xiaowei Xie, Shyam Dwaraknath, Kristin A. Persson. Chemical Science, 2022, 13(5): 1446

Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data to avoid overfitting, which would otherwise lead to low accuracy and transferability. In this work, we propose a strategy to leverage unlabelled data to learn accurate ML models for small labelled chemical reaction data. We focus on an old and prominent problem, classifying reactions into distinct families, and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center and find they are the key for the model to extract relevant information from unlabelled data to aid the reaction classification task. The transfer learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets, as well as those based on reaction fingerprints derived from masked language modelling. In addition to reaction classification, the effectiveness of the strategy is tested on regression datasets; the learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.

Contrastive pretraining of chemical reactions by matching augmented reaction representations to improve machine learning performance on small reaction datasets.
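The contrastive objective in the abstract above (pull two augmented views of the same reaction together, push other reactions away) can be sketched with an NT-Xent style loss. This is an assumption made for illustration: the abstract does not state which loss the authors use, and the function name, temperature value and toy embeddings below are hypothetical. Embeddings are assumed to be non-zero vectors.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """NT-Xent style contrastive loss over two batches of embeddings.

    view_a[i] and view_b[i] are embeddings of two augmented versions of
    reaction i. For every embedding, its partner view is the positive and
    all other embeddings in the combined batch act as negatives.
    """
    n = len(view_a)
    z = [_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the partner view
        # Temperature-scaled cosine similarity to every other embedding.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        }
        # Numerically stable log-sum-exp over all candidates.
        peak = max(sims.values())
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += log_denominator - sims[positive]
    return total / (2 * n)


# Toy embeddings: each row of view_b is a slightly perturbed copy of view_a.
view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view_b = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
aligned_loss = nt_xent_loss(view_a, view_b)
# Pairing each reaction with the wrong partner should give a higher loss.
mismatched_loss = nt_xent_loss(view_a, view_b[1:] + view_b[:1])
```

In a real pretraining run the embeddings would come from the GNN applied to two chemically consistent augmentations of each reaction, and the loss would be minimized by gradient descent; the pure-Python version here only shows how the quantity is computed.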

9.

Deep generative design with 3D pharmacophoric constraints

Fergus Imrie, Thomas E. Hadfield, Anthony R. Bradley, Charlotte M. Deane. Chemical Science, 2021, 12(43): 14577

Generative models have increasingly been proposed as a solution to the molecular design problem. However, it has proved challenging to control the design process or incorporate prior knowledge, limiting their practical use in drug discovery. In particular, generative methods have made limited use of three-dimensional (3D) structural information even though this is critical to binding. This work describes a method to incorporate such information and demonstrates the benefit of doing so. We combine an existing graph-based deep generative model, DeLinker, with a convolutional neural network to utilise physically-meaningful 3D representations of molecules and target pharmacophores. We apply our model, DEVELOP, to both linker and R-group design, demonstrating its suitability for both hit-to-lead and lead optimisation. The 3D pharmacophoric information results in improved generation and allows greater control of the design process. In multiple large-scale evaluations, we show that including 3D pharmacophoric constraints results in substantial improvements in the quality of generated molecules. On a challenging test set derived from PDBbind, our model improves the proportion of generated molecules with high 3D similarity to the original molecule by over 300%. In addition, DEVELOP recovers 10× more of the original molecules compared to the baseline DeLinker method. Our approach is general-purpose, readily modifiable to alternate 3D representations, and can be incorporated into other generative frameworks. Code is available at https://github.com/oxpig/DEVELOP.

A novel deep generative model combines convolution and graph neural networks to allow 3D-aware molecular design. We show how 3D pharmacophoric information can be incorporated into generative models and apply our model to both linker and R-group design.

10.

DeepReac+: deep active learning for quantitative modeling of organic chemical reactions

Yukang Gong, Dongyu Xue, Guohui Chuai, Jing Yu, Qi Liu. Chemical Science, 2021, 12(43): 14459

Various computational methods have been developed for quantitative modeling of organic chemical reactions; however, the lack of universality as well as the requirement of large amounts of experimental data limit their broad applications. Here, we present DeepReac+, an efficient and universal computational framework for prediction of chemical reaction outcomes and identification of optimal reaction conditions based on deep active learning. Under this framework, DeepReac is designed as a graph-neural-network-based model, which directly takes 2D molecular structures as inputs and automatically adapts to different prediction tasks. In addition, carefully-designed active learning strategies are incorporated to substantially reduce the number of necessary experiments for model training. We demonstrate the universality and high efficiency of DeepReac+ by achieving the state-of-the-art results with a minimum of labeled data on three diverse chemical reaction datasets in several scenarios. Collectively, DeepReac+ has great potential and utility in the development of AI-aided chemical synthesis. DeepReac+ is freely accessible at https://github.com/bm2-lab/DeepReac.

Based on GNNs and active learning, DeepReac+ is designed as a universal framework for quantitative modeling of chemical reactions. It takes molecular structures as inputs directly and adapts to various prediction tasks with fewer training data.

11.

Machine learning of solvent effects on molecular spectra and reactions

Michael Gastegger, Kristof T. Schütt, Klaus-Robert Müller. Chemical Science, 2021, 12(34): 11473

Fast and accurate simulation of complex chemical systems in environments such as solutions is a long-standing challenge in theoretical chemistry. In recent years, machine learning has extended the boundaries of quantum chemistry by providing highly accurate and efficient surrogate models of electronic structure theory, which previously have been out of reach for conventional approaches. Those models have long been restricted to closed molecular systems without accounting for environmental influences, such as external electric and magnetic fields or solvent effects. Here, we introduce the deep neural network FieldSchNet for modeling the interaction of molecules with arbitrary external fields. FieldSchNet offers access to a wealth of molecular response properties, enabling it to simulate a wide range of molecular spectra, such as infrared, Raman and nuclear magnetic resonance. Beyond that, it is able to describe implicit and explicit molecular environments, operating as a polarizable continuum model for solvation or in a quantum mechanics/molecular mechanics setup. We employ FieldSchNet to study the influence of solvent effects on molecular spectra and a Claisen rearrangement reaction. Based on these results, we use FieldSchNet to design an external environment capable of lowering the activation barrier of the rearrangement reaction significantly, demonstrating promising avenues for inverse chemical design.

A machine learning approach for modeling the influence of external environments and fields on molecules has been developed, which allows the prediction of various types of molecular spectra in vacuum and under implicit and explicit solvation.

12.

Metabolite profiling and beyond: approaches for the rapid processing and annotation of human blood serum mass spectrometry data

Jan Stanstrup, Michael Gerlich, Lars Ove Dragsted, Steffen Neumann. Analytical and Bioanalytical Chemistry, 2013, 405(15): 5037-5048

In this paper, we describe data processing and metabolite identification approaches which lead to a rapid and semi-automated interpretation of metabolomics experiments. Data from metabolite fingerprinting using LC-ESI-Q-TOF/MS were processed with several open-source software packages, including XCMS and CAMERA to detect features and group features into compound spectra. Next, we describe the automatic scheduling of tandem mass spectrometry (MS) acquisitions to acquire a large number of MS/MS spectra, and the subsequent processing and computer-assisted annotation towards identification using the R packages MetShot, Rdisop, and the MetFusion application. We also implement a simple retention time prediction model using predicted lipophilicity logD, which predicts retention times within 42 s (6 min gradient) for most compounds in our setup. We putatively identified 44 common metabolites including several amino acids and phospholipids at metabolomics standards initiative (MSI) levels two and three and confirmed the majority of them by comparison with authentic standards at MSI level one. To aid both data integration within and data sharing between laboratories, we integrated data from two labs and mapped retention times between the chromatographic systems. Despite the different MS instrumentation and different chromatographic gradient programs, the mapped retention times agree within 26 s (20 min gradient) for 90 % of the mapped features.