Similar literature
 20 similar documents found (search time: 2 ms)
1.
A large-scale similarity search investigation has been carried out on 266 well-defined compound activity classes extracted from the ChEMBL database. The analysis was performed using two widely applied two-dimensional (2D) fingerprints that mark opposite ends of the current performance spectrum of these types of fingerprints, i.e., MACCS structural keys and the extended connectivity fingerprint with bond diameter four (ECFP4). For each fingerprint, three nearest neighbor search strategies were applied. On the basis of these search calculations, a similarity search profile of the ChEMBL database was generated. Overall, the fingerprint search campaign was surprisingly successful. In 203 of 266 test cases (~76%), a compound recovery rate of at least 50% was observed with at least the better performing fingerprint and one search strategy. The similarity search profile also revealed several general trends. For example, fingerprint searching was often characterized by an early enrichment of active compounds in database selection sets. In addition, compound activity classes have been categorized according to different similarity search performance levels, which helps to put the results of benchmark calculations into perspective. Therefore, a compendium of activity classes falling into different search performance categories is provided. On the basis of our large-scale investigation, the performance range of state-of-the-art 2D fingerprinting has been delineated for compound data sets directed against a wide spectrum of pharmaceutical targets.
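The search procedure summarized in this abstract can be illustrated with a minimal sketch. It assumes fingerprints are represented as sets of on-bit indices, that similarity is measured with the Tanimoto coefficient, and that the search strategy is 1-nearest-neighbor ranking; the abstract does not name its three strategies, so this choice, the function names, and the toy fingerprints are all hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_nearest_neighbor(references, database):
    """Rank database entries by their highest similarity to any reference (1-NN)."""
    scored = [
        (max(tanimoto(fp, ref) for ref in references), name)
        for name, fp in database.items()
    ]
    # Highest score first; ties are broken alphabetically by name.
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


def recovery_rate(ranking, actives, selection_size):
    """Fraction of known actives found among the top-ranked selection."""
    selected = set(ranking[:selection_size])
    return len(selected & actives) / len(actives)


# Toy data: hypothetical fingerprints, not taken from ChEMBL.
references = [frozenset({1, 2, 3, 4})]
database = {
    "active_a": frozenset({1, 2, 3, 5}),
    "active_b": frozenset({1, 2, 4, 6}),
    "decoy_c": frozenset({7, 8, 9}),
    "decoy_d": frozenset({1, 8, 9, 10}),
}
ranking = rank_by_nearest_neighbor(references, database)
rate = recovery_rate(ranking, {"active_a", "active_b"}, selection_size=2)
```

In this toy case both actives rank ahead of the decoys, so the recovery rate for a selection set of two compounds is 1.0. Real MACCS or ECFP4 fingerprints would come from a cheminformatics toolkit rather than hand-written sets.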

2.
3.
Publicly available compound activity data have been analyzed to distinguish between compounds for which single or multiple potency measurements were available and gain insight into data confidence levels. Different potency measurements with defined end points and alternative ways to represent multiple potency values for active compounds have been evaluated in the context of SAR analysis. Approximately 78% of all compounds with multiple potency measurements were found to represent high-confidence data, which corresponded to ~10% of all activity data. The use of different types of potency measurements and alternative representations of multiple potency values changed the SAR information content of compound data sets and resulted in different activity cliff distributions. Thus, the types of activity measurements that were available and how they were used substantially impacted SAR analysis. Compounds with multiple Ki measurements provided the most reliable basis for SAR exploration.

4.
The characterization of structure-activity relationship (SAR) features of large compound data sets has been a hot topic in recent years, and different methods for large-scale SAR analysis have been introduced. The exploration of local SAR components and prioritization of compound subsets have thus far mostly relied on graphical analysis methods that capture similarity and potency relationships in a systematic manner. A currently unsolved problem in large-scale SAR analysis is how to automatically select those compound subsets from large data sets that carry the most SAR information. For this purpose, we introduce a numerical optimization scheme that is based on particle swarm optimization guided by an SAR scoring function. The methodology is applied to four large compound sets. We demonstrate that compound subsets representing the most discontinuous local SARs are consistently selected through particle swarm optimization.

5.
The extraction of SAR information from structurally diverse compound data sets is a challenging task. One of the focal points of systematic SAR analysis is the search for activity cliffs, that is, structurally similar compounds having large potency differences, from which SAR determinants can be deduced. The assessment of SAR information is usually based on pairwise similarity and potency comparisons of data set compounds. As a consequence, activity cliffs are mostly evaluated at a compound pair level. Here, we present an extension of the activity cliff concept by introducing "activity ridges" that are formed by overlapping "combinatorial" activity cliffs between participating compounds, giving rise to ridge-like structures in activity landscapes. Activity ridges are rich in SAR information. In a systematic analysis of 242 compound data sets, we have identified well-defined activity ridges in 71 different sets. In addition, an information-theoretic approach has been devised to characterize the structural composition of activity ridges. Taken together, our results show that activity ridges frequently occur in sets of active compounds and that different categories of ridges can be distinguished on the basis of their structural content. The computational identification of activity ridges provides access to compound subsets having high priority for SAR analysis.
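The pairwise definition of an activity cliff given in this abstract (structurally similar compounds with a large potency difference) can be sketched as follows. This is only an illustration of the pair-level concept, not of the activity ridge method itself. The Tanimoto similarity measure, both thresholds, the logarithmic potency scale, and the toy compounds are assumptions made for the example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(compounds, min_similarity=0.55, min_potency_gap=2.0):
    """Return name pairs that are similar yet differ strongly in potency.

    compounds maps a name to (fingerprint set, potency on a log scale).
    Both thresholds are illustrative assumptions, not values from the source.
    """
    cliffs = []
    for n1, n2 in combinations(sorted(compounds), 2):
        fp1, p1 = compounds[n1]
        fp2, p2 = compounds[n2]
        if tanimoto(fp1, fp2) >= min_similarity and abs(p1 - p2) >= min_potency_gap:
            cliffs.append((n1, n2))
    return cliffs


# Hypothetical toy compounds: (on-bit set, log-scale potency).
compounds = {
    "cpd_1": ({1, 2, 3, 4}, 9.0),
    "cpd_2": ({1, 2, 3, 5}, 6.5),
    "cpd_3": ({1, 2, 3, 4, 6}, 8.8),
    "cpd_4": ({7, 8, 9}, 5.0),
}
cliffs = find_activity_cliffs(compounds)
```

Only cpd_1 and cpd_2 form a cliff here: cpd_1 and cpd_3 are similar but nearly equipotent, and cpd_4 shares no bits with the others.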

6.
7.
An in silico ADME/Tox prediction tool based on substructural analysis has been developed. The tool, called SUBSTRUCT, has been used to predict CNS activity. Data sets with CNS active and nonactive drugs were extracted from the World Drug Index (WDI). The SUBSTRUCT program predicts CNS activity as well as a much more complicated artificial neural network model. SUBSTRUCT separates the data sets with approximately 80% accuracy. Substructural analysis also shows surprisingly large differences in substructure profiles between CNS active and nonactive drugs.

8.
In recent years, classifiers generated with kernel-based methods, such as support vector machines (SVM), Gaussian processes (GP), regularization networks (RN), and binary kernel discrimination (BKD), have been very popular in chemoinformatics data analysis. Aizerman et al. were the first to introduce the notion of employing kernel-based classifiers in the area of pattern recognition. Their original scheme, which they termed the potential function method (PFM), can basically be viewed as a kernel-based perceptron procedure and arguably subsumes the modern kernel-based algorithms. PFM can be computationally much cheaper than modern kernel-based classifiers; furthermore, PFM is far simpler conceptually and easier to implement than the SVM, GP, and RN algorithms. Unfortunately, unlike, e.g., SVM, GP, and RN, PFM is not endowed with both theoretical guarantees and practical strategies to safeguard it against generating overfitting classifiers. This is, in our opinion, the reason why this simple and elegant method has not been taken up in chemoinformatics. In this paper we empirically address this drawback: while maintaining its simplicity, we demonstrate that PFM combined with a simple regularization scheme may yield binary classifiers that can be, in practice, as efficient as classifiers obtained by employing state-of-the-art kernel-based methods. Using a realistic classification example, the augmented PFM was used to generate binary classifiers. Using a large chemical data set, the generalization ability of PFM classifiers was then compared with the prediction power of Laplacian-modified naive Bayesian (LmNB), Winnow (WN), and SVM classifiers.
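Since the abstract describes PFM as essentially a kernel-based perceptron, a generic kernel perceptron gives a rough idea of how such a classifier is trained. The sketch below is a textbook formulation, not the authors' implementation: the RBF kernel, the epoch limit, and the XOR toy data are assumptions, and the regularization scheme mentioned in the abstract is omitted because it is not described there.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; the kernel choice is an assumption for this sketch."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decision_value(x, samples, labels, alpha, kernel):
    """Weighted sum of kernel evaluations against all training samples."""
    return sum(
        a * y_j * kernel(x_j, x)
        for a, x_j, y_j in zip(alpha, samples, labels)
        if a
    )


def train_kernel_perceptron(samples, labels, kernel=rbf_kernel, epochs=10):
    """Return per-sample mistake counts (alpha) for labels in {-1, +1}."""
    alpha = [0] * len(samples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(samples, labels)):
            if y * decision_value(x, samples, labels, alpha, kernel) <= 0:
                alpha[i] += 1  # misclassified: strengthen this sample's influence
                mistakes += 1
        if mistakes == 0:  # converged on the training set
            break
    return alpha


def predict(x, samples, labels, alpha, kernel=rbf_kernel):
    """Classify x as +1 or -1 from the sign of the decision value."""
    return 1 if decision_value(x, samples, labels, alpha, kernel) > 0 else -1


# XOR toy problem: not linearly separable, but separable with an RBF kernel.
samples = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [-1, -1, 1, 1]
alpha = train_kernel_perceptron(samples, labels)
predictions = [predict(x, samples, labels, alpha) for x in samples]
```

Without any safeguard, a loop like this keeps adjusting until the training data are fit, which is the overfitting risk the abstract attributes to the unregularized method.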

9.
The tremendous increase of chemical data sets, both in size and number, and the simultaneous desire to speed up the drug discovery process have resulted in an increasing need for a new generation of computational tools that assist in the extraction of information from data and allow for rapid and in-depth data mining. During recent years, visual data mining has become an important tool within the life sciences and drug discovery area, with the potential to help prevent data analysis from turning into a bottleneck. In this paper, we present InfVis, a platform-independent visual data mining tool for the visualization, exploration, and analysis of multivariate data sets, aimed at chemists, who usually have little experience with classical data mining tools. InfVis represents multidimensional data sets by using intuitive 3D glyph information visualization techniques. Interactive and dynamic tools such as dynamic query devices allow real-time, interactive data set manipulations and support the user in the identification of relationships and patterns. InfVis has been implemented in Java and Java3D and can be run on a broad range of platforms and operating systems. It can also be embedded as an applet in Web-based interfaces. We present examples detailing the analysis of a reaction database that demonstrate how InfVis assists chemists in identifying and extracting hidden information.

10.
Summary: Three-dimensional (3D)-database searches are now being widely applied to determine potential new active molecules. Many structural data sets obtained as a result of these searches are still large in size. In this paper we apply molecular similarity calculations as a rapid method to screen two such data sets. In the first investigation, synthetic candidates, produced as a result of a tendamistat β-turn mimic search, were tested for their ability to imitate the β-turn backbone. In the second study, structures extracted through a histamine pharmacophore query search were examined on the basis of their electronic similarity to histamine. Molecular similarity is shown to provide a rapid means of gaining insight into the composition of molecular data sets, with possible implications for future full 3D-database searches.

11.
The increased acceptance of SAR approaches to hazard identification has led us to investigate methods to improve the predictive performance of SAR models. In the present study we demonstrate that although on theoretical grounds the ratio of active to inactive chemicals in the learning set should be unity, SAR models can "tolerate" an unbalanced range in ratios from 3:1 (i.e., 75% actives) to 1:2 (i.e., 33% actives) and still perform adequately. On the other hand, SAR models derived from learning sets with ratios in excess of 4:1 (80% actives), even when corrected for the initial ratio, do not perform satisfactorily.
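The conversion between active-to-inactive ratios and percentages of actives quoted in this abstract is simple arithmetic: a ratio a:b corresponds to a fraction a/(a+b). A short check, with a hypothetical helper name:

```python
from fractions import Fraction


def active_fraction(actives: int, inactives: int) -> Fraction:
    """Fraction of actives in a learning set with the given active:inactive ratio."""
    return Fraction(actives, actives + inactives)


# Ratios quoted in the abstract, rendered as rounded percentages.
percentages = {
    f"{a}:{b}": f"{float(active_fraction(a, b)):.0%}"
    for a, b in [(3, 1), (1, 2), (4, 1)]
}
```

This reproduces the stated values of 75%, 33%, and 80% actives for the ratios 3:1, 1:2, and 4:1.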

12.
Colloidal particles are used to characterize microscopic potential landscapes, which are defined on a sample surface and arise in ensembles of particles. The positions of the particles are recorded using video microscopy. Analysis of the positions, which the particles occupy during their Brownian motion, yields the exact shape of the surface potential, in which the particles move. The underlying principle of our measurements is well-known from measurements using total internal reflection microscopy; in contrast to these measurements, our scheme can be expanded to measurements of inter-particle interactions. As an example, we demonstrate the measurement of interactions between two magnetic particles, sedimenting towards a potential barrier in a tilted geometry.
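One common way to turn recorded particle positions into a potential, and the principle usually applied in total internal reflection microscopy, is Boltzmann inversion: in thermal equilibrium the probability p(x) of finding a particle at x satisfies U(x) = -kT ln p(x) up to a constant. The sketch below assumes this is the kind of analysis meant; the abstract does not spell out its procedure, and the binning scheme, function name, and toy positions are hypothetical.

```python
import math
from collections import Counter


def potential_from_positions(positions, bin_width):
    """Estimate U(x) in units of kT per bin via U = -kT ln p(x).

    Returns a dict mapping each bin's left edge to its energy, shifted so
    that the most populated bin sits at zero. Empty bins are simply absent.
    """
    counts = Counter(math.floor(x / bin_width) for x in positions)
    total = sum(counts.values())
    energies = {b: -math.log(n / total) for b, n in counts.items()}
    offset = min(energies.values())
    return {b * bin_width: u - offset for b, u in sorted(energies.items())}


# Hypothetical 1D positions: the particle spends most of its time near x = 1.
positions = [0.2] * 2 + [1.1] * 8 + [2.7] * 2
potential = potential_from_positions(positions, bin_width=1.0)
```

In this toy data the central bin is visited four times as often as each neighbor, so the neighbors sit ln 4 (about 1.39 kT) above the minimum. A real measurement would need many more samples per bin for the logarithm to be reliable.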

13.
An activity landscape model of a compound data set can be rationalized as a graphical representation that integrates molecular similarity and potency relationships. Activity landscape representations of different design are utilized to aid in the analysis of structure-activity relationships and the selection of informative compounds. Activity landscape models reported thus far focus on a single target (i.e., a single biological activity) or at most two targets, giving rise to selectivity landscapes. For compounds active against more than two targets, landscapes representing multitarget activities are difficult to conceptualize and have not yet been reported. Herein, we present a first activity landscape design that integrates compound potency relationships across multiple targets in a formally consistent manner. These multitarget activity landscapes are based on a general activity cliff classification scheme and are visualized in graph representations, where activity cliffs are represented as edges. Furthermore, the contributions of individual compounds to structure-activity relationship discontinuity across multiple targets are monitored. The methodology has been applied to derive multitarget activity landscapes for compound data sets active against different target families. The resulting landscapes identify single-, dual-, and triple-target activity cliffs and reveal the presence of hierarchical cliff distributions. From these multitarget activity landscapes, compounds forming complex activity cliffs can be readily selected.

14.
Since the introduction of NMR prediction software, medicinal chemists have imagined submitting their compounds to corporate compound registration systems that would ultimately display a simplified pass/fail result. We initially implemented such a system based on HPLC and liquid chromatography mass spectrometry (LCMS) data that is embedded within our industry standard sample submission and registration process. By using gradient-heteronuclear single quantum coherence (HSQC) experiments, we have extended this concept to NMR data through a comparison of experimentally acquired data against predicted ¹H and ¹³C NMR data. Integration of our compound registration system with our analytical instruments now provides our chemists unattended and automated NMR verification for collections of submitted compounds. The benefits achieved from automated processing and interpretation of results produced enhanced confidence in our compound library and released the chemists from the tedium of manipulating large amounts of data. This allows scientists to focus more of their attention on the drug discovery process.

15.
Scoring the activity of compounds in phenotypic high-throughput assays presents a unique challenge because of the limited resolution and inherent measurement error of these assays. Techniques that leverage the structural similarity of compounds within an assay can be used to improve the hit-recovery rate from screening data. A technique is presented that uses clustering and sampling statistics to predict likely compound activity by scoring entire structural classes. A set of phenotypic assays performed against a commercially available compound library was used as a test set. Using the class-scoring technique, the resultant activity prediction scores were more reproducible than individual assay measurements, and class scoring recovered known active compounds more efficiently than individual assay measurements because class scoring had fewer false positives. Known biologically active compounds were recovered 87% of the time using class scores, suggesting a low false-negative rate that compared well to individual assay measurements. In addition, many weak and potentially novel classes of active compounds, overlooked by individual assay measurements, were suggested.

16.
17.
The evaluation of the scaffold hopping potential of computational methods is of high relevance for virtual screening. For benchmark calculations, classes of known active compounds are utilized. Ideally, such classes should have a well-defined content of structurally diverse scaffolds. However, in reported benchmark investigations, the choice of activity classes is often difficult to rationalize. To provide a compendium of well-characterized test cases for the assessment of scaffold hopping potential, structural distances between scaffolds were systematically calculated for compound classes available in the ChEMBL database. Nearly seven million scaffold pairs were evaluated. On the basis of the global scaffold distance distribution, a threshold value for large scaffold distances was determined. Compound data sets were ranked based on the proportion of scaffold pairs with large distances they contained, taking additional criteria into account that are relevant for virtual screening. A set of 50 activity classes is provided that represent attractive test cases for scaffold hopping analysis and benchmark calculations.

18.
19.
20.
Three‐level versions of Multilevel Simultaneous Component Analysis (MLSCA) and Multilevel Partial Least Squares (MLPLS) were developed, which are capable of separating between‐plant, between‐run and within‐run process variation, and modeling these three levels in a multivariate way. In comparison to the two‐level versions, they make it possible to discriminate between overall differences between plants and the variation between runs within a plant. It was shown that the three‐level version of MLSCA has clear added value for the analysis of process runs from different plants. In MLPLS, other projections of the multivariate data onto latent variables and different views of the data are obtained when relevant Y information is available. This has clear added value for obtaining insight into the relation between process data and Y. A special use of MLPLS is to diagnose aberrations in first principles models. In batch process monitoring, MLSCA at three levels allows simultaneous multivariate modelling of batch data from different manufacturing plants. By filtering out the between‐plant and between‐run sources of variation, and using only within‐run variation, monitoring models can be improved. Using within‐run data, it is possible to build monitoring models across manufacturing units and reduce the number of nuisance alarms, while improving abnormal situation detection and diagnosis. Model transfer is only possible if static between‐plant differences exist, but not if there are dynamic differences.
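The separation into between-plant, between-run, and within-run variation can be pictured, for a single variable, as a nested decomposition around means: each measurement equals the grand mean plus a plant offset, a run offset within the plant, and a within-run residual. The sketch below shows only this splitting step under that assumption; it is not the MLSCA or MLPLS algorithm, which would go on to fit component models to each level of multivariate data. The data layout, function name, and toy values are hypothetical.

```python
from statistics import mean


def three_level_split(data):
    """Split measurements into between-plant, between-run, and within-run parts.

    data maps plant -> run -> list of measurements of one variable.
    Returns the grand mean and one row per measurement:
    (plant, run, value, between_plant, between_run, within_run).
    Plant means are taken over all measurements in the plant, which is one
    of several possible conventions when runs have unequal lengths.
    """
    grand = mean([x for runs in data.values() for xs in runs.values() for x in xs])
    rows = []
    for plant, runs in data.items():
        plant_mean = mean([x for xs in runs.values() for x in xs])
        for run, xs in runs.items():
            run_mean = mean(xs)
            for x in xs:
                rows.append(
                    (plant, run, x, plant_mean - grand, run_mean - plant_mean, x - run_mean)
                )
    return grand, rows


# Hypothetical toy data: two plants, two runs each, two measurements per run.
data = {
    "plant_A": {"run_1": [1.0, 3.0], "run_2": [5.0, 7.0]},
    "plant_B": {"run_1": [9.0, 11.0], "run_2": [13.0, 15.0]},
}
grand, rows = three_level_split(data)
```

Each row's three parts add back to the original value once the grand mean is included. A monitoring model built only on the last column would use within-run variation alone, as the abstract suggests.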
