Similar literature
 Found 20 similar documents; search took 31 ms
1.
Chemical fingerprints are used to represent chemical molecules by recording the presence or absence, or by counting the number of occurrences, of particular features or substructures, such as labeled paths in the 2D graph of bonds, of the corresponding molecule. These fingerprint vectors are used to search large databases of small molecules, currently containing millions of entries, using various similarity measures, such as the Tanimoto or Tversky's measures and their variants. Here, we derive simple bounds on these similarity measures and show how these bounds can be used to considerably reduce the subset of molecules that need to be searched. We consider both the case of single-molecule and multiple-molecule queries, as well as queries based on fixed similarity thresholds or aimed at retrieving the top K hits. We study the speedup as a function of query size and distribution, fingerprint length, similarity threshold, and database size |D| and derive analytical formulas that are in excellent agreement with empirical values. The theoretical considerations and experiments show that this approach can provide linear speedups of one or more orders of magnitude in the case of searches with a fixed threshold, and achieve sublinear speedups in the range of O(|D|^0.6) for the top K hits in current large databases. This pruning approach yields subsecond search times across the 5 million compounds in the ChemDB database, without any loss of accuracy.
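
As a rough illustration of the kind of bound this abstract refers to, the sketch below uses one well-known inequality for binary fingerprints: if the query has A bits set and a database fingerprint has B bits set, their Tanimoto similarity cannot exceed min(A, B) / max(A, B). This is an assumption about which bound is meant, not a reproduction of the paper's derivation, and the function names are hypothetical. Fingerprints are stored as Python integers.

```python
def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints stored as ints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def threshold_search(query, database, threshold):
    """Return (index, similarity) pairs with similarity >= threshold.

    A fingerprint is skipped without computing its similarity when the
    bit-count bound min(A, B) / max(A, B) is already below the threshold.
    """
    a = popcount(query)
    hits = []
    for i, fp in enumerate(database):
        b = popcount(fp)
        hi, lo = max(a, b), min(a, b)
        bound = lo / hi if hi else 1.0
        if bound < threshold:
            continue  # pruned: the similarity cannot reach the threshold
        s = tanimoto(query, fp)
        if s >= threshold:
            hits.append((i, s))
    return hits


database = [0b1111, 0b0001, 0b0111, 0b11111111]
print(threshold_search(0b1111, database, 0.7))  # [(0, 1.0), (2, 0.75)]
```

In a real system the bit counts would presumably be precomputed and the database sorted by them, so that whole ranges of fingerprints can be skipped at once rather than tested one by one.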

2.
Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new, efficient, lossless compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark data sets of druglike molecules.
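
To make the idea of entropy-coding run lengths concrete, here is a minimal sketch of plain Elias gamma coding applied to the gaps between set bits of a sparse fingerprint. It does not implement the MOV or MOL schemes described in the abstract; the function names and the bit-string output format are assumptions chosen for readability.

```python
def elias_gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]
    # Prefix with (length - 1) zeros so a decoder knows how many bits to read.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codes into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def run_lengths(on_bits):
    """Gaps between consecutive set-bit indices (sorted, distinct), each >= 1."""
    gaps, prev = [], -1
    for idx in on_bits:
        gaps.append(idx - prev)
        prev = idx
    return gaps


on_bits = [0, 3, 4, 10]
gaps = run_lengths(on_bits)                             # [1, 3, 1, 6]
code = "".join(elias_gamma_encode(g) for g in gaps)     # '1011100110'
assert elias_gamma_decode(code) == gaps                 # lossless round trip
```

Small gaps get short codes, which is why a code of this family can suit fingerprints whose run lengths are skewed toward small values.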

3.
In many modern chemoinformatics systems, molecules are represented by long binary fingerprint vectors recording the presence or absence of particular features or substructures, such as labeled paths or trees, in the molecular graphs. These long fingerprints are often compressed to much shorter fingerprints using a simple modulo operation. As the length of the fingerprints decreases, their typical density and overlap tend to increase, and so does any similarity measure based on overlap, such as the widely used Tanimoto similarity. Here we show that this correlation between shorter fingerprints and higher similarity can be thought of as a systematic error introduced by the fingerprint folding algorithm and that this systematic error can be corrected mathematically. More precisely, given two molecules and their compressed fingerprints of a given length, we show how a better estimate of their uncompressed overlap, hence of their similarity, can be derived to correct for this bias. We show how the correction can be implemented not only for the Tanimoto measure but also for all other commonly used measures. Experiments on various data sets and fingerprint sizes demonstrate how, with a negligible computational overhead, the correction noticeably improves the sensitivity and specificity of chemical retrieval.
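
The bias described in this abstract can be seen in a toy example. The sketch below folds sparse fingerprints (sets of feature indices) with a modulo operation and compares Tanimoto similarity before and after folding. The feature indices are made up for illustration, and the correction proposed in the paper is not implemented here.

```python
def fold(on_bits, length):
    """Fold a sparse fingerprint (set of feature indices) to `length` bits."""
    return {i % length for i in on_bits}


def tanimoto_sets(a, b):
    """Tanimoto similarity of two fingerprints given as sets of set-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


# Hypothetical feature indices for two molecules sharing one feature (3).
mol_a = {3, 1027, 5000}
mol_b = {3, 2051, 9000}

print(tanimoto_sets(mol_a, mol_b))                          # 0.2
# After folding to 1024 bits, 1027 and 2051 both collide with index 3,
# so each fingerprint shrinks to two bits and the similarity rises.
print(tanimoto_sets(fold(mol_a, 1024), fold(mol_b, 1024)))  # 0.333...
```

Collisions merge distinct features into the same position, which is one way the overlap, and with it the similarity, can be overestimated after folding.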

4.
A statistical approach named the conditional correlated Bernoulli model is introduced for modeling similarity scores and predicting the potential of fingerprint search calculations to identify active compounds. Fingerprint features are rationalized as dependent Bernoulli variables and conditional distributions of Tanimoto similarity values of database compounds given a reference molecule are assessed. The conditional correlated Bernoulli model is utilized in the context of virtual screening to estimate the position of a compound obtaining a certain similarity value in a database ranking. Through the generation of receiver operating characteristic curves from cumulative distribution functions of conditional similarity values for known active and random database compounds, one can predict how successful a fingerprint search might be. The comparison of curves for different fingerprints makes it possible to identify fingerprints that are most likely to identify new active molecules in a database search given a set of known reference molecules.

5.
6.
Fingerprint scaling is a method to increase the performance of similarity search calculations. It is based on the detection of bit patterns in keyed fingerprints that are signatures of specific compound classes. Application of scaling factors to consensus bits that are mostly set on emphasizes signature bit patterns during similarity searching and has been shown to improve search results for different fingerprints. Similarity search profiling has recently been introduced as a method to analyze similarity search calculations. Profiles separately monitor correctly identified hits and other detected database compounds as a function of similarity threshold values and make it possible to estimate whether virtual screening calculations can be successful or to evaluate why they fail. This similarity search profile technique has been applied here to study fingerprint scaling in detail and better understand effects that are responsible for its performance. In particular, we have focused on the qualitative and quantitative analysis of similarity search profiles under scaling conditions. Therefore, we have carried out systematic similarity search calculations for 23 biological activity classes under scaling conditions over a wide range of scaling factors in a compound database containing approximately 1.3 million molecules and monitored these calculations in similarity search profiles. Analysis of these profiles confirmed increases in hit rates as a consequence of scaling and revealed that scaling influences similarity search calculations in different ways. Based on scaled similarity search profiles, compound sets could be divided into different categories. In a number of cases, increases in search performance under scaling conditions were due to a more significant relative increase in correctly identified hits than detected false-positives. This was also consistent with the finding that preferred similarity threshold values increased due to fingerprint scaling, which was well illustrated by similarity search profiling.

7.
8.
9.
Similarity searching using molecular fingerprints is a widely used approach for the identification of novel hits. A fingerprint search involves many pairwise comparisons of bit string representations of known active molecules with those precomputed for database compounds. Bit string overlap, as evaluated by various similarity metrics, is used as a measure of molecular similarity. Results of a number of studies focusing on fingerprints suggest that it is difficult, if not impossible, to develop generally applicable search parameters and strategies, irrespective of the compound classes under investigation. Rather, more or less, each individual search problem requires an adjustment of calculation conditions. Thus, there is a need for diagnostic tools to analyze fingerprint-based similarity searching. We report an analysis of fingerprint search calculations on different sets of structurally diverse active compounds. Calculations on five biological activity classes were carried out with two fingerprints in two compound source databases, and the results were analyzed in histograms. Tanimoto coefficient (Tc) value ranges where active compounds were detected were compared to the distribution of Tc values in the database. The analysis revealed that compound class-specific effects strongly influenced the outcome of these fingerprint calculations. Among the five diverse compound sets studied, very different search results were obtained. The analysis described here can be applied to determine Tc intervals where scaffold hopping occurs. It can also be used to benchmark fingerprint calculations or estimate their probability of success.

10.
The ever-growing size of chemical databases calls for the development of novel methods for representing and comparing molecules. One such method, called LINGO, is based on fragmenting the SMILES string representation of molecules. Comparison of molecules can then be performed by calculating the Tanimoto coefficient, which is called LINGOsim when used on LINGO multisets. This paper introduces a verbose representation for storing LINGO multisets, which makes it possible to transform them into sparse fingerprints such that fingerprint data structures and algorithms can be used to accelerate queries. The previous best method for rapidly calculating the LINGOsim similarity matrix required specialized hardware to yield a significant speedup over existing methods. By representing LINGO multisets in the verbose representation and using inverted indices, it is possible to calculate LINGOsim similarity matrices roughly 2.6 times faster than existing methods without relying on specialized hardware.
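
A minimal sketch of a Tanimoto coefficient on substring multisets, in the spirit of the LINGOsim measure mentioned above, might look as follows. Several points are assumptions rather than details from the abstract: the substring length q = 4, the absence of any SMILES normalization, and the multiset Tanimoto form (sum of minimum counts over sum of maximum counts). The verbose representation and inverted indices from the paper are not shown.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Multiset of all overlapping length-q substrings of a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingosim(smiles_a, smiles_b, q=4):
    """Multiset Tanimoto: sum of min counts divided by sum of max counts."""
    la, lb = lingos(smiles_a, q), lingos(smiles_b, q)
    inter = sum((la & lb).values())   # Counter & keeps the minimum count
    union = sum((la | lb).values())   # Counter | keeps the maximum count
    return inter / union if union else 1.0


print(lingosim("CCCCO", "CCCCO"))  # 1.0
print(lingosim("CCCCO", "CCCCN"))  # 0.333... (shared: CCCC; distinct: CCCO, CCCN)
```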

11.
An analysis method termed similarity search profiling has been developed to evaluate fingerprint-based virtual screening calculations. The analysis is based on systematic similarity search calculations using multiple template compounds over the entire value range of a similarity coefficient. In graphical representations, numbers of correctly identified hits and other detected database compounds are separately monitored. The resulting profiles make it possible to determine whether a virtual screening trial can in principle succeed for a given compound class, search tool, similarity metric, and selection criterion. As a test case, we have analyzed virtual screening calculations using a recently designed fingerprint on 23 different biological activity classes in a compound source database containing approximately 1.3 million molecules. Based on our predefined selection criteria, we found that virtual screening analysis was successful for 19 of 23 compound classes. Profile analysis also makes it possible to determine compound class-specific similarity threshold values for similarity searching.

12.
13.
The utility of chemoinformatics systems depends on the accurate computer representation and efficient manipulation of chemical compounds. In such systems, a small molecule is often digitized as a large fingerprint vector, where each element indicates the presence/absence or the number of occurrences of a particular structural feature. Since in theory the number of unique features can be exceedingly large, these fingerprint vectors are usually folded into much shorter ones using hashing and modulo operations, allowing fast "in-memory" manipulation and comparison of molecules. There is increasing evidence that lossless fingerprints can substantially improve retrieval performance in chemical database searching (substructure or similarity), which has led to the development of several lossless fingerprint compression algorithms. However, any gains in storage and retrieval afforded by compression need to be weighed against the extra computational burden required for decompression before these fingerprints can be compared. Here we demonstrate that graphics processing units (GPUs) can greatly alleviate this problem, enabling the practical application of lossless fingerprints on large databases. More specifically, we show that, with the help of a ~$500 ordinary video card, the entire PubChem database of ~32 million compounds can be searched in ~0.2-2 s on average, which is 2 orders of magnitude faster than a conventional CPU. If multiple query patterns are processed in batch, the speedup is even more dramatic (less than 0.02-0.2 s/query for 1000 queries). In the present study, we use the Elias gamma compression algorithm, which results in a compression ratio as high as 0.097.

14.
Differences in molecular complexity and size are known to bias the evaluation of fingerprint similarity. For example, complex molecules tend to produce fingerprints with higher bit density than simple ones, which often leads to artificially high similarity values in search calculations. We introduce here a variant of the Tversky coefficient that makes it possible to modulate or eliminate molecular complexity effects when evaluating fingerprint similarity. This has enabled us to study in detail the role of molecular complexity in similarity searching and the relationship between reference and active database compounds. Balancing complexity effects leads to constant distributions of similarity values for reference and database molecules, independent of how compound contributions are weighted. When searching for active compounds with varying complexity, hit rates can be optimized by modulating complexity effects, rather than eliminating them, and adjusting relative compound weights. For reference molecules and active database compounds having different complexity, preferred parameter settings are identified.
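
For orientation, the standard Tversky coefficient on binary fingerprints can be sketched as below. This is the textbook form, with weights alpha and beta on the bits unique to the reference and to the database compound. It is not the complexity-balancing variant introduced in the abstract, whose exact definition is not given here. Fingerprints are stored as Python integers, and the function names are hypothetical.

```python
def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")


def tversky(ref, db, alpha, beta):
    """Standard Tversky coefficient: c / (alpha * a + beta * b + c).

    c = bits set in both fingerprints,
    a = bits set only in the reference, b = bits set only in the database one.
    """
    c = popcount(ref & db)
    a = popcount(ref) - c
    b = popcount(db) - c
    denom = alpha * a + beta * b + c
    return c / denom if denom else 1.0


ref, db = 0b1111, 0b0011                     # db is a sub-pattern of ref
print(tversky(ref, db, alpha=1, beta=1))     # 0.5 (equals Tanimoto)
print(tversky(ref, db, alpha=0, beta=1))     # 1.0 (ignores bits unique to ref)
print(tversky(ref, db, alpha=1, beta=0))     # 0.5 (ignores bits unique to db)
```

Shifting weight between alpha and beta changes how strongly a size mismatch between the two molecules is penalized, which is the effect the abstract sets out to control.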

15.
16.
Similarity searches using combinations of seven different similarity coefficients and six different representations have been carried out on the Dictionary of Natural Products database. The objective was to discover if any special methods of searching apply to this database, which is very different in nature from the many synthetic databases that have been the subject of previous studies of similarity searching. Search effectiveness was assessed by a recall analysis of the search outputs from sets of pharmacologically active target structures. The different target sets produce exceptional but contradictory results for the Russell-Rao and Forbes coefficients, which have been shown to be due to a dependence on molecular size; these are the coefficients of choice in the case of large and small structures, respectively. Rankings from these results have been combined using a data fusion scheme and some small gains in performance were normally obtained by using substructural fingerprints and molecular holograms in combination with the Squared Euclidean or Tanimoto coefficients.
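
The size dependence mentioned for the Russell-Rao and Forbes coefficients can be illustrated with their usual binary-fingerprint definitions: Russell-Rao divides the number of shared bits by the fingerprint length n, while Forbes computes n times the shared bits divided by the product of the two bit counts. The fingerprints below are invented toy data, not taken from the study.

```python
def russell_rao(a, b, n):
    """Russell-Rao coefficient: shared bits divided by fingerprint length n."""
    return len(a & b) / n


def forbes(a, b, n):
    """Forbes coefficient: n * shared bits / (bits in a * bits in b)."""
    if not a or not b:
        return 0.0
    return n * len(a & b) / (len(a) * len(b))


n = 16                                # fingerprint length (toy value)
ref = {0, 1, 2, 3}                    # reference structure, 4 bits set
small = {0, 1}                        # small structure: subset of ref
big = set(range(8))                   # large structure: superset of ref

print(russell_rao(ref, small, n), russell_rao(ref, big, n))  # 0.125 0.25
print(forbes(ref, small, n), forbes(ref, big, n))            # 4.0 2.0
```

In this toy case Russell-Rao ranks the larger structure higher and Forbes ranks the smaller one higher, in line with the opposite size preferences described in the abstract.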

17.
18.
19.
Similarity by compression   Cited by: 1 (self-citations: 0, citations by others: 1)
We present a simple and effective method for similarity searching in virtual high-throughput screening, requiring only a string-based representation of the molecules (e.g., SMILES) and standard compression software, available on all modern desktop computers. This method utilizes the normalized compression distance, an approximation of the normalized information distance, based on the concept of Kolmogorov complexity. On representative data sets, we demonstrate that compression-based similarity searching can outperform standard similarity searching protocols, exemplified by the Tanimoto coefficient combined with a binary fingerprint representation and data fusion. Software to carry out compression-based similarity is available from our Web site at http://comp.chem.nottingham.ac.uk/download/zippity.
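
The normalized compression distance named in this abstract is commonly written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. The sketch below uses Python's `zlib` as the compressor, which is an assumption: the abstract does not say which compressor the authors' software uses. The SMILES strings are only illustrative inputs.

```python
import zlib


def compressed_size(data):
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two strings."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

# A string is expected to be closer to itself than to an unrelated string.
print(ncd(aspirin, aspirin) < ncd(aspirin, caffeine))  # True
```

For strings as short as single SMILES, the compressor's fixed overhead is a large share of every compressed length, so the absolute values are coarse. Only their relative ordering is used here.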

20.
Similarity of compound chemical structures often leads to close pharmacological profiles, including binding to the same protein targets. The opposite, however, is not always true, as distinct chemical scaffolds can exhibit similar pharmacology as well. Therefore, relying on chemical similarity to known binders in the search for novel chemicals targeting the same protein artificially narrows down the results and makes lead hopping impossible. In this study we attempt to design a compound similarity/distance measure that better captures structural aspects of their pharmacology and molecular interactions. The measure is based on our recently published method for compound spatial alignment with atomic property fields as a generalized 3D pharmacophoric potential. We optimized contributions of different atomic properties for better discrimination of compound pairs with the same pharmacology from those with different pharmacology using Partial Least Squares regression. Our proposed similarity measure was then tested for its ability to discriminate pharmacologically similar pairs from decoys on a large diverse dataset of 115 protein–ligand complexes. Compared to 2D Tanimoto and Shape Tanimoto approaches, our new approach led to improvement in the area under the receiver operating characteristic curve values in 66% and 58% of domains, respectively. The improvement was particularly high for the previously problematic cases (weak performance of the 2D Tanimoto and Shape Tanimoto measures) with original AUC values below 0.8. In fact, for these cases we obtained improvement in 86% of domains compared to the 2D Tanimoto measure and 85% compared to the Shape Tanimoto measure. The proposed spatial chemical distance measure can be used in virtual ligand screening.
