Similar literature
Found 20 similar documents (search time: 62 ms)
1.
This paper compares the performance of two clustering methods, DPClus graph clustering and hierarchical clustering, in classifying volatile organic compounds (VOCs) using a fingerprint-based similarity measure between chemical structures. The clustering results from each method were compared to determine the degree of cluster overlap and how well each method classified the chemical structures of VOCs into clusters. We also point out the advantages and limitations of both clustering methods. In conclusion, chemical similarity measures can be used to predict the biological activities of a compound, and this can be applied in the medical, pharmaceutical and agrotechnology fields.

2.
A wide variety of computational algorithms have been developed that strive to capture the chemical similarity between two compounds for use in virtual screening and lead discovery. One limitation of such approaches is that, while a returned similarity value reflects the perceived degree of relatedness between any two compounds, there is no direct correlation between this value and the expectation or confidence that any two molecules will in fact be equally active. A lack of a common framework for interpretation of similarity measures also confounds the reliable fusion of information from different algorithms. Here, we present a probabilistic framework for interpreting similarity measures that directly correlates the similarity value to a quantitative expectation that two molecules will in fact be equipotent. The approach is based on extensive benchmarking of 10 different similarity methods (MACCS keys, Daylight fingerprints, maximum common subgraphs, rapid overlay of chemical structures (ROCS) shape similarity, and six connectivity-based fingerprints) against a database of more than 150,000 compounds with activity data against 23 protein targets. Given this unified and probabilistic framework for interpreting chemical similarity, principles derived from decision theory can then be applied to combine the evidence from different similarity measures in such a way that both capitalizes on the strengths of the individual approaches and maintains a quantitative estimate of the likelihood that any two molecules will exhibit similar biological activity.

3.
Increasingly, chemical libraries are being produced which are focused on a biological target or group of related targets, rather than simply being constructed in a combinatorial fashion. A screening collection compiled from such libraries will contain multiple analogues of a number of discrete series of compounds. The question arises as to how many analogues are necessary to represent each series in order to ensure that an active series will be identified. Based on a simple probabilistic argument and supported by in-house screening data, guidelines are given for the number of compounds necessary to achieve a "hit", or series of hits, at various levels of certainty. Obtaining more than one hit from the same series is useful since this gives early acquisition of SAR (structure-activity relationship) and confirms a hit is not a singleton. We show that screening collections composed of only small numbers of analogues of each series are sub-optimal for SAR acquisition. Based on these studies, we recommend a minimum series size of about 200 compounds. This gives a high probability of confirmatory SAR (i.e. at least two hits from the same series). More substantial early SAR (at least 5 hits from the same series) can be gained by using series of about 650 compounds each. With this level of information being generated, more accurate assessment of the likely success of the series in hit-to-lead and later stage development becomes possible.
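The abstract above does not spell out its probabilistic argument. One plausible reading is a binomial model, sketched below under loud assumptions: each analogue in an active series is taken to be a hit independently with the same probability `p`. The value of `p`, the function names and the confidence level are all hypothetical and are not taken from the paper, so the sketch is not expected to reproduce the quoted series sizes.

```python
from math import comb


def prob_at_least(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits among n analogues."""
    # Complement of the probability of seeing fewer than k hits.
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def min_series_size(k: int, p: float, confidence: float, n_max: int = 100_000) -> int:
    """Smallest series size n with P(at least k hits) >= confidence.

    p is an ASSUMED per-compound hit probability within an active series.
    """
    for n in range(k, n_max + 1):
        if prob_at_least(n, k, p) >= confidence:
            return n
    raise ValueError("no series size up to n_max reaches the requested confidence")


if __name__ == "__main__":
    # Hypothetical inputs: 2% per-compound hit rate, 90% confidence.
    for hits in (1, 2, 5):
        print(hits, min_series_size(hits, p=0.02, confidence=0.90))
```

Under this model the required series size grows quickly with the number of hits demanded, which is in line with the abstract's point that confirmatory SAR needs considerably larger series than a single hit.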

4.
The use of multi-dimensional “chemistry spaces” to represent large compound collections has become widespread in pharmaceutical research. In such spaces compounds are treated as points. Points in close proximity represent similar compounds, while distant points represent dissimilar compounds. Assessing the diversity of a compound collection is thus tantamount to characterizing the distribution of points in chemistry space. To facilitate procedures such as selecting subsets of compounds for screening, acquiring compounds, and designing combinatorial libraries, chemistry spaces have been partitioned into sets of non-overlapping, multi-dimensional cells, which are generated by dividing each axis into a number of equal-sized bins. This leads to a lattice of $(N_{bins})^{N_{\rm dim}}$ cells, where $N_{bins}$ is the number of bins on each axis and $N_{\rm dim}$ is the dimensionality of the space. One diversity measure that is typically used in cell-based chemistry spaces, $D_{N_{cpd}}^{cpd}$, is identical in form to Shannon entropy. A normalized form of this measure, $D_{rel}^{cpd}$, enables comparison between compound collections that occupy different numbers of cells. Although $D_{rel}^{cpd}$ characterizes the uniformity and “spread” of the corresponding compound collection, it treats cells as positionally independent. Some of the positional information lost can be recaptured by another diversity measure, which is also related in form to Shannon entropy. This new measure, $D_{N_{bin}}^{cell}(\lambda)$, characterizes the distribution of occupied cells along each axis of chemistry space. The normalized measure averaged over all axes is then given by $\left\langle D_{rel}^{cell} \right\rangle$. Examples illustrating the applicability of these two Shannon-like measures to compound collections are presented.
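A minimal sketch of a cell-based, Shannon-like diversity measure of the kind this abstract describes may help make the idea concrete. Each axis is cut into equal-sized bins, compounds are counted per cell, and the entropy of the cell occupancies is computed. The function names are hypothetical, and normalizing by the logarithm of the number of occupied cells is an assumption, since the abstract does not give the exact formula.

```python
from collections import Counter
from math import log


def cell_index(point, lows, highs, n_bins):
    """Map a point to its lattice cell by cutting each axis into n_bins equal bins."""
    idx = []
    for x, lo, hi in zip(point, lows, highs):
        b = int((x - lo) / (hi - lo) * n_bins)
        idx.append(min(max(b, 0), n_bins - 1))  # clamp points on or past the edges
    return tuple(idx)


def shannon_diversity(points, lows, highs, n_bins):
    """Return (entropy, relative entropy) of the compound distribution over cells.

    ASSUMPTION: the relative value divides by log(number of occupied cells);
    the abstract does not state its normalization explicitly.
    """
    counts = Counter(cell_index(p, lows, highs, n_bins) for p in points)
    total = sum(counts.values())
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    n_occupied = len(counts)
    relative = entropy / log(n_occupied) if n_occupied > 1 else 0.0
    return entropy, relative


if __name__ == "__main__":
    # Four compounds in a 2D space, one per cell of a 2 x 2 lattice.
    pts = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
    print(shannon_diversity(pts, lows=(0.0, 0.0), highs=(1.0, 1.0), n_bins=2))
```

A perfectly even spread over the occupied cells gives a relative value of 1, while a collection packed into a single cell gives 0. As the abstract notes, a measure of this form ignores where the occupied cells sit relative to one another.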

5.
Euclidean geometry and information and fuzzy-set theory are used to develop general criteria for the evaluation of clustering methods. A separation function, describing the geometric clustering in a feature space for a given separation state, is introduced. Suitable clustering algorithms for given data can be selected by using the measure derived. The criteria developed are used in studies of the homogeneity of solids.

6.
As a result of recent developments in high-throughput screening for drug discovery, the number of available screening compounds has been growing rapidly. Chemical vendors provide millions of compounds; however, these compounds are highly redundant. Clustering analysis, a technique that groups similar compounds into families, can be used to analyze such redundancy. Many available clustering methods focus on accurate classification of compounds; they are slow and are not suitable for very large compound libraries. A fast clustering method based on an incremental clustering algorithm and the 2D fingerprints of compounds is described here. This method can cluster a very large data set with millions of compounds in hours on a single computer. A program implemented with this method, called cd-hit-fp, is available from http://chemspace.org.
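The general idea of incremental clustering on 2D fingerprints can be sketched as follows. This is a generic illustration, not the cd-hit-fp implementation: the function names, the representation of fingerprints as sets of on-bit indices, the Tanimoto coefficient as the similarity measure, and the 0.8 threshold are all assumptions made for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def incremental_cluster(fingerprints, threshold=0.8):
    """Single-pass clustering: each compound joins the first representative it is
    similar enough to; otherwise it becomes the representative of a new cluster.

    Returns a dict mapping representative index -> list of member indices.
    """
    representatives = []
    clusters = {}
    for i, fp in enumerate(fingerprints):
        for rep in representatives:
            if tanimoto(fp, fingerprints[rep]) >= threshold:
                clusters[rep].append(i)
                break
        else:  # no representative was close enough
            representatives.append(i)
            clusters[i] = [i]
    return clusters


if __name__ == "__main__":
    fps = [
        frozenset({1, 2, 3, 4}),
        frozenset({1, 2, 3, 4, 5}),  # Tanimoto 0.8 to the first fingerprint
        frozenset({10, 11}),         # unrelated to both
    ]
    print(incremental_cluster(fps, threshold=0.8))
```

Each compound is compared only against cluster representatives rather than against every other compound, which is why approaches of this kind avoid building a full similarity matrix. The result depends on input order, a usual trade-off of single-pass schemes.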

7.
8.
9.
Methods and algorithms for predicting the properties of chemical compounds by common fragments of their molecular graphs are described. The prediction algorithms are based on determination of a measure of structural proximity (distance) between molecular graphs, which depends on the size of their common fragment. The prediction procedure involves the following steps: partitioning the property classes of the training sample compounds into subclasses of structurally similar compounds; seeking structurally typical compounds and their fragments in each subclass; classifying control compounds according to their distances from the training sample compounds or fragments of classes; forming a set of essential fragments of samples potentially responsible for the properties exhibited by the compounds. The algorithms were successfully tested in the BACC system for analyzing and classifying biologically active compounds designed at the Institute of Mathematics, Siberian Branch, Russian Academy of Sciences. S. L. Sobolev Institute of Mathematics, Siberian Branch, Russian Academy of Sciences. Translated from Zhurnal Strukturnoi Khimii, Vol. 39, No. 1, pp. 113–125, January–February, 1998.

10.
Using data mining techniques, we have studied a subset (1400) of compounds from the large public National Cancer Institute (NCI) compounds data repository. We first carried out a functional class identity assignment for the 60 NCI cancer testing cell lines via hierarchical clustering of gene expression data. Comprising nine clinical tissue types, the 60 cell lines were placed into six classes: melanoma, leukemia, renal, lung, and colorectal, with the sixth class comprising mixed-tissue cell lines not found in any of the other five classes. We then carried out supervised machine learning, using the GI(50) values tested on a panel of 60 NCI cancer cell lines. For separate 3-class and 2-class problem clustering, we successfully carried out clear cell line class separation at high stringency, p < 0.01 (Bonferroni corrected t-statistic), using feature reduction clustering algorithms embedded in RadViz, an integrated high dimensional analytic and visualization tool. We started with the 1400 compound GI(50) values as input and selected only those compounds, or features, significant in carrying out the classification. With this approach, we identified two small sets of compounds that were most effective in carrying out complete class separation of the melanoma, non-melanoma classes and leukemia, non-leukemia classes. To validate these results, we showed that these two compound sets' GI(50) values were highly accurate classifiers using five standard analytical algorithms. One compound set was most effective against the melanoma class cell lines (14 compounds), and the other set was most effective against the leukemia class cell lines (30 compounds). The two compound classes were both significantly enriched in two different types of substituted p-quinones. The melanoma cell line class of 14 compounds included 11 compounds that were internal substituted p-quinones, and the leukemia cell line class of 30 compounds included 6 compounds that were external substituted p-quinones. Attempts to subclassify melanoma or leukemia cell lines based upon their clinical cancer subtype met with limited success. For example, using GI(50) values for the 30 compounds we identified as effective against all leukemia cell lines, we could subclassify acute lymphoblastic leukemia (ALL) origin cell lines from non-ALL leukemia origin cell lines without significant overlap from non-leukemia cell lines. Based upon clustering using GI(50) values for the 60 cancer cell lines laid out by the RadViz algorithm, these two compound subsets did not overlap with clusters containing any of the NCI's 92 compounds of known mechanism of action, a few of which are quinones. Given their structural patterns, the two p-quinone subtypes we identified would clearly be expected to possess different redox potentials/substrate specificities for enzymatic reduction in vivo. These two p-quinone subtypes represent valuable information that may be used in the elucidation of pharmacophores for the design of compounds to treat these two cancer tissue types in the clinic.

11.
12.
In this review, we discuss a number of computational methods that have been developed or adapted for molecule classification and virtual screening (VS) of compound databases. In particular, we focus on approaches that are complementary to high-throughput screening (HTS). The discussion is limited to VS methods that operate at the small-molecule level, which is often called ligand-based VS (LBVS), and does not take into account docking algorithms or other structure-based screening tools. We describe areas that greatly benefit from combining virtual and biological screening and discuss computational methods that are most suitable to contribute to the integration of screening technologies. Relevant approaches range from established methods such as clustering or similarity searching to techniques that have only recently been introduced for LBVS applications such as statistical methods or support vector machines. Finally, we discuss a number of representative applications at the interface between VS and HTS.

13.
This study describes the analysis of total hops essential oils from 18 cultivated varieties of hops, five of which were bred in Lithuania, and 7 wild hop forms using gas chromatography-mass spectrometry. The study sought to organise the samples of hops into clusters, according to 72 semi-volatile compounds, by applying a well-known method, k-means clustering analysis, and to identify the origin of the Lithuanian hop varieties. The bouquet of the hops essential oil was composed of various esters, terpenes, hydrocarbons and ketones. Monoterpenes (mainly β-myrcene), sesquiterpenes (dominated by β-caryophyllene and α-humulene) and oxygenated sesquiterpenes (mainly caryophyllene oxide and humulene epoxide II) were the main compound groups detected in the samples tested. The above compounds, together with α-muurolene, were the only compounds found in all the samples. Qualitative and quantitative differences were observed in the composition of the essential oils of the hop varieties analysed. For successful and statistically significant clustering of the data obtained, expertise and skills in employing chemometric analysis methods are necessary. The result is also highly dependent on the set of samples (representativeness) used for segmentation into groups, the technique for pre-processing the data, the method selected for partitioning the samples according to the similarity measures chosen, etc. To achieve a large and representative data set for clustering analysis from a small number of measurements, numerical simulation was applied using the Monte Carlo method with normal and uniform distributions and several relative standard deviation values. The grouping was performed using the k-means clustering method, employing several techniques for evaluating the optimal number of clusters (Davies-Bouldin index, distortion function, etc.) and different data pre-processing approaches. The hop samples analysed were separated into 3 and 5 clusters according to the data filtering scenario used. However, the targeted Lithuanian hop varieties were clustered identically in both cases and fell into the same group together with other cultivated hop varieties from Ukraine and Poland.

14.
The Empty Space index (ES) presented here evaluates the fraction of the information space without experimental points, i.e. the space where the distance from an experimental point is significantly larger than the mean distance between the experimental points themselves. ES can be used to eliminate the ambiguity of some clustering indexes that aim to evaluate the separation of a data set into clusters but are in fact a mixed measure of clustering, of empty space (the empty space does not necessarily correspond to the break between clusters) and of the degree of uniformity of the objects. The ES index can also be used to correct the MST index, the clustering index based on the distribution of edge lengths in the minimum spanning tree connecting the objects. The corrected MST index seems to be a reliable measure of the clustering degree.

15.
Counting compounds (rather than papers or citations) offers a new perspective for quantitative analyses of research activities. First of all, we can precisely define (compound-related) research topics and access the corresponding publications (scientific papers as well as patents) as a measure of research activity. We can also establish the time evolution of the publications dealing with specific compounds or compound classes. Moreover, the mapping of compounds by establishing compound-based landscapes has some potential to visualize the compound basis of research topics for further research activities. We have analyzed the rare earth compounds to give an example of a broad compound class. We present the number of the currently existing compounds and of the corresponding publications as well as the time evolution of the papers and patents. Furthermore, we have analyzed the rare earth cuprates (copper oxides) as an example of a narrower compound class to demonstrate the potential of mapping compounds by compound-based landscapes. We have quantified the various element combinations of the existing compounds and revealed all element combinations not yet realized in the synthesis within this compound class. Finally, we have analyzed the quasicrystal compound category as an example of a compound class that is not defined by a specific element combination or a molecular structure.

16.
In this paper we propose a new method based on measurements of structural similarity for the clustering of chemical databases. The proposed method allows the dynamic adjustment of the size and number of cells or clusters into which the database is classified. Classification is carried out using measurements of structural similarity obtained from the matching of molecular graphs. The classification process is open to the use of different similarity indexes and different measurements of matching. This process consists of the projection of the obtained measures of similarity among the elements of the database into a new space of similarity. The possibility of dynamically readjusting the dimension and characteristics of the projection space to adapt to the most favorable conditions of the problem under study, together with the simplicity and computational efficiency of the method, makes it appropriate for use with medium and large databases. The clustering method increases the performance of screening processes in chemical databases, facilitating the recovery of chemical compounds that share all or subsets of their substructures with a given pattern. For this work, a database of 498 natural compounds with wide molecular diversity, extracted from the SPECS and BIOSPECS B.V. free database, has been used.

17.
SNM1A is a nuclease that is implicated in DNA interstrand crosslink repair and, as such, its inhibition is of interest for overcoming resistance to chemotherapeutic crosslinking agents. However, the number and identity of the metal ion(s) in the active site of SNM1A are still unconfirmed, and only a limited number of inhibitors have been reported to date. Herein, we report the synthesis and evaluation of a family of malonate-based modified nucleosides to investigate the optimal positioning of metal-binding groups in nucleoside-derived inhibitors for SNM1A. These compounds include ester, carboxylate and hydroxamic acid malonate derivatives which were installed in the 5′-position or 3′-position of thymidine or as a linkage between two nucleosides. Evaluation as inhibitors of recombinant SNM1A showed that nine of the twelve compounds tested had an inhibitory effect at 1 mM concentration. The most potent compound contains a hydroxamic acid malonate group at the 5′-position. Overall, our studies advance the understanding of requirements for nucleoside-derived inhibitors for SNM1A and indicate that groups containing a negatively charged group in close proximity to a metal chelator, such as hydroxamic acid malonates, are promising structures in the design of inhibitors.

18.
Similarity measures based on the comparison of dense bit vectors of two-dimensional chemical features are a dominant method in chemical informatics. For large-scale problems, including compound selection and machine learning, computing the intersection between two dense bit vectors is the overwhelming bottleneck. We describe efficient implementations of this primitive as well as example applications using features of modern CPUs that allow 20-40× performance increases relative to typical code. Specifically, we describe fast methods for population count on modern x86 processors and cache-efficient matrix traversal and leader clustering algorithms that alleviate memory bandwidth bottlenecks in similarity matrix construction and clustering. The speed of our 2D comparison primitives is within a small factor of that obtained on GPUs and does not require specialized hardware.
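The primitive this abstract refers to, intersecting two dense bit vectors and counting the set bits, can be illustrated in a few lines. The sketch below stores each fingerprint in a Python integer and is meant only to show the arithmetic. It is not the optimized x86 implementation the paper describes, and the function names and the choice of the Tanimoto coefficient are assumptions.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer (portable, unoptimized)."""
    return bin(x).count("1")


def tanimoto_bits(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bit vectors."""
    common = popcount(a & b)  # bits set in both fingerprints
    union = popcount(a) + popcount(b) - common
    return common / union if union else 1.0  # convention assumed for empty inputs


def similarity_row(query: int, library: list) -> list:
    """One row of a similarity matrix: the query against every library fingerprint."""
    return [tanimoto_bits(query, fp) for fp in library]


if __name__ == "__main__":
    library = [0b1110, 0b0111, 0b0001]
    print(similarity_row(0b1110, library))
```

Every entry of a similarity matrix costs one AND and a few population counts, so the speed of the population count largely determines the overall running time at scale. That is why hardware support for it matters in the setting the abstract describes.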

19.
Clustering methods have been widely used to group together similar conformational states from molecular simulations of biomolecules in solution. For applications such as the interaction of a protein with a surface, the orientation of the protein relative to the surface is also an important clustering parameter because of its potential effect on adsorbed‐state bioactivity. This study presents cluster analysis methods that are specifically designed for systems where both molecular orientation and conformation are important, and the methods are demonstrated using test cases of adsorbed proteins for validation. Additionally, because cluster analysis can be a very subjective process, an objective procedure for identifying both the optimal number of clusters and the best clustering algorithm to be applied to analyze a given dataset is presented. The method is demonstrated for several agglomerative hierarchical clustering algorithms used in conjunction with three cluster validation techniques. © 2016 Wiley Periodicals, Inc.

20.