Similar Literature
20 similar documents found (search time: 15 ms)
1.
We present a novel approach for enhancing the diversity of a chemical library, rooted in the theory of the wisdom of crowds. Our approach was motivated by a desire to tap into the collective experience of our global medicinal chemistry community and involved four basic steps: (1) Candidate compounds for acquisition were screened using various structural and property filters in order to eliminate clearly non-drug-like matter. (2) The remaining compounds were clustered together with our in-house collection using a novel fingerprint-based clustering algorithm that emphasizes common substructures and scales to millions of molecules. (3) Clusters populated exclusively by external compounds were identified as "diversity holes," and representative members of these clusters were presented to our global medicinal chemistry community, who were asked to specify which ones they liked, disliked, or were indifferent to, using a simple point-and-click interface. (4) The resulting votes were used to rank the clusters from most to least desirable and to prioritize which ones should be targeted for acquisition. Analysis of the voting results reveals interesting voter behaviors and distinct preferences for certain molecular property ranges that are fully consistent with lead-like profiles established through systematic analysis of large historical databases.
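A minimal sketch of step 4, vote aggregation, follows. The scoring scheme (+1 like, -1 dislike, 0 indifferent) and the normalization by vote count are assumptions; the abstract does not specify the exact weighting used.

```python
# Hypothetical sketch of step 4: ranking clusters by aggregated chemist votes.
# The +1/-1/0 weights and per-cluster normalization are assumptions.
from collections import defaultdict

VOTE_WEIGHTS = {"like": 1, "dislike": -1, "indifferent": 0}

def rank_clusters(votes):
    """votes: iterable of (cluster_id, vote) pairs collected from chemists.
    Returns cluster ids sorted from most to least desirable."""
    scores = defaultdict(int)
    counts = defaultdict(int)
    for cluster_id, vote in votes:
        scores[cluster_id] += VOTE_WEIGHTS[vote]
        counts[cluster_id] += 1
    # Normalize by vote count so heavily shown clusters are not favored
    # merely for having collected more votes.
    return sorted(scores, key=lambda c: scores[c] / counts[c], reverse=True)

# Example: cluster "B" ranks above "A" (two likes vs. one dislike).
print(rank_clusters([("A", "dislike"), ("B", "like"), ("B", "like")]))
```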

2.
Medicinal chemists have traditionally performed assessments of chemical diversity and the subsequent compound acquisition, although a recent study suggests that experts are usually inconsistent when reviewing large data sets. To analyze the scaffold diversity of commercially available screening collections, we have developed a general workflow aimed at (1) identifying drug-like compounds, (2) clustering them by maximum common substructures (scaffolds), (3) measuring the scaffold diversity encoded by each screening collection independently of its size, and finally (4) merging all common substructures into a nonredundant scaffold library that can easily be browsed by structural and topological queries. Starting from 2.4 million compounds from 12 commercial sources, four categories of libraries could be identified: large and medium-sized combinatorial libraries (low scaffold diversity), diverse libraries (medium diversity, medium size), and highly diverse libraries (high diversity, small size). The chemical space covered by the scaffold library can be searched to prioritize scaffold-focused libraries.
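As a rough illustration of a scaffold-based diversity readout, the sketch below uses RDKit's Bemis-Murcko scaffolds as a simple stand-in for the paper's maximum-common-substructure clustering; the unique-scaffolds-per-compound ratio is only a crude proxy for the paper's size-independent measure.

```python
# A minimal sketch: scaffold diversity via Bemis-Murcko scaffolds (RDKit),
# standing in for the paper's MCS-based clustering. The ratio used here is
# an assumption, not the paper's actual size-independent metric.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_diversity(smiles_list):
    """Fraction of unique scaffolds per valid compound (1.0 = every compound
    has its own scaffold; values near 0 suggest combinatorial redundancy)."""
    scaffolds = set()
    n = 0
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable structures
        core = MurckoScaffold.GetScaffoldForMol(mol)
        scaffolds.add(Chem.MolToSmiles(core))  # canonical scaffold SMILES
        n += 1
    return len(scaffolds) / n if n else 0.0

print(scaffold_diversity(["c1ccccc1CCN", "c1ccccc1CCO", "C1CCNCC1"]))
```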

3.
4.
In this paper we propose a new method, based on measurements of structural similarity, for the clustering of chemical databases. The proposed method allows dynamic adjustment of the size and number of the cells, or clusters, into which the database is classified. Classification is carried out using measurements of structural similarity obtained from the matching of molecular graphs, and the process is open to the use of different similarity indexes and different measurements of matching. It consists of projecting the similarity measures obtained among the elements of the database into a new similarity space. The ability to dynamically readjust the dimension and characteristics of the projection space to the most favorable conditions of the problem under study, together with the method's simplicity and computational efficiency, makes it appropriate for use with medium and large databases. The clustering method increases the performance of screening processes on chemical databases, facilitating the recovery of chemical compounds that share all, or subsets, of the substructures common to a given pattern. The work was validated on a database of 498 natural compounds with wide molecular diversity, extracted from the free SPECS and BIOSPECS B.V. database.
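The sketch below illustrates the general idea of projecting similarities into a tunable cell grid; the paper's graph-matching similarity is not reproduced, so a plain similarity matrix against reference structures stands in for it, and the binning scheme is an assumption.

```python
# A rough sketch: project each compound's similarities to reference structures
# into a "similarity space" and bin that space into adjustable cells.
# The graph-matching similarity of the paper is replaced by an arbitrary
# precomputed similarity matrix (an assumption).
import numpy as np

def project_and_cluster(sim_to_refs, cell_size=0.2):
    """sim_to_refs: (n_compounds, n_refs) array of similarities in [0, 1],
    one axis per reference structure. Compounds falling into the same
    hyper-cell of edge `cell_size` form one cluster; shrinking cell_size
    dynamically increases the number (and tightness) of clusters."""
    cells = np.floor(np.asarray(sim_to_refs) / cell_size).astype(int)
    clusters = {}
    for i, cell in enumerate(map(tuple, cells)):
        clusters.setdefault(cell, []).append(i)
    return list(clusters.values())

sims = np.array([[0.90, 0.10], [0.85, 0.15], [0.20, 0.70]])
print(project_and_cluster(sims, cell_size=0.2))  # first two compounds group
```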

5.
Four different two-dimensional fingerprint types (MACCS, Unity, BCI, and Daylight) and nine methods of selecting optimal cluster levels from the output of a hierarchical clustering algorithm were evaluated for their ability to select clusters that represent the chemical series present in some typical examples of chemical compound data sets. The methods were evaluated using Ward's clustering algorithm on subsets of the publicly available National Cancer Institute HIV data set, as well as on compounds from our corporate data set. We make a number of observations and recommendations about the choice of fingerprint type and cluster-level selection methods for use in this type of clustering.
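A minimal sketch of the underlying procedure, assuming SciPy: Ward clustering of binary fingerprints followed by one simple cluster-level selection rule (a fixed distance threshold). The paper compares nine such selection methods, none of which is reproduced exactly here, and the random fingerprints are placeholders.

```python
# Ward clustering of binary fingerprints plus a naive cluster-level cut.
# The fingerprints are random stand-ins; a real run would use MACCS, Unity,
# BCI, or Daylight keys as in the paper.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

fps = (np.random.default_rng(0).integers(0, 2, size=(50, 166))).astype(bool)
dists = pdist(fps, metric="jaccard")   # Soergel distance on binary vectors
Z = linkage(dists, method="ward")      # Ward's hierarchical clustering
# One (of many possible) level-selection rules: cut at 70% of max distance.
labels = fcluster(Z, t=0.7 * dists.max(), criterion="distance")
print(f"{labels.max()} clusters selected at this level")
```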

6.
This publication describes processes for selecting chemical compounds to build a high-throughput screening (HTS) collection for drug discovery, using the process currently implemented in the Discovery Technologies Unit of the Novartis Institute for Biomedical Research, Basel, Switzerland as a reference. More generally, the currently existing compound acquisition models and practices are discussed. Our informatics-, chemistry-, and biology-driven compound selection consists of two steps: 1) The individual compounds are filtered and grouped into three priority classes on the basis of their individual structural properties. Substructure filters are used to eliminate or penalize compounds with unwanted structural features, the similarity of the structures to reference ligands of the main proven druggable target families is computed, and drug-similar compounds are prioritized for the subsequent diversity analysis. 2) The compounds are compared to the archive compounds and a diversity analysis is performed; this is done separately for the prioritized, regular, and penalized compounds, with increasingly stringent dissimilarity criteria. The process includes collecting vendor catalogues and monitoring the availability of samples, together with the selection and purchase decision points. The development of a corporate vendor catalogue database is described. In addition to the per-molecule selection methods, selection criteria for scaffold and combinatorial chemistry projects in collaboration with compound vendors are discussed.
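A hypothetical sketch of step 1 follows, assuming RDKit. The two SMARTS patterns are illustrative examples only, not the actual Novartis filter set, and the class names are placeholders.

```python
# Hypothetical sketch of step 1: sorting candidates into priority classes with
# substructure filters. Both SMARTS patterns are illustrative assumptions.
from rdkit import Chem

REJECT = [Chem.MolFromSmarts(s) for s in ["[N+](=O)[O-]"]]   # e.g. nitro group
PENALIZE = [Chem.MolFromSmarts(s) for s in ["C(=O)Cl"]]      # e.g. acyl chloride

def priority_class(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or any(mol.HasSubstructMatch(p) for p in REJECT):
        return "eliminated"
    if any(mol.HasSubstructMatch(p) for p in PENALIZE):
        return "penalized"
    # A real pipeline would upgrade compounds to "prioritized" based on
    # similarity to reference ligands of druggable target families.
    return "regular"

print(priority_class("c1ccccc1[N+](=O)[O-]"))  # -> eliminated
```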

7.
In this paper, we propose a new method for the clustering of chemical databases based on projecting measurements of structural similarity onto multidimensional spaces. The proposed method permits tuning of the clustering process through the selection of the dimension of the projection space, the normal vectors, and the sensitivity of the projection process. The structural similarity of each element with respect to the database elements is projected onto the defined spaces, generating clusters that represent the characteristics and diversity of the database and whose size and characteristics can be easily adjusted.

8.
In this paper we introduce a quantitative model that relates chemical structural similarity to biological activity, and in particular to the activity of lead series of compounds in high-throughput assays. From this model we derive the optimal screening collection make-up for a given fixed collection size, and we identify the conditions under which a diverse collection of compounds, or a collection focused on particular regions of chemical space, is the appropriate strategy. We derive from the model a diversity function that may be used to assess compounds for acquisition, or libraries for combinatorial synthesis, by their ability to complement an existing screening collection. The diversity function is linked directly through the model to the goal of more frequent discovery of lead series from high-throughput screening. We show how the model may also be used to derive relationships between collection size and the probability of lead discovery in high-throughput screening, and to guide the judicious application of structural filters.
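As a much simpler illustrative stand-in (not the paper's actual model), the toy calculation below shows the kind of size/probability relationship such a model yields: assume chemical space is divided into C equally likely clusters, of which A contain a lead series, and that screening any compound from an active cluster finds the lead.

```python
# Toy stand-in for a collection-size vs. lead-probability relationship.
# All numbers and the uniform-cluster assumption are hypothetical.
def p_lead(n_compounds, n_clusters, n_active_clusters):
    """Probability that a randomly assembled collection of n_compounds
    touches at least one active cluster."""
    miss = (1 - n_active_clusters / n_clusters) ** n_compounds
    return 1 - miss

for n in (10_000, 100_000, 1_000_000):
    print(n, round(p_lead(n, n_clusters=500_000, n_active_clusters=5), 3))
# Output illustrates diminishing returns as collection size grows.
```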

9.
10.
METAPRINT, a metabolic fingerprint, has been developed by predicting metabolic pathways and the corresponding potential metabolites. Calculated drug-likeness parameters (log P and MW) have been incorporated into METAPRINT to allow the encoding of metabolic diversity within a chemical library. The application of METAPRINT to the design of cassette dosing experiments is demonstrated using a library of alpha-1a antagonists synthesized at Glaxo Wellcome. Results obtained with Ward's clustering algorithm suggest that METAPRINTs are able to discriminate between low- and high-clearance compounds. Cassette design was performed by maximizing the intracassette Euclidean distances between compounds in METAPRINT space, using simulated annealing. Calculated distances in METAPRINT space were in accordance with experimental data.
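A minimal sketch of the cassette-design step follows: simulated annealing that swaps compounds between cassettes, accepting moves that increase summed intra-cassette Euclidean distance. METAPRINT itself is not reproduced; a random descriptor matrix stands in for it, and the cooling schedule is an assumption.

```python
# Cassette design by simulated annealing over a stand-in descriptor matrix.
# The acceptance rule and geometric cooling schedule are assumptions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((12, 4))                    # stand-in "METAPRINT" vectors
cassettes = np.repeat(np.arange(3), 4)     # 3 cassettes x 4 compounds

def score(assign):
    """Sum of pairwise Euclidean distances within each cassette."""
    return sum(np.linalg.norm(X[i] - X[j])
               for c in np.unique(assign)
               for idx in [np.flatnonzero(assign == c)]
               for a, i in enumerate(idx) for j in idx[a + 1:])

T = 1.0
for step in range(2000):
    i, j = rng.choice(len(X), size=2, replace=False)
    if cassettes[i] == cassettes[j]:
        continue                            # only cross-cassette swaps matter
    trial = cassettes.copy()
    trial[i], trial[j] = trial[j], trial[i]
    delta = score(trial) - score(cassettes)
    if delta > 0 or rng.random() < np.exp(delta / T):
        cassettes = trial                   # accept improving / lucky moves
    T *= 0.999                              # geometric cooling

print(cassettes, round(score(cassettes), 2))
```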

11.
Some modifications were introduced into the previously described Centroid diversity sorting algorithm, which uses a cosine similarity metric. The modified algorithm is suitable for working with large databases on personal computers; for example, diversity sorting of a database with more than a million records requires less than 9 h (Pentium III, 800 MHz). The problem of selecting new compounds for an existing collection so as to maximize the collection's diversity is also examined, and the article describes a new algorithm for the selection of heterocyclic compounds.
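The abstract does not spell out the Centroid algorithm itself, so the sketch below is a generic diversity sort with a cosine metric as a stand-in: greedily emit the compound whose cosine similarity to the centroid of the already-emitted compounds is lowest.

```python
# Generic cosine-based diversity sort (a stand-in, not the paper's exact
# Centroid algorithm). Maintains a running centroid for O(n^2) total cost.
import numpy as np

def diversity_sort(X):
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    order = [0]
    remaining = list(range(1, len(X)))
    centroid = X[0].copy()
    while remaining:
        sims = X[remaining] @ (centroid / np.linalg.norm(centroid))
        k = remaining.pop(int(np.argmin(sims)))       # most dissimilar next
        order.append(k)
        centroid += X[k]                              # running centroid update
    return order

print(diversity_sort([[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1]]))
```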

12.
13.
Hierarchical clustering algorithms such as Ward's or complete-link are commonly used in compound selection and diversity analysis. Many such applications use binary representations of chemical structures, such as MACCS keys or Daylight fingerprints, and dissimilarity measures such as the Euclidean or the Soergel measure. However, hierarchical clustering algorithms can generate ambiguous results owing to what is known in the cluster analysis literature as the ties-in-proximity problem, i.e., compounds or clusters of compounds that are equidistant from a compound or cluster in a given collection. Ambiguous ties can occur when clustering only a few hundred compounds, and the larger the number of compounds to be clustered, the greater the chance of significant ambiguity: as the number of ties in proximity increases relative to the total number of proximities, the possibility of ambiguity also increases. To ensure that there are no ambiguous ties, we show by a probabilistic argument that the number of compounds needs to be less than 2·n^(1/4), where n is the total number of proximities, provided the measure used to generate the proximities creates a uniform distribution without statistically preferred values. The common measures do not produce uniformly distributed proximities, but rather statistically preferred values that tend to increase the number of ties in proximity. Hence, the number of possible proximities and the distribution of statistically preferred values of a similarity measure, given a bit-vector representation of a specific length, are directly related to the number of ties in proximity for a given data set. We explore the ties-in-proximity problem using a number of chemical collections with varying degrees of diversity, several common similarity measures, and several clustering algorithms. Our results are consistent with our probabilistic argument and show that this problem is significant even for relatively small compound sets.
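A quick empirical check of the effect, under the assumption of random 166-bit "MACCS-like" keys: Tanimoto values computed from short binary fingerprints can only take a limited set of rational values, so tied proximities appear even in small compound sets.

```python
# Count distinct Tanimoto proximities among random short fingerprints to show
# how quickly ties accumulate. The random bit vectors are stand-ins for real keys.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
fps = rng.random((200, 166)) < 0.3   # 200 random 166-bit fingerprints

def tanimoto(a, b):
    both = np.sum(a & b)
    return both / (np.sum(a) + np.sum(b) - both)

prox = [tanimoto(fps[i], fps[j]) for i, j in combinations(range(len(fps)), 2)]
print(f"{len(prox)} proximities, {len(set(prox))} distinct values")
# Far fewer distinct values than proximities -> many ties for clustering to break.
```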

14.
15.
16.
We present an efficient method to cluster large chemical databases in a stepwise manner. Databases are first clustered with an extended exclusion sphere algorithm based on Tanimoto coefficients calculated from Daylight fingerprints. Substructures are then extracted from the clusters by iterative application of a maximum common substructure algorithm, and clusters with common substructures are merged through a second application of an exclusion sphere algorithm. In a separate step, singletons are compared to the cluster substructures and added to a cluster if the similarity is sufficiently high. The method identifies tight clusters with conserved substructures and generates singletons only if structures are truly distinct from all other library members. It has been applied successfully to identify the most frequently occurring scaffolds in databases, to select analogues of screening hits, and to prioritize chemical libraries offered by commercial vendors.
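A minimal sketch of the first step, assuming RDKit: sphere-exclusion clustering on Tanimoto distances via the Butina implementation. Morgan fingerprints stand in for the Daylight fingerprints of the paper, and the extended exclusion-sphere variant, MCS merging, and singleton rescue are not shown.

```python
# Sphere-exclusion (Butina) clustering on 1 - Tanimoto distances as a proxy
# for the paper's first step. Morgan fingerprints replace Daylight ones.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

smiles = ["c1ccccc1O", "c1ccccc1N", "CCCCCC", "CCCCCCC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Lower-triangle distance list, as Butina.ClusterData expects.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), distThresh=0.6, isDistData=True)
print(clusters)  # tuples of compound indices; the first member is the centroid
```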

17.
18.
The assembly of large compound libraries for the purpose of screening against various receptor targets, to identify chemical leads for drug discovery programs, has created a need for methods to measure the molecular diversity of such libraries. The method described here, for which we propose the acronym RESIS (Receptor Site Interaction Simulation), relates directly to this use. A database of three-dimensional representations of the compounds in the library is built, and a set of three-point, three-dimensional theoretical receptor sites is generated based on putative hydrophobic and polar interactions. A series of flexible three-dimensional searches is then performed over the database, using each of the theoretical sites as the basis for one such search. The resulting pattern of hits across the grid of theoretical receptor sites provides a measure of the molecular diversity of the compound library and can be conveniently displayed as a density map, which gives a readily comprehensible visual impression of the library's diversity characteristics. A library of 7500 drug compounds derived from the CIPSLINEPC databases was characterized with respect to molecular diversity using the RESIS method, and some specific uses of the resulting information are discussed. The results from the RESIS method were also compared with those from a recently published two-dimensional approach to assessing molecular diversity, using sets of compounds from the Maybridge database (MAY).
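As an illustrative reduction of the RESIS readout (the flexible 3D pharmacophore searches themselves are far outside this sketch), assume a boolean hit matrix of compounds against theoretical three-point receptor sites; the diversity readout is then how broadly and evenly the library covers the site grid.

```python
# Toy RESIS-style readout over a fake compounds x sites hit matrix.
# The matrix, the number of sites, and the summary statistics are assumptions.
import numpy as np

rng = np.random.default_rng(2)
hits = rng.random((7500, 300)) < 0.05   # fake hit pattern over 300 sites

site_density = hits.sum(axis=0)         # the "density map" over receptor sites
coverage = np.mean(site_density > 0)    # fraction of sites hit at all
print(f"site coverage: {coverage:.2%}, "
      f"mean hits per covered site: {site_density[site_density > 0].mean():.0f}")
```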

19.
In this paper, we propose an algorithm for the design of the lead generation libraries required in combinatorial drug discovery. The algorithm simultaneously addresses the two key criteria of diversity and representativeness of the compounds in the resulting library and is computationally efficient when applied to a large class of lead generation design problems; additional constraints on experimental resources are also incorporated into the framework. A computationally efficient, scalable algorithm is developed in which the ability of the deterministic annealing algorithm to identify clusters is exploited to truncate computations over the entire data set to computations over individual clusters. An analysis of this algorithm quantifies the tradeoff between the truncation error and the computational effort. Results on test data sets corroborate the analysis and show improvements by factors as large as 10 or more, depending on the data set.
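A compact sketch of the deterministic-annealing clustering core the paper builds on: soft cluster assignments via a Gibbs distribution whose temperature T is gradually lowered, so clusters "split out" as T falls. The paper's library-design constraints and cluster-wise truncation are not reproduced, and the schedule parameters are assumptions.

```python
# Deterministic-annealing clustering core (a sketch, not the paper's full
# library-design algorithm). Cooling and iteration counts are assumptions.
import numpy as np

def da_cluster(X, k=3, T=2.0, T_min=0.05, cooling=0.9, iters=20):
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    while T > T_min:
        for _ in range(iters):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            d2 -= d2.min(axis=1, keepdims=True)   # stabilize the exponentials
            p = np.exp(-d2 / T)
            p /= p.sum(axis=1, keepdims=True)     # soft (Gibbs) assignments
            centers = (p.T @ X) / p.sum(axis=0)[:, None]
        T *= cooling                               # annealing schedule
    return centers, p.argmax(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=i * 4.0, scale=1.0, size=(30, 2)) for i in range(3)])
centers, labels = da_cluster(X)
print(np.round(centers, 1))  # recovers the three planted cluster centers
```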

20.