Similar literature
 20 similar documents found (search time: 93 ms)
1.
Associative neural network (ASNN) represents a combination of an ensemble of feed-forward neural networks and the k-nearest neighbor technique. This method uses the correlation between ensemble responses as a measure of distance among the analyzed cases for the nearest neighbor technique. This provides an improved prediction by the bias correction of the neural network ensemble. An associative neural network has a memory that can coincide with the training set. If new data becomes available, the network further improves its predictive ability and provides a reasonable approximation of the unknown function without a need to retrain the neural network ensemble. This feature of the method dramatically improves its predictive ability over traditional neural networks and k-nearest neighbor techniques, as demonstrated using several artificial data sets and a program to predict lipophilicity of chemical compounds. Another important feature of ASNN is the possibility to interpret neural network results by analysis of correlations between data cases in the space of models. It is shown that analysis of such correlations makes it possible to provide "property-targeted" clustering of data. The possible applications and importance of ASNN in drug design and medicinal and combinatorial chemistry are discussed. The method is available on-line at http://www.vcclab.org/lab/asnn.
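
The bias-correction idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the published ASNN implementation: it assumes Pearson correlation as the similarity between ensemble response vectors (the abstract only says "correlation"), and the names `asnn_predict`, `memory_responses`, and `memory_targets` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant response vector: treated as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)


def asnn_predict(new_responses, memory_responses, memory_targets, k=2):
    """Ensemble mean corrected by the mean residual of the k most correlated memory cases.

    new_responses:    outputs of each ensemble member for the new case
    memory_responses: one list of ensemble outputs per stored case
    memory_targets:   known target value of each stored case
    """
    ensemble_mean = sum(new_responses) / len(new_responses)
    # Rank stored cases by correlation of their response vectors with the new one.
    nearest = sorted(
        range(len(memory_responses)),
        key=lambda i: pearson(new_responses, memory_responses[i]),
        reverse=True,
    )[:k]
    # Residual = known target minus the ensemble mean for that stored case.
    residuals = [
        memory_targets[i] - sum(memory_responses[i]) / len(memory_responses[i])
        for i in nearest
    ]
    return ensemble_mean + sum(residuals) / len(residuals)
```

Because the correction only reads the stored cases, appending new (responses, target) pairs to the memory changes predictions without retraining the ensemble, which is the property the abstract emphasizes.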

2.
Jure Zupan, Mikrochimica Acta, 1986, 89(1-6): 243-260
The main aspects of handling large amounts of complex information, in general, and spectroscopic data, in particular, are described and discussed. In the first part, the basic terms and procedures are defined and illustrated with examples from different spectrometries. Items such as the representation of complex data of different types, measurement and information space, metrics of a multi-dimensional space, different types of transformations, information content of a given representation, the concept of a frame, and the holistic and reductionistic aspects of information are explained in more detail. In the second part, the organization of complex data with the intention of forming an expert system is discussed. Emphasis is put on the clustering of data, on the criteria for clustering, and on the ways and means governing the formation of hierarchies of clusters and the frames they represent. Furthermore, a model for an expert system (which is able to acquire new information, i.e. which is able to learn, and to use the acquired knowledge for predicting the structural features of unknown compounds) based on the hierarchical organization of a large amount of data is outlined. Finally, the prospects and limitations of expert systems based on the hierarchical clustering of large data collections are discussed.

3.
4.
The National Cancer Institute Division of Cancer Treatment has revised its drug-screening program. About 230,000 compounds in our repository are available for screening under the new protocol. This paper is the first report on an attempt to extract a representative sample of these compounds by clustering. It reviews the establishment of the clustering method on a 4980-compound initial sample. The clustering algorithm is fairly simple. However, the molecular fragments employed to match the compounds are somewhat complex, so that a large number of compounds can be distinguished.

5.
The tremendous increase in chemical structure and biological activity data brought about through combinatorial chemistry and high-throughput screening technologies has created the need for sophisticated graphical tools for visualizing and exploring structure-activity data. Visualization plays an important role in exploring and understanding relationships within such multidimensional data sets. Many chemoinformatics software applications apply standard clustering techniques to organize structure-activity data, but they differ significantly in their approaches to visualizing clustered data. Molecular Property eXplorer (MPX) is unique in its presentation of clustered data in the form of heatmaps and tree-maps. MPX employs agglomerative hierarchical clustering to organize data on the basis of the similarity between 2D chemical structures or similarity across a predefined profile of biological assay values. Visualization of hierarchical clusters as tree-maps and heatmaps provides simultaneous representation of cluster members along with their associated assay values. Tree-maps convey both the spatial relationship among cluster members and the value of a single property (activity) associated with each member. Heatmaps provide visualization of the cluster members across an activity profile. Unlike a tree-map, however, a heatmap does not convey the spatial relationship between cluster members. MPX seamlessly integrates tree-maps and heatmaps to represent multidimensional structure-activity data in a visually intuitive manner. In addition, MPX provides tools for clustering data on the basis of chemical structure or activity profile, displaying 2D chemical structures, and querying the data over a specified activity range or a set of chemical structure criteria (e.g., Tanimoto similarity, substructure match, and "R-group" analysis).
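
As a rough illustration of the Tanimoto-similarity query mentioned in the abstract above, the sketch below represents each 2D fingerprint as a set of "on" bit positions. It is not MPX code; `query_by_similarity` and the toy library are hypothetical, and returning 1.0 for two empty fingerprints is a convention assumed here.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # two empty fingerprints: convention assumed for this sketch
    return len(a & b) / union


def query_by_similarity(query, library, threshold):
    """Return (name, similarity) pairs at or above the threshold, most similar first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(
        (hit for hit in hits if hit[1] >= threshold),
        key=lambda hit: hit[1],
        reverse=True,
    )


# Hypothetical three-compound library.
library = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {9}}
hits = query_by_similarity({1, 2, 3}, library, threshold=0.5)
```

The same pairwise similarities, converted to distances as 1 minus similarity, are the usual input to an agglomerative hierarchical clustering of structures.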

6.
This paper describes the first application of fuzzy c-means clustering for the selection of representatives from assemblies of conformations or alignments. In the case of alignments, their quality is taken into account using a weighted c-means scheme, developed in this work. The performance of fuzzy cluster validity measures, such as compactness, partition function, and entropy, is studied on several examples, but the visual 3D representation of data points is shown to be most beneficial in determining the optimum number of clusters. Fuzzy clustering is expected to perform better than crisp clustering methods in cases where there are a significant number of "outliers", such as in molecular dynamics simulations and molecular alignments.
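
For readers unfamiliar with fuzzy c-means, the standard membership and center updates can be sketched as follows. This is a generic textbook formulation, not the weighted scheme developed in the paper: the optional `weights` argument simply scales each point's contribution to the center update, which is an assumption made for illustration, and `partition_coefficient` is one common validity measure chosen here as an example.

```python
def fuzzy_c_means(points, centers, m=2.0, weights=None, iterations=100, tol=1e-9):
    """Standard fuzzy c-means with fuzzifier m > 1.

    Returns (centers, memberships); memberships[i][j] is the degree to which
    point i belongs to cluster j, computed in the last membership update.
    """
    n, dim = len(points), len(points[0])
    weights = weights or [1.0] * n
    centers = [tuple(c) for c in centers]
    u = []
    for _ in range(iterations):
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        u = []
        for p in points:
            d2 = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers]
            if 0.0 in d2:
                # Point sits exactly on a center: give it full membership there.
                zeros = d2.count(0.0)
                row = [1.0 / zeros if d == 0.0 else 0.0 for d in d2]
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
                total = sum(inv)
                row = [v / total for v in inv]
            u.append(row)
        # Center update: mean of the points weighted by w_i * u_ij^m.
        new_centers = []
        for j in range(len(centers)):
            coef = [weights[i] * u[i][j] ** m for i in range(n)]
            denom = sum(coef)
            new_centers.append(tuple(
                sum(coef[i] * points[i][k] for i in range(n)) / denom
                for k in range(dim)
            ))
        shift = max(
            abs(a - b)
            for old, new in zip(centers, new_centers)
            for a, b in zip(old, new)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers, u


def partition_coefficient(u):
    """Mean of squared memberships; values near 1 indicate a nearly crisp partition."""
    return sum(v * v for row in u for v in row) / len(u)
```

Outliers end up with memberships spread across several clusters rather than being forced into one, which is the behavior the abstract contrasts with crisp clustering.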

7.
Summary: We describe an approach to protein structure comparison designed to detect distantly related proteins of similar fold, where the procedure must be sufficiently flexible to take into account the elasticity of protein folds without losing specificity. Protein structures are represented as a series of secondary structure elements, where for each element a local environment describes its relations with the elements that surround it. Secondary structures are then aligned by comparing their features and local environments. The procedure is illustrated with searches of a database of 468 protein structures in order to identify proteins of similar topology to porcine pepsin, porphobilinogen deaminase and serum amyloid P-component. In all cases the searches correctly identify protein structures of similar fold as the search proteins. Multiple cross-comparisons of protein structures allow the clustering of proteins of similar fold. This is exemplified with a clustering of α/β- and β-class protein structures. We discuss applications of the comparison and clustering of three-dimensional protein structures to comparative modelling and structure-based protein design.

8.
Nanoscale atomic clusters in atom probe tomographic data are not universally defined but instead are characterized by the clustering algorithm used and the parameter values controlling the algorithmic process. A new core-linkage clustering algorithm is developed, combining fundamental elements of the conventional maximum separation method with density-based analyses. A key improvement to the algorithm is the independence of algorithmic parameters inherently unified in previous techniques, enabling a more accurate analysis to be applied across a wider range of material systems. Further, an objective procedure for the selection of parameters based on approximating the data with a model of complete spatial randomness is developed and applied. The use of higher nearest neighbor distributions is highlighted to give insight into the nature of the clustering phenomena present in a system and to generalize the clustering algorithms used to analyze it. Maximum separation, density-based scanning, and the core-linkage algorithm developed within this study were separately applied to the investigation of fine-scale clustering of solute atoms in an Al-1.9Zn-1.7Mg (at.%) alloy at two distinct states of early phase decomposition, and the results of these analyses were evaluated.
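
The conventional maximum separation method referred to above can be sketched as a connected-components search: solute atoms closer than a maximum distance are linked, and linked groups below a minimum size are discarded. This is only the baseline idea, not the core-linkage algorithm of the paper; the parameter names `d_max` and `n_min` are assumed for illustration, and the brute-force pair loop is O(n²), so real atom probe data sets would need a spatial index.

```python
from math import dist  # Python 3.8+


def maximum_separation_clusters(solute_positions, d_max, n_min):
    """Group solute atoms linked by distances <= d_max; keep groups with >= n_min atoms.

    Returns a sorted list of clusters, each a list of atom indices.
    """
    n = len(solute_positions)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of solute atoms within the maximum separation distance.
    for i in range(n):
        for j in range(i + 1, n):
            if dist(solute_positions[i], solute_positions[j]) <= d_max:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) >= n_min)
```

In this baseline a single distance both decides which atoms count as dense and how they are linked, which is the kind of parameter coupling the abstract says the core-linkage algorithm removes.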

9.
This study describes the analysis of total hops essential oils from 18 cultivated varieties of hops, five of which were bred in Lithuania, and 7 wild hop forms using gas chromatography-mass spectrometry. The study sought to organise the samples of hops into clusters, according to 72 semi-volatile compounds, by applying a well-known method, k-means clustering analysis, and to identify the origin of the Lithuanian hop varieties. The bouquet of the hops essential oil was composed of various esters, terpenes, hydrocarbons and ketones. Monoterpenes (mainly β-myrcene), sesquiterpenes (dominated by β-caryophyllene and α-humulene) and oxygenated sesquiterpenes (mainly caryophyllene oxide and humulene epoxide II) were the main compound groups detected in the samples tested. The above compounds, together with α-muurolene, were the only compounds found in all the samples. Qualitative and quantitative differences were observed in the composition of the essential oils of the hop varieties analysed. For successful and statistically significant clustering of the data obtained, expertise and skills in employing chemometric analysis methods are necessary. The result is also highly dependent on the set of samples (representativeness) used for segmentation into groups, the technique for pre-processing the data, the method selected for partitioning the samples according to the similarity measures chosen, etc. To achieve a large and representative data set for clustering analysis from a small number of measurements, numerical simulation was applied using the Monte Carlo method with normal and uniform distributions and several relative standard deviation values. The grouping was performed using the k-means clustering method, employing several techniques for evaluating the optimal number of clusters (Davies-Bouldin index, distortion function, etc.) and different data pre-processing approaches.
The hop samples analysed were separated into 3 or 5 clusters, depending on the data filtering scenario used. However, the targeted Lithuanian hop varieties were clustered identically in both cases and fell into the same group together with other cultivated hop varieties from Ukraine and Poland.
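
One of the cluster-number criteria named above, the Davies-Bouldin index, is easy to compute once a partition is available. The sketch below uses the common definition with Euclidean distances and centroid-based scatter; it is a generic illustration, not the authors' code, and it assumes at least two clusters with distinct centroids.

```python
from math import dist  # Python 3.8+


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition (list of clusters, each a list of points).

    Lower values indicate compact, well-separated clusters.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of the members of a cluster to its centroid.
    scatter = [
        sum(dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

In practice the index is evaluated for k-means partitions obtained with several candidate cluster counts, and the count giving the lowest value is preferred.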

10.
We discuss the clustering of 234 environmental samples resulting from an extensive monitoring program concerning soil lead content, plant lead content, traffic density, and distance from the road at different sampling locations in former East Germany. Considering the structure of the data and the unsatisfactory results obtained by applying classical clustering and principal component analysis, it appeared evident that fuzzy clustering could be one of the best solutions. We applied, in order, three fuzzy clustering algorithms: the fuzzy c-means (FCM) algorithm; the Gustafson–Kessel (GK) algorithm, which may detect clusters of ellipsoidal shapes in data by introducing an adaptive distance norm for each cluster; and the fuzzy c-varieties (FCV) algorithm, which was developed for recognition of r-dimensional linear varieties in high-dimensional data (lines, planes or hyperplanes). Fuzzy clustering with a convex combination of point prototypes and different multidimensional linear prototypes is also discussed and applied for the first time in analytical chemistry (environmetrics). The results obtained in this study show the advantages of the FCV and GK algorithms over the FCM algorithm. The performance of each algorithm is illustrated by graphs and evaluated by the values of some conventional cluster validity indices. The values of the validity indices are in very good agreement with the quality of the clustering results.
Figure: Projection of all samples on the plane defined by the membership degrees to clusters A2 and A4, obtained using the fuzzy c-varieties (FCV) algorithm.

11.
The Empty Space (ES) index presented here evaluates the fraction of the information space without experimental points, i.e. the space where the distance from an experimental point is significantly larger than the mean distance between the experimental points themselves. ES can be used to eliminate the ambiguity of some clustering indexes that aim to evaluate the separation of the data set into clusters but are really a mixed measure of clustering, of empty space (the empty space does not necessarily correspond to the break between clusters) and of the degree of uniformity of the objects. The ES index can also be used to correct the MST index, the clustering index based on the distribution of edge lengths in the minimum spanning tree connecting the objects. The corrected MST index seems to be a reliable measure of the clustering degree.
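
The MST index mentioned above is built on the distribution of edge lengths in the minimum spanning tree of the objects. The abstract gives no formula for either the MST index or the ES correction, so the sketch below only computes that edge-length distribution, using Prim's algorithm on the complete Euclidean graph; how the lengths are turned into an index is left open.

```python
from math import dist, inf  # dist requires Python 3.8+


def mst_edge_lengths(points):
    """Lengths of the n-1 edges of the Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [inf] * n  # shortest known distance from each point to the growing tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:  # the starting point contributes no edge
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return lengths
```

A few edges much longer than the rest suggest gaps in the data, which may be breaks between clusters or merely empty space; that ambiguity is what the ES index is meant to resolve.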

12.
Six rigid-body parameters (Shift, Slide, Rise, Tilt, Roll, Twist) are commonly used to describe the relative displacement and orientation of successive base pairs in a nucleic acid structure. The present work adapts this approach to describe the relative displacement and orientation of any two planes in an arbitrary molecule, specifically planes which contain important pharmacophore elements. Relevant code from the 3DNA software package (Nucleic Acids Res. 2003, 31, 5108-5121) was generalized to treat molecular fragments other than DNA bases as input for the calculation of the corresponding rigid-body (or "planes") parameters. These parameters were used to construct feature vectors for a fuzzy relational clustering study of over 700 conformations of a flexible analogue of the dopamine reuptake inhibitor, GBR 12909. Several cluster validity measures were used to determine the optimal number of clusters. Translational (Shift, Slide, Rise) rather than rotational (Tilt, Roll, Twist) features dominate clustering based on planes that are relatively far apart, whereas both types of features are important to clustering when the pair of planes are close together. This approach was able to classify the data set of molecular conformations into groups and to identify representative conformers for use as template conformers in future Comparative Molecular Field Analysis studies of GBR 12909 analogues. The advantage of using the planes parameters, rather than the combination of atomic coordinates and angles between molecular planes used in our previous fuzzy relational clustering of the same data set (J. Chem. Inf. Model. 2005, 45, 610-623), is that the present clustering results are independent of molecular superposition and the technique is able to identify clusters in the molecule considered as a whole. This approach is easily generalizable to any two planes in any molecule.

13.
Several parallel algorithms for Fock matrix construction are described. The algorithms calculate only the unique integrals, distribute the Fock and density matrices over the processors of a massively parallel computer, use blocking techniques to construct the distributed data structures, and use clustering techniques on each processor to maximize data reuse. Algorithms based on both square and row-blocked distributions of the Fock and density matrices are described and evaluated. Variants of the algorithms are discussed that use either triple-sort or canonical ordering of integrals, and dynamic or static task clustering schemes. The algorithms are shown to adapt to screening, with communication volume scaling down with computation costs. Modeling techniques are used to characterize algorithm performance. Given the characteristics of existing massively parallel computers, all the algorithms are shown to be highly efficient for problems of moderate size. The algorithms using the row-blocked data distribution are the most efficient. © 1996 by John Wiley & Sons, Inc.

14.
Mass spectrometry imaging (MSI) is widely used for the label-free molecular mapping of biological samples. The identification of co-localized molecules in MSI data is crucial to the understanding of biochemical pathways. One of the key challenges in molecular colocalization is that complex MSI data are too large for manual annotation but too small for training deep neural networks. Herein, we introduce a self-supervised clustering approach based on contrastive learning, which shows excellent performance in the clustering of MSI data. We train a deep convolutional neural network (CNN) using MSI data from a single experiment without manual annotations to effectively learn high-level spatial features from ion images and classify them based on molecular colocalizations. We demonstrate that contrastive learning generates ion image representations that form well-resolved clusters. Subsequent self-labeling is used to fine-tune both the CNN encoder and linear classifier based on confidently classified ion images. This new approach enables autonomous and high-throughput identification of co-localized species in MSI data, which will dramatically expand the application of spatial lipidomics, metabolomics, and proteomics in biological research.

Contrastive learning is used to train a deep convolutional neural network to identify high-level features in mass spectrometry imaging data. These features enable self-supervised clustering of ion images without manual annotation.

15.
Hierarchical clustering algorithms such as Ward's or complete-link are commonly used in compound selection and diversity analysis. Many such applications utilize binary representations of chemical structures, such as MACCS keys or Daylight fingerprints, and dissimilarity measures, such as the Euclidean or the Soergel measure. However, hierarchical clustering algorithms can generate ambiguous results owing to what is known in the cluster analysis literature as the ties in proximity problem, i.e., compounds or clusters of compounds that are equidistant from a compound or cluster in a given collection. Ambiguous ties can occur when clustering only a few hundred compounds, and the larger the number of compounds to be clustered, the greater the chance for significant ambiguity. Namely, as the number of "ties in proximity" increases relative to the total number of proximities, the possibility of ambiguity also increases. To ensure that there are no ambiguous ties, we show by a probabilistic argument that the number of compounds needs to be less than $2n^{1/4}$, where n is the total number of proximities, and the measure used to generate the proximities creates a uniform distribution without statistically preferred values. The common measures do not produce uniformly distributed proximities, but rather statistically preferred values that tend to increase the number of ties in proximity. Hence, the number of possible proximities and the distribution of statistically preferred values of a similarity measure, given a bit vector representation of a specific length, are directly related to the number of ties in proximities for a given data set. We explore the ties in proximity problem, using a number of chemical collections with varying degrees of diversity, given several common similarity measures and clustering algorithms. Our results are consistent with our probabilistic argument and show that this problem is significant for relatively small compound sets.
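
To make the ties in proximity problem concrete, the sketch below finds compounds that are exactly equidistant from two or more other compounds under the Soergel distance on binary fingerprints. It is an illustration only: fingerprints are modeled as sets of "on" bit positions, distances are kept as exact fractions so that ties are not hidden by floating-point rounding, and `compounds_with_ties` and `compound_bound` are hypothetical helper names. The bound function simply evaluates the expression quoted in the abstract.

```python
from collections import Counter
from fractions import Fraction


def soergel(a, b):
    """Soergel distance between binary fingerprints (1 - Tanimoto), as an exact fraction."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return Fraction(0)  # two empty fingerprints: treated as identical (assumption)
    return 1 - Fraction(len(a & b), union)


def compounds_with_ties(fingerprints):
    """Indices of compounds that are equidistant from two or more other compounds."""
    tied = []
    for i, fp in enumerate(fingerprints):
        counts = Counter(
            soergel(fp, other)
            for j, other in enumerate(fingerprints)
            if j != i
        )
        if any(c > 1 for c in counts.values()):
            tied.append(i)
    return tied


def compound_bound(n_proximities):
    """The bound 2 * n**(1/4) on the number of compounds, as stated in the abstract."""
    return 2 * n_proximities ** 0.25
```

Short bit vectors allow only a limited set of distinct ratios, so exact ties appear even in very small collections, which is consistent with the abstract's observation that common measures have statistically preferred values.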

16.
Serial analysis of gene expression (SAGE) is a powerful tool to obtain gene expression profiles. Clustering analysis is a valuable technique for analyzing SAGE data. In this paper, we propose an adaptive clustering method for SAGE data analysis, namely, PoissonAPS. The method incorporates a novel clustering algorithm, Affinity Propagation (AP). While the AP algorithm has demonstrated good performance on many different data sets, it also faces several limitations. PoissonAPS overcomes the limitations of AP by using the clustering validation measure as a cost function of merging and splitting, and as a result, it can automatically cluster SAGE data without user-specified parameters. We evaluated PoissonAPS and compared its performance with other methods on several real-life SAGE datasets. The experimental results show that PoissonAPS can produce meaningful and interpretable clusters for SAGE data.

17.
This study shows how chemistry knowledge and reasoning are taken into account for building a new methodology that aims at automatically grouping data having a chronological structure. We consider combinatorial catalytic experiments where the evolution of a reaction (e.g., conversion) over time is expected to be analyzed. The mathematical tool has been developed to compare and group curves taking into account their shape. The strategy, which consists of combining hierarchical clustering with the k-means algorithm, is described and compared with both algorithms used separately. The hybridization is shown to be of great interest. Then, a second application mode of the proposed methodology is presented. Once meaningful clusters according to the chemist's preferences and goals are successfully achieved, the induced model may be used in order to automatically classify new experimental results. The grouping of the new catalysts tested for the Heck coupling reaction between styrene and iodobenzene verified the set of criteria "defined" during the initial clustering step, and facilitated a quick identification of the catalytic behaviors following the user's preferences.

18.
Multivariate image data provide detailed information in variable and image space. Most traditional clustering methods are based on variable information only and ignore spatial information. A method based on both variable and spatial information could improve the results substantially.

In this review, we study the benefits and the pitfalls of including spatial information in chemometric clustering techniques. Spatial information is taken into account in the initialization of clustering parameters, during cluster iterations by adjusting the similarity measure, or at a post-processing step. We illustrate the effect of taking spatial information into account with a univariate synthetic data set and two real-world multivariate data sets. We show that methods that include neighboring pixel information in the clustering procedure improve the accuracy of the clustering in most cases. Homogeneous regions in the image are better recognized and the amount of noise is reduced by these methods.


19.
Ligand-based shape matching approaches have become established as important and popular virtual screening (VS) techniques. However, despite their relative success, many authors have discussed how best to choose the initial query compounds and which of their conformations should be used. Furthermore, it is increasingly the case that pharmaceutical companies have multiple ligands for a given target and these may bind in different ways to the same pocket. Conversely, a given ligand can sometimes bind to multiple targets, and this is clearly of great importance when considering drug side-effects. We recently introduced the notion of spherical harmonic-based "consensus shapes" to help deal with these questions. Here, we apply a consensus shape clustering approach to the 40 protein-ligand targets in the DUD data set using PARASURF/PARAFIT. Results from clustering show that in some cases the ligands for a given target are split into two subgroups which could suggest they bind to different subsites of the same target. In other cases, our clustering approach sometimes groups together ligands from different targets, and this suggests that those ligands could bind to the same targets. Hence spherical harmonic-based clustering can rapidly give cross-docking information while avoiding the expense of performing all-against-all docking calculations. We also report on the effect of the query conformation on the performance of shape-based screening of the DUD data set and the potential gain in screening performance by using consensus shapes calculated in different ways. We provide details of our analysis of shape-based screening using both PARASURF/PARAFIT and ROCS, and we compare the results obtained with shape-based and conventional docking approaches using MSSH/SHEF and GOLD. The utility of each type of query is analyzed using commonly reported statistics such as enrichment factors (EF) and receiver-operator-characteristic (ROC) plots as well as other early performance metrics.  

20.
This poster illustrates the lecture on Pattern Recognition and gives recently published and unpublished examples, mainly from the laboratory of the first author. The applications concern:
  • the determination of metabolic pathways of branched chain fatty acids (by clustering),
  • the development of a genetic classification of meteorites (by clustering),
  • the classification of cholinergic agents according to their interaction with different receptors (by clustering),
  • the structure of a data set consisting of gas chromatographic profiles in samples collected in pollution monitoring stations (by factor analysis and pattern recognition),
  • factors determining GLC behaviour of solutes (by factor analysis and multiple regression),
  • the classification of olive oils according to geographic origin (by principal components and pattern recognition),
  • the diagnosis of thyroid status (by pattern recognition).
