Similar literature
 Found 20 similar articles; search took 15 ms
1.
DNA arrays have become the immediate choice for analyzing large-scale expression measurements. Understanding the expression patterns of genes provides functional information on newly identified genes by computational approaches. A gene expression pattern is an indicator of the state of the cell, and abnormal cellular states can be inferred by comparing expression profiles. Since co-regulated genes, and genes involved in a particular pathway, tend to show similar expression patterns, clustering expression patterns has become the natural method of choice for differentiating groups. However, most methods based on cluster analysis suffer from two common problems: (i) dead units and (ii) determining the correct number of clusters (k) needed to classify the data. Selecting k has been an open problem in pattern recognition and statistics for decades. Since clustering reveals similar patterns present in the data, fixing this number strongly influences the quality of the result. While there is no theoretical solution to this problem, the number of clusters can be decided by a heuristic clustering algorithm called rival penalized competitive learning (RPCL). We present a novel implementation of RPCL that transforms the problem of finding the correct number of clusters into the tractable problem of clustering based on the degree of similarity. This is biologically significant, since our implementation clusters functionally co-regulated genes and genes that present similar patterns of expression. This new approach reveals potential genes that are co-involved in a biological process. This implementation of the RPCL algorithm is useful in differentiating groups involved in concerted functional regulation and helps to progressively home in on closely similar patterns.
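
The winner/rival update that gives RPCL its name can be sketched as follows. This is a minimal illustration of the generic RPCL rule, not the implementation presented in the abstract above: the function name `rpcl`, the learning rates, the number of epochs, and the win-frequency weighting of the distance are all assumptions chosen for illustration.

```python
import random


def rpcl(points, n_units, alpha_win=0.05, alpha_rival=0.002, epochs=50, seed=0):
    """Generic rival penalized competitive learning (illustrative sketch).

    Requires 2 <= n_units <= len(points). All parameter values are assumptions.
    Returns the unit positions and how often each unit won.
    """
    rng = random.Random(seed)
    points = [tuple(p) for p in points]
    # Start each unit at a randomly chosen data point.
    centers = [list(p) for p in rng.sample(points, n_units)]
    wins = [1] * n_units  # win counts, used to handicap frequent winners

    for _ in range(epochs):
        order = list(points)
        rng.shuffle(order)
        for x in order:
            total = sum(wins)

            def cost(j):
                # Squared distance scaled by the unit's relative win frequency.
                d2 = sum((a - b) ** 2 for a, b in zip(x, centers[j]))
                return (wins[j] / total) * d2

            ranked = sorted(range(n_units), key=cost)
            winner, rival = ranked[0], ranked[1]
            for d in range(len(x)):
                # The winner is pulled toward the sample ...
                centers[winner][d] += alpha_win * (x[d] - centers[winner][d])
                # ... while the runner-up (the rival) is pushed slightly away.
                centers[rival][d] -= alpha_rival * (x[d] - centers[rival][d])
            wins[winner] += 1
    return centers, wins
```

In a sketch like this, one would start with more units than the expected number of clusters and treat units that rarely win, or that drift away from the data, as surplus. How to threshold that is a judgment call and is not specified by the abstract.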

2.
Single-cell RNA sequencing technologies have revolutionized biomedical research by providing an effective means to profile gene expression in individual cells. One of the first fundamental steps in the in-depth analysis of single-cell sequencing data is cell type classification and identification. Computational methods such as clustering algorithms have been used and are gaining popularity because they can save considerable resources and time for experimental validation. Although selecting the optimal features (i.e., genes) is essential for obtaining accurate and reliable single-cell clustering results, the computational complexity and the dropout events that introduce zero-inflated noise make this process very challenging. In this paper, we propose an effective single-cell clustering algorithm based on ensemble feature selection and similarity measurements. We first identify a set of potential features, then measure cell-to-cell similarity on subsets of those features obtained through multiple feature sampling approaches. We construct an ensemble network based on the cell-to-cell similarity. Finally, we apply a network-based clustering algorithm to obtain single-cell clusters. We evaluate the performance of the proposed algorithm through multiple assessments on real-world single-cell RNA sequencing datasets with known cell types. The results show that the proposed algorithm yields accurate and consistent single-cell clusters. Moreover, the proposed algorithm takes relative expression as input, so it can easily be adopted by existing analysis pipelines. The source code has been made publicly available at https://github.com/jeonglab/scCLUE.

3.
4.
Serial analysis of gene expression (SAGE) is a powerful tool for obtaining gene expression profiles. Clustering analysis is a valuable technique for analyzing SAGE data. In this paper, we propose an adaptive clustering method for SAGE data analysis, namely PoissonAPS. The method incorporates a novel clustering algorithm, Affinity Propagation (AP). While the AP algorithm has demonstrated good performance on many different data sets, it also faces several limitations. PoissonAPS overcomes the limitations of AP by using a clustering validation measure as the cost function for merging and splitting; as a result, it can automatically cluster SAGE data without user-specified parameters. We evaluated PoissonAPS and compared its performance with other methods on several real-life SAGE datasets. The experimental results show that PoissonAPS can produce meaningful and interpretable clusters for SAGE data.

5.
Multispectral images such as multispectral chemical images or multispectral satellite images provide detailed data with information in both the spatial and spectral domains. Many segmentation methods for multispectral images are based on a per-pixel classification, which uses only spectral information and ignores spatial information. A clustering algorithm based on both spectral and spatial information would produce better results.

In this work, spatial refinement clustering (SpaRef), a new clustering algorithm for multispectral images, is presented. Spatial information is integrated with partitional and agglomerative clustering processes. The number of clusters is identified automatically. SpaRef is compared with a set of well-known clustering methods on compact airborne spectrographic imager (CASI) data over an area in the Klompenwaard, The Netherlands. The clusters obtained show improved results. Applying SpaRef to multispectral chemical images would be a straightforward step.


6.
Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised classification algorithm that has been widely used in many areas owing to its simplicity and its ability to deal with noise and with hidden clusters of different sizes and shapes. However, the computational cost of the distance table and the instability in detecting the boundaries of adjacent clusters limit the application of the original algorithm to large datasets such as images. In this paper, the DBSCAN algorithm is revised and improved for image clustering and segmentation. The proposed clustering algorithm has two major advantages over the original one. First, by using the coordinate system of the image data, the revised DBSCAN algorithm becomes applicable to large 3D image datasets (often with millions of pixels). Second, the revised algorithm solves the instability of boundary detection in the original DBSCAN. For broader applications, the image dataset can be an ordinary 3D image or, more generally, a classification result of another type of image data, e.g., a multivariate image.

7.
We discuss the clustering of 234 environmental samples resulting from an extensive monitoring program concerning soil lead content, plant lead content, traffic density, and distance from the road at different sampling locations in former East Germany. Considering the structure of the data and the unsatisfactory results obtained with classical clustering and principal component analysis, it appeared evident that fuzzy clustering could be one of the best solutions. We used, in the following order, different fuzzy clustering algorithms: the fuzzy c-means (FCM) algorithm; the Gustafson–Kessel (GK) algorithm, which may detect clusters of ellipsoidal shape by introducing an adaptive distance norm for each cluster; and the fuzzy c-varieties (FCV) algorithm, which was developed for recognizing r-dimensional linear varieties (lines, planes, or hyperplanes) in high-dimensional data. Fuzzy clustering with a convex combination of point prototypes and different multidimensional linear prototypes is also discussed and applied for the first time in analytical chemistry (environmetrics). The results obtained in this study show the advantages of the FCV and GK algorithms over the FCM algorithm. The performance of each algorithm is illustrated by graphs and evaluated by the values of some conventional cluster validity indices. The values of the validity indices are in very good agreement with the quality of the clustering results.
Figure: Projection of all samples on the plane defined by the membership degrees to clusters A2 and A4, obtained using the fuzzy c-varieties (FCV) algorithm (expression of objective function and distance enclosed).
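
For orientation, the two alternating updates of standard fuzzy c-means can be sketched as below. This is the textbook FCM rule with Euclidean distance only. It does not reproduce the GK adaptive norm or the FCV linear prototypes discussed above, and the function names and the fuzzifier default `m=2.0` are assumptions made for illustration.

```python
def fcm_memberships(points, centers, m=2.0):
    """Membership of each point in each cluster (each row sums to 1)."""
    power = 1.0 / (m - 1.0)
    rows = []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        if any(v == 0.0 for v in d2):
            # A point sitting exactly on one or more centers belongs only to them.
            hits = [v == 0.0 for v in d2]
            rows.append([1.0 / sum(hits) if h else 0.0 for h in hits])
            continue
        # u_i is proportional to (1 / d_i^2)^(1 / (m - 1)).
        inv = [(1.0 / v) ** power for v in d2]
        total = sum(inv)
        rows.append([w / total for w in inv])
    return rows


def fcm_centers(points, memberships, m=2.0):
    """Cluster centers as membership-weighted means of the points.

    Assumes every cluster has at least one point with nonzero membership.
    """
    dim = len(points[0])
    centers = []
    for i in range(len(memberships[0])):
        weights = [row[i] ** m for row in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * x[d] for w, x in zip(weights, points)) / total for d in range(dim)]
        )
    return centers
```

Alternating these two functions until the memberships stop changing gives the basic FCM iteration. For example, with `m=2.0` a point at distance 1 from one center and distance 2 from another receives memberships 0.8 and 0.2.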

8.
In this paper, the performance of new clustering methods such as Neural Gas (NG) and Growing Neural Gas (GNG) is compared with that of the K-means method on real and simulated data sets. Moreover, a new algorithm called growing K-means (GK) is introduced as an alternative to Neural Gas and Growing Neural Gas. It has small input requirements and is conceptually very simple. GK leads to nearly optimal values of the cost function and, contrary to K-means, is independent of the initial partition of the data set. The incremental property of GK additionally helps to estimate the number of "natural" clusters in the data, i.e., the well-separated groups of objects in the data space.

9.
Nanoscale atomic clusters in atom probe tomographic data are not universally defined but instead are characterized by the clustering algorithm used and the parameter values controlling the algorithmic process. A new core-linkage clustering algorithm is developed, combining fundamental elements of the conventional maximum separation method with density-based analyses. A key improvement of the algorithm is the independence of algorithmic parameters that were inherently unified in previous techniques, enabling a more accurate analysis to be applied across a wider range of material systems. Further, an objective procedure for selecting parameters, based on approximating the data with a model of complete spatial randomness, is developed and applied. The use of higher nearest neighbor distributions is highlighted to give insight into the nature of the clustering phenomena present in a system and to generalize the clustering algorithms used to analyze it. Maximum separation, density-based scanning, and the core-linkage algorithm developed in this study were separately applied to the investigation of fine clustering of solute atoms in an Al-1.9Zn-1.7Mg (at.%) alloy at two distinct states of early phase decomposition, and the results of these analyses were evaluated.

10.
Drug–target interaction (DTI) prediction is a challenging step in drug repositioning, drug discovery and drug design. The advent of high-throughput technologies has facilitated the development of DTI prediction methods. With the generation of a large number of data sets, many mathematical models and computational algorithms have been developed to identify potential drug–target pairs. However, most existing methods are based on single-view data. By integrating drug and target data from different views, we aim to obtain more stable and accurate prediction results. In this paper, a multiview DTI prediction method based on clustering is proposed. We first introduce a model for single-view drug–target data. The model is formulated as an optimization problem that aims to identify the clusters in both the drug similarity network and the target protein similarity network, while at the same time connecting clusters that share more known DTIs. The model is then extended to multiview network data by maximizing the consistency of the clusters in each view. An approximation method is proposed to solve the optimization problem. We apply the proposed algorithms to two views of data. Comparisons with some existing algorithms show that the multiview DTI prediction algorithm can produce more accurate predictions. For the data set considered, we finally predict 54 possible DTIs. Similarity analysis of the drugs and targets, together with enrichment analysis of the DTIs and genes in each cluster, shows that the predicted DTIs have a high probability of being true.

11.
An ant colony approach for clustering   Total citations: 2, self-citations: 0, citations by others: 2
This paper presents an ant colony optimization methodology for optimally clustering N objects into K clusters. The algorithm employs distributed agents that mimic the way real ants find the shortest path from their nest to a food source and back. The algorithm has been implemented and tested on several simulated and real datasets. Its performance is compared with that of other popular stochastic/heuristic methods, viz. genetic algorithms, simulated annealing and tabu search. Our computational simulations reveal very encouraging results in terms of the quality of the solution found, the average number of function evaluations and the processing time required.

12.
Four different two-dimensional fingerprint types (MACCS, Unity, BCI, and Daylight) and nine methods of selecting optimal cluster levels from the output of a hierarchical clustering algorithm were evaluated for their ability to select clusters that represent chemical series present in some typical examples of chemical compound data sets. The methods were evaluated using Ward's clustering algorithm on subsets of the publicly available National Cancer Institute HIV data set, as well as on compounds from our corporate data set. We make a number of observations and recommendations about the choice of fingerprint type and cluster level selection method for use in this type of clustering.

13.
Airborne particulate matter is an important component of atmospheric pollution, affecting human health, climate, and visibility. Modern instruments allow single particles to be analyzed one by one in real time, and offer the promise of determining the sources of individual particles from their mass spectral signatures. The large number of particles to be apportioned makes clustering a necessary step. The goal of this study is to use mass spectral data to compare the accuracy and speed of several clustering algorithms: ART-2a, several variants of hierarchical clustering, and K-means. Repeated simulations with various algorithms and different levels of data preprocessing suggest that hierarchical clustering methods using derivatives of Ward's algorithm discriminate sources with fewer errors than ART-2a, which itself discriminates much better than point-wise hierarchical clustering methods. In most cases, K-means algorithms do almost as well as the best hierarchical clustering. These efficient algorithms (clustering derived from Ward's algorithm, ART-2a and K-means) are most accurate when the relative peak areas have been pre-scaled by taking the square root. Analysis times vary within a factor of 30, and when accuracy above 95% is required, run times scale as the square of the number of particles. Algorithms derived from Ward's remain the most accurate under a wide range of conditions and, conversely, for equal accuracy can deliver a shorter list of clusters, allowing faster and perhaps on-the-fly classification.
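
The square-root pre-scaling mentioned above might look like the following sketch. It assumes that "relative peak areas" means each spectrum is first normalized so its areas sum to 1, and it does not renormalize after taking the root; both choices, and the function name `sqrt_prescale`, are assumptions rather than details given in the abstract.

```python
import math


def sqrt_prescale(peak_areas):
    """Convert raw peak areas to relative areas, then take the square root.

    The square root compresses dominant peaks so that smaller peaks carry
    more weight in the distance computations of a clustering algorithm.
    """
    total = sum(peak_areas)
    if total <= 0:
        raise ValueError("spectrum has no signal")
    return [math.sqrt(area / total) for area in peak_areas]
```

For a spectrum with areas `[4.0, 0.0, 12.0]`, the relative areas are 0.25, 0 and 0.75, and the scaled values are 0.5, 0 and about 0.866. The ratio between the two nonzero peaks drops from 3 to about 1.73.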

14.
Recently we proposed a new variable selection algorithm for classification problems, based on the clustering of variables concept (CLoVA). Here the same concept is applied to a regression problem, and the results are compared with conventional variable selection strategies for PLS. The basic idea behind the clustering of variables is that the instrument channels are grouped into different clusters via clustering algorithms. The spectral data of each cluster are then subjected to PLS regression. Different real data sets (Cargill corn, Biscuit dough, ACE QSAR, Soy, and Tablet) were used to evaluate the influence of the clustering of variables on the prediction performance of PLS. In almost all cases, the statistical parameters, especially the prediction error, show the superiority of CLoVA-PLS over other variable selection strategies. Finally, synergy clustering of variables (sCLoVA-PLS), which uses combinations of clusters, is proposed as an efficient modification of the CLoVA algorithm. The statistical parameters obtained indicate that variable clustering can separate the useful parts of the data from the redundant ones, so that a stable model can be built from the informative clusters.

15.
This paper compares the performance of two clustering methods, DPClus graph clustering and hierarchical clustering, for classifying volatile organic compounds (VOCs) using a fingerprint-based similarity measure between chemical structures. The clustering results from each method were compared to determine the degree of cluster overlap and how well each method classified the chemical structures of VOCs into clusters. We also point out the advantages and limitations of both clustering methods. In conclusion, a chemical similarity measure can be used to predict the biological activities of a compound, and this can be applied in the medical, pharmaceutical and agrotechnology fields.

16.
This paper describes an efficient algorithm, based on a new concept called the gene team, for detecting conserved gene clusters among an arbitrary number of chromosomes. Within the clusters, neither the order of the genes nor their orientation need be conserved. In addition, insertions of foreign genes within the clusters are permitted to a user-defined extent. The algorithm has been implemented in the publicly available TEAM software, which proves to be an efficient tool for systematic searches of conserved gene clusters. Examples of actual biological results are provided. The software is downloadable from http://www-igm.univ-mlv.fr/~raffinot/geneteam.html.

17.
This paper describes the first application of fuzzy c-means clustering to the selection of representatives from assemblies of conformations or alignments. In the case of alignments, their quality is taken into account using a weighted c-means scheme developed in this work. The performance of fuzzy cluster validity measures, such as compactness, partition function, and entropy, is studied on several examples, but the visual 3D representation of data points is shown to be most beneficial in determining the optimum number of clusters. Fuzzy clustering is expected to perform better than crisp clustering methods in cases where there is a significant number of "outliers", such as in molecular dynamics simulations and molecular alignments.

18.
PK-means: A new algorithm for gene clustering   Total citations: 3, self-citations: 0, citations by others: 3
Microarray technology has been widely used to measure the expression levels of thousands of genes simultaneously. Gene cluster analysis is useful for discovering gene function, because co-expressed genes are likely to share the same biological function. K-means is one of the well-known clustering methods. However, it is sensitive to the choice of the initial clustering and easily becomes trapped in a local minimum. The particle-pair optimizer (PPO) is a variation on the traditional particle swarm optimization (PSO) algorithm, a stochastic particle-pair-based optimization technique that can be applied to a wide range of problems. In this paper we bridge PPO and K-means within the PK-means algorithm for the first time. Our results indicate that PK-means clustering is generally more accurate than K-means and Fuzzy K-means (FKM). PK-means is also more robust, because it is less sensitive to the initial randomly selected cluster centroids. Finally, our algorithm outperforms these methods with a fast convergence rate and a low computational load.

19.
Clustering methods have been widely used to group together similar conformational states from molecular simulations of biomolecules in solution. For applications such as the interaction of a protein with a surface, the orientation of the protein relative to the surface is also an important clustering parameter because of its potential effect on adsorbed‐state bioactivity. This study presents cluster analysis methods that are specifically designed for systems where both molecular orientation and conformation are important, and the methods are demonstrated using test cases of adsorbed proteins for validation. Additionally, because cluster analysis can be a very subjective process, an objective procedure for identifying both the optimal number of clusters and the best clustering algorithm to be applied to analyze a given dataset is presented. The method is demonstrated for several agglomerative hierarchical clustering algorithms used in conjunction with three cluster validation techniques. © 2016 Wiley Periodicals, Inc.

20.
A hierarchical clustering algorithm, NIPALSTREE, was developed that is able to analyze large data sets in high-dimensional space. The result can be displayed as a dendrogram. At each tree level the algorithm projects a data set via principal component analysis onto one dimension. The data set is sorted according to this one dimension and split at the median position. To avoid distortion of clusters at the median position, the algorithm identifies a potentially more suitable split point to the left or right of the median. The procedure is recursively applied to the resulting subsets until the maximal distance between cluster members exceeds a user-defined threshold. The approach was validated in a retrospective screening study for angiotensin converting enzyme (ACE) inhibitors. The resulting clusters were assessed for their purity and enrichment in actives belonging to this ligand class. Enrichment was observed in individual branches of the dendrogram. In further retrospective virtual screening studies employing the MDL Drug Data Report (MDDR), COBRA, and the SPECS catalog, NIPALSTREE was compared with the hierarchical k-means clustering approach. Results show that both algorithms can be used in the context of virtual screening. Intersecting the result lists obtained with both algorithms improved enrichment factors while losing only a few chemotypes.
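
The project, sort and split recursion described above can be sketched as follows. This is a simplified reading, not the NIPALSTREE implementation: it always splits exactly at the median (the search for a better split point near the median is omitted), it returns the leaf clusters rather than a dendrogram, it uses a plain power iteration for the first principal component, and it assumes that a subset is split further only while its maximal pairwise distance is above the threshold. The function names are hypothetical.

```python
import math


def first_pc_scores(rows, iters=50):
    """Scores of mean-centered rows on the first principal component."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dim)]
    centered = [[r[d] - mean[d] for d in range(dim)] for r in rows]

    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0.0 else None

    # Start from the centered row with the largest norm (an assumed choice).
    w = unit(max(centered, key=lambda c: sum(x * x for x in c)))
    if w is None:  # all rows identical
        return [0.0] * n
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(c, w)) for c in centered]
        nxt = unit([sum(s * c[d] for s, c in zip(scores, centered)) for d in range(dim)])
        if nxt is None:
            break
        w = nxt
    return [sum(a * b for a, b in zip(c, w)) for c in centered]


def max_pairwise_distance(rows):
    """Largest Euclidean distance between any two rows (quadratic cost)."""
    return max(
        (math.dist(a, b) for i, a in enumerate(rows) for b in rows[i + 1:]),
        default=0.0,
    )


def split_tree(rows, threshold):
    """Recursively halve the data along its first principal component."""
    if len(rows) < 2 or max_pairwise_distance(rows) <= threshold:
        return [rows]
    scores = first_pc_scores(rows)
    order = sorted(range(len(rows)), key=scores.__getitem__)
    mid = len(rows) // 2  # median position; both halves are non-empty
    left = [rows[i] for i in order[:mid]]
    right = [rows[i] for i in order[mid:]]
    return split_tree(left, threshold) + split_tree(right, threshold)
```

Calling `split_tree` on the four points `(0.0, 0.0)`, `(0.1, 0.0)`, `(10.0, 0.0)` and `(10.1, 0.0)` with a threshold of 1.0 separates the two points near the origin from the two near x = 10. The quadratic distance check keeps the sketch short, but it would need replacing for the large data sets the abstract targets.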
