Similar articles
 20 similar articles found
1.
Analytical Letters, 2012, 45(17): 2727-2738
The K-means algorithm has some limitations including dead-unit properties, heavy dependence on the initial choice of cluster centers, convergence to local optima, and sensitivity to the number of clusters. This paper presents an efficient algorithm that optimizes K-means clustering by a hybrid particle swarm algorithm. The modified discrete algorithm is used to select variables and is continuously applied to update cluster centers simultaneously. The nearest center classification is then employed to classify the test samples. The proposed algorithm was applied to discriminate various edible oil varieties by employing Fourier transform infrared spectroscopy. As a comparison, the common K-means clustering, principal component analysis, and partial least squares techniques were also applied to classify these edible oil samples. Results demonstrated that the proposed method is an accurate and rapid strategy for identifying edible oils.

2.
In this paper, the performance of new clustering methods such as Neural Gas (NG) and Growing Neural Gas (GNG) is compared with the K-means method for real and simulated data sets. Moreover, a new algorithm called growing K-means, GK, is introduced as the alternative to Neural Gas and Growing Neural Gas. It has small input requirements and is conceptually very simple. The GK leads to nearly optimal values of the cost function, and, contrary to K-means, it is independent of the initial data set partition. The incremental property of GK additionally helps to estimate the number of "natural" clusters in data, i.e., the well-separated groups of objects in the data space.

3.
Serial analysis of gene expression (SAGE) is a powerful tool to obtain gene expression profiles. Clustering analysis is a valuable technique for analyzing SAGE data. In this paper, we propose an adaptive clustering method for SAGE data analysis, namely, PoissonAPS. The method incorporates a novel clustering algorithm, Affinity Propagation (AP). While the AP algorithm has demonstrated good performance on many different data sets, it also faces several limitations. PoissonAPS overcomes the limitations of AP by using the clustering validation measure as a cost function for merging and splitting; as a result, it can automatically cluster SAGE data without user-specified parameters. We evaluated PoissonAPS and compared its performance with other methods on several real-life SAGE datasets. The experimental results show that PoissonAPS can produce meaningful and interpretable clusters for SAGE data.

4.
5.
Performing cluster analysis on molecular conformations is an important way to find the representative conformations in molecular dynamics trajectories. Usually, it is a critical step for interpreting complex conformational changes or interaction mechanisms. As one of the density-based clustering algorithms, find density peaks (FDP) is an accurate and reasonable candidate for molecular conformation clustering. However, as simulation lengths grow rapidly with increasing computing power, the low computing efficiency of FDP limits its application potential. Here we propose a marginal extension to FDP named K-means find density peaks (KFDP) to solve the problem of heavy resource consumption. In KFDP, the points are initially clustered by a high-efficiency clustering algorithm, such as K-means. Cluster centers are defined as typical points with a weight which represents the cluster size. Then, the weighted typical points are clustered again by FDP, and are then refined as core, boundary, and redefined halo points. In this way, KFDP has accuracy comparable to that of FDP, but its computational complexity is reduced from $O(n^2)$ to $O(n)$. We apply and test our KFDP method on the trajectory data of multiple small proteins in terms of torsion angle, secondary structure or contact map. Comparison with K-means and density-based spatial clustering of applications with noise validates the proposed KFDP.

6.
Accelerated K-means clustering in metric spaces   Total citations: 1 (self-citations: 0, citations by others: 1)
The K-means method is a popular technique for clustering data into k-partitions. In the adaptive form of the algorithm, Lloyd's method, an iterative procedure alternately assigns cluster membership based on a set of centroids and then redefines the centroids based on the computed cluster membership. The most time-consuming part of this algorithm is determining which of the points being clustered belong to which cluster center. This paper discusses the use of the vantage-point tree as a method of more quickly assigning cluster membership when the points being clustered belong to intrinsically low- and medium-dimensional metric spaces. Results will be discussed from simulated data sets and real-world data in the clustering of molecular databases based upon physicochemical properties. Comparisons will be made to a highly optimized brute-force implementation of Lloyd's method and to other pruning strategies.
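The alternating assign-and-update loop described in this abstract can be sketched as follows. This is a minimal brute-force illustration of Lloyd's method, not the accelerated vantage-point-tree variant the paper discusses; the function name `lloyd_kmeans`, the random initialisation, and the handling of empty clusters are assumptions made for the example.

```python
import math
import random


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups with Lloyd's iteration."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k input points sampled at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        # This brute-force search is the costly part the abstract refers to.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so the iteration has converged.
        centroids = new_centroids
    return centroids, labels
```

The assignment step costs one distance evaluation per point per centroid on every pass, which is why pruning structures such as the vantage-point tree target that step.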

7.
A robust method was developed to cluster similar NMR spectra from partially purified extracts obtained from a range of marine sponges and a plant biota. The NMR data were acquired using microtiter plate NMR (VAST) in protonated solvents. A sample data set which contained several clusters was used to optimize the protocol. The evaluation of the robustness was performed using three different clustering methods: tree clustering analysis, K-means clustering and multidimensional scaling. These methods were compared for consistency using the sample data set and the optimized methodology was applied to clustering of a set of spectra from partially purified biota extracts.

8.
Clustering analysis of data from DNA microarray hybridization studies is an essential task for identifying biologically relevant groups of genes. The attribute cluster algorithm (ACA) has provided an attractive way to group and select meaningful genes. However, ACA needs much prior knowledge about the genes to set the number of clusters. In practical applications, if the number of clusters is misspecified, the performance of the ACA will deteriorate rapidly. We propose the Cooperative Competition Cluster Algorithm (CCCA) in this paper. In the algorithm, we assume that both cooperation and competition exist simultaneously between clusters in the process of clustering. By using this principle of Cooperative Competition, the number of clusters can be found in the process of clustering. Experimental results on synthetic and gene expression data are presented. The results show that CCCA can choose the number of clusters automatically and achieves excellent performance compared with other competing methods.

9.
Background: In psoriasis skin disease, psoriatic cells develop more rapidly than normal healthy cells. This rapid growth causes accumulation of dead skin cells on the skin's surface, resulting in thick patches of red, dry, and itchy skin. These patches, or psoriatic skin lesions, may exhibit characteristics similar to those of healthy skin, which makes lesion detection more challenging. However, for accurate disease diagnosis and severity detection, lesion segmentation is of prime importance. In that context, our group had previously performed psoriasis lesion segmentation using the conventional clustering algorithm. However, it suffers from the constraint of falling into local sub-optimal centroids of the clusters. Objective: The main objective of this paper is to implement an optimal lesion segmentation technique that aims at global convergence by reducing the probability of being trapped in local optima. This has been achieved by integrating swarm intelligence based algorithms with the conventional K-means and Fuzzy C-means (FCM) clustering algorithms. Methodology: A total of eight different suitable combinations of conventional clustering (i.e., K-means and Fuzzy C-means (FCM)) and four swarm intelligence (SI) techniques (i.e., seeker optimization (SO), artificial bee colony (ABC), ant colony optimization (ACO) and particle swarm optimization (PSO)) have been implemented in this study. The experiments are performed on a dataset of 780 psoriasis images from 74 patients collected at Psoriasis Clinic and Research Centre, Psoriatreat, Pune, Maharashtra, India. In this study, we employ swarm intelligence optimization techniques in combination with the conventional clustering algorithms to increase the probability of convergence to the optimal global solution and hence improve clustering and detection. Results: The performance has been quantified in terms of four indices, namely accuracy (A), sensitivity (SN), specificity (SP), and Jaccard index (JI). Among the eight different combinations of clustering and optimization techniques considered in this study, FCM + SO performed best, with mean JI = 0.83, mean A = 90.89, mean SN = 92.84, and mean SP = 88.27. FCM + SO was found to be statistically significant compared with the other approaches, with a reliability index of 96.67%. Conclusion: The results obtained reflect the superiority of the proposed techniques over conventional clustering techniques. Hence our research development will lead to an objective analysis for automatic, accurate, and quick diagnosis of psoriasis.
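The four indices named in the results can be computed from a predicted lesion mask and a reference mask. The sketch below uses the standard confusion-matrix definitions, which is an assumption because the abstract does not spell out its formulas. The function name `segmentation_scores` and the flat 0/1 mask representation are hypothetical. Scores are returned as fractions, whereas the abstract reports A, SN, and SP on a percentage scale.

```python
def segmentation_scores(predicted, truth):
    """Score a binary mask against a reference mask (flat sequences of 0/1).

    Assumes both masks have the same length and that the reference contains
    at least one lesion pixel and at least one background pixel.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)          # lesion found
    tn = sum(1 for p, t in pairs if not p and not t)  # background kept
    fp = sum(1 for p, t in pairs if p and not t)      # false alarm
    fn = sum(1 for p, t in pairs if not p and t)      # lesion missed
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "jaccard": tp / (tp + fp + fn),
    }
```

The Jaccard index ignores true negatives, so it is not inflated by large background regions the way accuracy can be.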

10.
We discuss the clustering of 234 environmental samples resulting from an extensive monitoring program concerning soil lead content, plant lead content, traffic density, and distance from the road at different sampling locations in former East Germany. Considering the structure of data and the unsatisfactory results obtained applying classical clustering and principal component analysis, it appeared evident that fuzzy clustering could be one of the best solutions. In the following order we used different fuzzy clustering algorithms, namely, the fuzzy c-means (FCM) algorithm, the Gustafson–Kessel (GK) algorithm, which may detect clusters of ellipsoidal shapes in data by introducing an adaptive distance norm for each cluster, and the fuzzy c-varieties (FCV) algorithm, which was developed for recognition of r-dimensional linear varieties in high-dimensional data (lines, planes or hyperplanes). Fuzzy clustering with convex combination of point prototypes and different multidimensional linear prototypes is also discussed and applied for the first time in analytical chemistry (environmetrics). The results obtained in this study show the advantages of the FCV and GK algorithms over the FCM algorithm. The performance of each algorithm is illustrated by graphs and evaluated by the values of some conventional cluster validity indices. The values of the validity indices are in very good agreement with the quality of the clustering results.
Figure: Projection of all samples on the plane defined by the membership degrees to clusters A2 and A4, obtained using the fuzzy c-varieties (FCV) algorithm (expression of objective function and distance enclosed)

11.
Recently we proposed a new variable selection algorithm for classification problems, based on the clustering of variables concept (CLoVA). Here the same idea is applied to a regression problem, and the results are compared with conventional variable selection strategies for PLS. The basic idea behind the clustering of variables is that the instrument channels are grouped into clusters by a clustering algorithm; the spectral data of each cluster are then subjected to PLS regression. Different real data sets (Cargill corn, Biscuit dough, ACE QSAR, Soy, and Tablet) were used to evaluate the influence of the clustering of variables on the prediction performance of PLS. In almost all cases, the statistical parameters, especially the prediction error, show the superiority of CLoVA-PLS over the other variable selection strategies. Finally, synergy clustering of variables (sCLoVA-PLS), which uses a combination of clusters, is proposed as an efficient modification of the CLoVA algorithm. The statistical parameters obtained indicate that variable clustering can separate the useful part of the data from the redundant part, so that a stable model can be built on the informative clusters.

12.
Plant polyphenol oxidases (PPOs) are ubiquitous plastid-localized enzymes. A precise analysis of PPO function in plants has been complicated by the presence of several family members with immunological cross reactivity. Previously we reported the isolation of genomic clones coding for the seven members of the tomato (Solanum lycopersicum) PPO family (A, A', B, C, D, E, and F). Here we report the complex spatial and temporal expression of one of the members, PPO B. The PPO B promoter was sequenced and subjected to homology analysis. Sequence similarities were found to nucleotide sequences of genes encoding enzymes/proteins active in the following systems: phenylpropanoid biosynthesis, signal transduction and responsiveness to hormones and stresses, fruit and seed proteins/enzymes, and photosynthesis. Chimeric gene fusions were constructed linking PPO B 5' flanking regions to the reporter gene, β-glucuronidase (GUS). The resultant transgenic plants were histochemically analyzed for GUS activity in various vegetative and reproductive tissues, and evaluated for PPO B responsiveness to ethylene induction. It was shown that PPO B expression was tissue specific, developmentally regulated, ethylene induced, and localized predominantly to mitotic or apoptotic tissues.

13.
DNA arrays have become the immediate choice in the analysis of large-scale expression measurements. Understanding the expression pattern of genes provides functional information on newly identified genes by computational approaches. The gene expression pattern is an indicator of the state of the cell, and abnormal cellular states can be inferred by comparing expression profiles. Since co-regulated genes, and genes involved in a particular pathway, tend to show similar expression patterns, clustering expression patterns has become the natural method of choice to differentiate groups. However, most methods based on cluster analysis suffer from the usual problems: (i) dead units, and (ii) the problem of determining the correct number of clusters (k) needed to classify the data. Selecting k has been an open problem of pattern recognition and statistics for decades. Since clustering reveals similar patterns present in the data, fixing this number strongly influences the quality of the result. While there is no theoretical solution to this problem, the number of clusters can be decided by a heuristic clustering algorithm called rival penalized competitive learning (RPCL). We present a novel implementation of RPCL that transforms the problem of finding the correct number of clusters into the tractable problem of clustering based on the degree of similarity. This is biologically significant since our implementation clusters functionally co-regulated genes and genes that present similar patterns of expression. This new approach reveals potential genes that are co-involved in a biological process. This implementation of the RPCL algorithm is useful in differentiating groups involved in concerted functional regulation and helps to progressively home in on patterns that are closely similar.

14.
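A feature variable selection method based on disjoint principal component analysis (Disjoint PCA) and a genetic algorithm (GA) was established and used to identify differentially expressed genes from gene expression profile data. In this method, disjoint PCA is used to assess the ability of a group of genes to discriminate between two different classes of samples; the GA is used to search for the gene group with the strongest discriminating ability; and the chance correlation of the identified genes is assessed by statistical methods. Because the method takes into account the synergy between genes, which is closer to the biological processes of genes, the identified genes show better differential expression. The method was applied to the analysis of gene chip data from hepatocellular carcinoma (HCC) samples. The results show that the identified genes have strong discriminating ability, superior to that of the commonly used significance analysis of microarrays (SAM) method.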

15.
This paper introduces a kernel-based fuzzy clustering approach to deal with non-linearly separable problems by applying kernel Radial Basis Functions (RBF), which map the input data space non-linearly into a high-dimensional feature space. Discovering clusters in high-dimensional genomics data is extremely challenging for bioinformatics researchers in genome analysis. To support investigations in bioinformatics, explicitly on genomic clustering, we propose high-dimensional kernelized fuzzy clustering algorithms based on the Apache Spark framework for clustering of Single Nucleotide Polymorphism (SNP) sequences. The paper proposes the Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM), which inherently uses another proposed algorithm, the Kernelized Scalable Literal Fuzzy c-Means (KSLFCM) clustering algorithm. Both approaches completely adapt to the Apache Spark cluster framework through a localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we propose a preprocessing approach for generating numeric feature vectors for huge SNP sequences, made scalable by executing it on an Apache Spark cluster, which is applied to real-world SNP datasets taken from open-internet repositories of two different plant species, i.e., soybean and rice. The comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows a significant improvement of the proposed algorithm in terms of time and space complexity, Silhouette index, and Davies-Bouldin index. Exhaustive experiments are performed on various SNP datasets to show the effectiveness of the proposed KSRSIO-FCM in comparison with the proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM and SLFCM.
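The kernel mapping mentioned in this abstract never needs the feature-space coordinates explicitly, because distances there can be written with kernel values alone. The sketch below assumes the Gaussian RBF form K(x, v) = exp(-||x - v||^2 / (2 sigma^2)); the abstract does not state the exact parametrisation, and the names `rbf_kernel`, `kernel_distance_sq`, and `sigma` are chosen for illustration only.

```python
import math


def rbf_kernel(x, v, sigma=1.0):
    """Gaussian RBF kernel between two equal-length coordinate sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_distance_sq(x, v, sigma=1.0):
    """Squared distance between x and v after the implicit RBF mapping.

    ||phi(x) - phi(v)||^2 = K(x, x) + K(v, v) - 2 K(x, v), and K(x, x) = 1
    for the RBF kernel, so the expression reduces to 2 * (1 - K(x, v)).
    """
    return 2.0 * (1.0 - rbf_kernel(x, v, sigma))
```

A kernelized fuzzy c-means can substitute this quantity for the Euclidean distance in its membership update. The value is bounded above by 2, so distant outliers have limited influence.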

16.
Bonnier F, Byrne HJ. The Analyst, 2012, 137(2): 322-332
K-means clustering followed by Principal Component Analysis (PCA) is employed to analyse Raman spectroscopic maps of single biological cells. K-means clustering successfully identifies regions of cellular cytoplasm, nucleus and nucleoli, but the mean spectra do not differentiate their biochemical composition. The loadings of the principal components identified by PCA shed further light on the spectral basis for differentiation but they are complex and, as the number of spectra per cluster is imbalanced, particularly in the case of the nucleoli, the loadings under-represent the basis for differentiation of some cellular regions. Analysis of pure bio-molecules, both structurally and spectrally distinct, in the case of histone, ceramide and RNA, and similarly in the case of the proteins albumin, collagen and histone, shows the relatively strong representation of spectrally sharp features in the spectral loadings, and the systematic variation of the loadings as one cluster becomes reduced in number. The more complex cellular environment is simulated by weighted sums of spectra, illustrating that although the loadings become increasingly complex, their origin in a weighted sum of the constituent molecular components is still evident. Returning to the cellular analysis, the number of spectra per cluster is artificially balanced by increasing the weighting of the spectra of the smaller clusters. While this renders the PCA loadings more complex for the three-way analysis, a pairwise analysis illustrates clear differences between the identified subcellular regions, and notably the molecular differences between nuclear and nucleoli regions are elucidated. Overall, the study demonstrates how appropriate consideration of the data available can improve the understanding of the information delivered by PCA.

17.
A spectral clustering method is presented and applied to two-dimensional molecular structures, where it has been found particularly useful in the analysis of screening data. The method provides a means to quantify (1) the degree of intermolecular similarity within a cluster and (2) the contribution that the features of a molecule make to a cluster. In an application of the spectral clustering method to an example data set of 125 COX-2 inhibitors, these two criteria were used to place the molecules into clusters of chemically related two-dimensional structures.

18.
Clustering of gene expression data collected across time is receiving growing attention in the biological literature since time-course experiments allow one to understand dynamic biological processes and identify genes governed by the same processes. It is believed that genes demonstrating similar expression profiles over time might give an informative insight into how underlying biological mechanisms work. In this paper, we propose a method based on functional data analysis (FNDA) to cluster time-dependent gene expression profiles. Consideration of clustering problems using the FNDA setting provides ways to take time dependency into account by using basis function expansion to describe the partially observed curves. We also discuss how to choose the number of bases in the basis function expansion in FNDA. A synthetic cycle data set and a real data set are used to demonstrate the proposed method, and the proposed and existing approaches are compared using the adjusted Rand index.
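The adjusted Rand index used for the comparison in this abstract measures agreement between two partitions of the same items, corrected for chance. The sketch below follows the usual pair-counting definition (Hubert and Arabie); the function name `adjusted_rand_index` and the convention of returning 1.0 in the degenerate case are assumptions for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items.

    Assumes both sequences have the same length, with at least two items.
    """
    n = len(labels_a)
    # Pairs of items placed together in both partitions.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs of items placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. every item in one cluster in both partitions.
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

The index is 1 for identical partitions regardless of how the clusters are named, is close to 0 for random agreement, and can be negative when agreement is worse than chance.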

19.
The fuzzy C‐means (FCM) algorithm does not fully utilize the spatial information for image segmentation and is sensitive to the presence of noise and intensity inhomogeneity in magnetic resonance imaging (MRI) images. The underlying reason is that using a single fuzzy membership function the FCM algorithm cannot properly represent pattern associations to all clusters. In this paper, we present a modified FCM (mFCM) algorithm by incorporating scale control spatial information for segmentation of MRI images in the presence of high levels of noise and intensity inhomogeneity. The algorithm utilizes scale controlled spatial information from the neighbourhood of each pixel under consideration in the form of a probability function. Using this probability function, a local membership function is introduced for each pixel. Finally, new clustering centre and weighted joint membership functions are introduced based on the local membership and global membership functions. The resulting mFCM algorithm is robust to the noise and intensity inhomogeneity in MRI image data and thereby improves the segmentation results. The experimental results on a synthetic image, four volumes of simulated and one volume of real‐patient MRI brain images show that the mFCM algorithm outperforms k‐means, FCM and some other recently proposed FCM‐based algorithms for image segmentation in terms of qualitative and quantitative studies such as cluster validity functions, segmentation accuracy and tissue segmentation accuracy. Copyright © 2015 John Wiley & Sons, Ltd.
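The single membership function that this abstract identifies as the weak point of FCM is the standard update below, which depends only on each point's distances to the cluster centres. This is a sketch of conventional FCM, not of the mFCM variant, whose local and joint membership functions are not specified in the abstract. The function name `fcm_memberships` and the default fuzzifier `m=2.0` are assumptions for the example.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Return u[i][j], the membership of point i in cluster j (standard FCM).

    u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)), with fuzzifier m > 1.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point lying exactly on one or more centres is shared
            # equally among those centres and excluded from the rest.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            row = [h / sum(hits) for h in hits]
        else:
            row = [
                1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                for d_j in dists
            ]
        memberships.append(row)
    return memberships
```

Each pixel's memberships are computed without reference to its neighbours, which is consistent with the abstract's point that plain FCM does not use spatial information and is sensitive to noise.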

20.
Gene expression data are characterized by thousands or even tens of thousands of measured genes on only a few tissue samples. This can lead either to possible overfitting and the curse of dimensionality or even to a complete failure in the analysis of microarray data. Gene selection is an important component of gene expression-based tumor classification systems. In this paper, we develop a hybrid particle swarm optimization (PSO) and tabu search (HPSOTS) approach for gene selection for tumor classification. The incorporation of tabu search (TS) as a local improvement procedure enables the HPSOTS algorithm to escape local optima and show satisfactory performance. The proposed approach is applied to three different microarray data sets. Moreover, we compare the performance of HPSOTS on these data sets to that of stepwise selection and of the pure TS and PSO algorithms. It has been demonstrated that HPSOTS is a useful tool for gene selection and for mining high-dimensional data.

