首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Clustering is a popular data analysis and data mining technique. Since clustering problem have NP-complete nature, the larger the size of the problem, the harder to find the optimal solution and furthermore, the longer to reach a reasonable results. A popular technique for clustering is based on K-means such that the data is partitioned into K clusters. In this method, the number of clusters is predefined and the technique is highly dependent on the initial identification of elements that represent the clusters well. A large area of research in clustering has focused on improving the clustering process such that the clusters are not dependent on the initial identification of cluster representation. Another problem about clustering is local minimum problem. Although studies like K-Harmonic means clustering solves the initialization problem trapping to the local minima is still a problem of clustering. In this paper we develop a new algorithm for solving this problem based on a tabu search technique—Tabu K-Harmonic means (TabuKHM). The experiment results on the Iris and the other well known data, illustrate the robustness of the TabuKHM clustering algorithm.  相似文献   

2.
The K-means algorithm has been a widely applied clustering technique, especially in the area of marketing research. In spite of its popularity and ability to deal with large volumes of data quickly and efficiently, K-means has its drawbacks, such as its inability to provide good solution quality and robustness. In this paper, an extended study of the K-means algorithm is carried out. We propose a new clustering algorithm that integrates the concepts of hierarchical approaches and the K-means algorithm to yield improved performance in terms of solution quality and robustness. This proposed algorithm and score function are introduced and thoroughly discussed. Comparison studies with the K-means algorithm and three popular K-means initialization methods using five well-known test data sets are also presented. Finally, a business application involving segmenting credit card users demonstrates the algorithm's capability.  相似文献   

3.
An new initialization method for fuzzy c-means algorithm   总被引:1,自引:0,他引:1  
In this paper an initialization method for fuzzy c-means (FCM) algorithm is proposed in order to solve the two problems of clustering performance affected by initial cluster centers and lower computation speed for FCM. Grid and density are needed to extract approximate clustering center from sample space. Then, an initialization method for fuzzy c-means algorithm is proposed by using amount of approximate clustering centers to initialize classification number, and using approximate clustering centers to initialize initial clustering centers. Experiment shows that this method can improve clustering result and shorten clustering time validly.  相似文献   

4.
Fuzzy c-means clustering algorithm (FCM) can provide a non-parametric and unsupervised approach to the cluster analysis of data. Several efforts of fuzzy clustering have been undertaken by Bezdek and other researchers. Earlier studies in this field have reported problems due to the setting of optimum initial condition, cluster validity measure, and high computational load. More recently, the fuzzy clustering has benefited of a synergistic approach with Genetic Algorithms (GA) that play the role of an useful optimization technique that helps to better tolerate some classical drawbacks, such as sensitivity to initialization, noise and outliers, and susceptibility to local minima. We propose a genetic-level clustering methodology able to cluster objects represented by R p spaces. The unsupervised cluster algorithm, called SFCM (Spatial Fuzzy c-Means), is based on a fuzzy clustering c-means method that searches the best fuzzy partition of the universe assuming that the evaluation of each object with respect to some features is unknown, but knowing that it belongs to circular regions of R 2 space. Next we present a Java implementation of the algorithm, which provides a complete and efficient visual interaction for the setting of the parameters involved into the system. To demonstrate the applications of SFCM, we discuss a case study where it is shown the generality of our model by treating a simple 3-way data fuzzy clustering as example of a multicriteria optimization problem.  相似文献   

5.
Application of honey-bee mating optimization algorithm on clustering   总被引:4,自引:0,他引:4  
Cluster analysis is one of attractive data mining technique that use in many fields. One popular class of data clustering algorithms is the center based clustering algorithm. K-means used as a popular clustering method due to its simplicity and high speed in clustering large datasets. However, K-means has two shortcomings: dependency on the initial state and convergence to local optima and global solutions of large problems cannot found with reasonable amount of computation effort. In order to overcome local optima problem lots of studies done in clustering. Over the last decade, modeling the behavior of social insects, such as ants and bees, for the purpose of search and problem solving has been the context of the emerging area of swarm intelligence. Honey-bees are among the most closely studied social insects. Honey-bee mating may also be considered as a typical swarm-based approach to optimization, in which the search algorithm is inspired by the process of marriage in real honey-bee. Honey-bee has been used to model agent-based systems. In this paper, we proposed application of honeybee mating optimization in clustering (HBMK-means). We compared HBMK-means with other heuristics algorithm in clustering, such as GA, SA, TS, and ACO, by implementing them on several well-known datasets. Our finding shows that the proposed algorithm works than the best one.  相似文献   

6.
Cluster analysis is a popular technique in statistics and computer science with the objective of grouping similar observations in relatively distinct groups generally known as clusters. Semi-supervised clustering assumes that some additional information about group memberships is available. Under the most frequently considered scenario, labels are known for some portion of data and unavailable for the rest of observations. In this paper, we discuss a general type of semi-supervised clustering defined by so called positive and negative constraints. Under positive constraints, some data points are required to belong to the same cluster. On the contrary, negative constraints specify that particular points must represent different data groups. We outline a general framework for semi-supervised clustering with constraints naturally incorporating the additional information into the EM algorithm traditionally used in mixture modeling and model-based clustering. The developed methodology is illustrated on synthetic and classification datasets. A dendrochronology application is considered and thoroughly discussed.  相似文献   

7.
CGS算法是求解大型非对称线性方程组的常用算法,然而该算法无极小残差性质,因此它常因出现较大的中间剩余向量而出现典型的不规则收敛行为.本根据IRA方法提出了一种压缩预处理CGS方法,数值实验表明这种算法在一定程度上减小了迭代算法在收敛过程中的剩余问题,从而使得算法具有更好的稳定性,该法构造简单,减少了收敛次数,加快了收敛速度.  相似文献   

8.
Finite mixture models are well known for their flexibility in modeling heterogeneity in data. Model-based clustering is an important application of mixture models, which assumes that each mixture component distribution can adequately model a particular group of data. Unfortunately, when more than one component is needed for each group, the appealing one-to-one correspondence between mixture components and groups of data is ruined and model-based clustering loses its attractive interpretation. Several remedies have been considered in literature. We discuss the most promising recent results obtained in this area and propose a new algorithm that finds partitionings through merging mixture components relying on their pairwise overlap. The proposed technique is illustrated on a popular classification and several synthetic datasets, with excellent results.  相似文献   

9.
One of the most significant discussions in the field of machine learning today is on the clustering ensemble. The clustering ensemble combines multiple partitions generated by different clustering algorithms into a single clustering solution. Genetic algorithms are known for their high ability to solve optimization problems, especially the problem of the clustering ensemble. To date, despite the major contributions to find consensus cluster partitions with application of genetic algorithms, there has been little discussion on population initialization through generative mechanisms in genetic-based clustering ensemble algorithms as well as the production of cluster partitions with favorable fitness values in first phase clustering ensembles. In this paper, a threshold fuzzy C-means algorithm, named TFCM, is proposed to solve the problem of diversity of clustering, one of the most common problems in clustering ensembles. Moreover, TFCM is able to increase the fitness of cluster partitions, such that it improves performance of genetic-based clustering ensemble algorithms. The fitness average of cluster partitions generated by TFCM are evaluated by three different objective functions and compared against other clustering algorithms. In this paper, a simple genetic-based clustering ensemble algorithm, named SGCE, is proposed, in which cluster partitions generated by the TFCM and other clustering algorithms are used as the initial population used by the SGCE. The performance of the SGCE is evaluated and compared based on the different initial populations used. The experimental results based on eleven real world datasets demonstrate that TFCM improves the fitness of cluster partitions and that the performance of the SGCE is enhanced using initial populations generated by the TFCM.  相似文献   

10.
A modified approach had been developed in this study by combining two well-known algorithms of clustering, namely fuzzy c-means algorithm and entropy-based algorithm. Fuzzy c-means algorithm is one of the most popular algorithms for fuzzy clustering. It could yield compact clusters but might not be able to generate distinct clusters. On the other hand, entropy-based algorithm could obtain distinct clusters, which might not be compact. However, the clusters need to be both distinct as well as compact. The present paper proposes a modified approach of clustering by combining the above two algorithms. A genetic algorithm was utilized for tuning of all three clustering algorithms separately. The proposed approach was found to yield both distinct as well as compact clusters on two data sets.  相似文献   

11.
Model-based clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation of model-based clustering, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also considered in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.  相似文献   

12.
Given a row-stochastic matrix describing pairwise similarities between data objects, spectral clustering makes use of the eigenvectors of this matrix to perform dimensionality reduction for clustering in fewer dimensions. One example from this class of algorithms is the Robust Perron Cluster Analysis (PCCA+), which delivers a fuzzy clustering. Originally developed for clustering the state space of Markov chains, the method became popular as a versatile tool for general data classification problems. The robustness of PCCA+, however, cannot be explained by previous perturbation results, because the matrices in typical applications do not comply with the two main requirements: reversibility and nearly decomposability. We therefore demonstrate in this paper that PCCA+ always delivers an optimal fuzzy clustering for nearly uncoupled, not necessarily reversible, Markov chains with transition states.  相似文献   

13.
Finite mixture models represent one of the most popular tools for modeling heterogeneous data. The traditional approach for parameter estimation is based on maximizing the likelihood function. Direct optimization is often troublesome due to the complex likelihood structure. The expectation–maximization algorithm proves to be an effective remedy that alleviates this issue. The solution obtained by this procedure is entirely driven by the choice of starting parameter values. This highlights the importance of an effective initialization strategy. Despite efforts undertaken in this area, there is no uniform winner found and practitioners tend to ignore the issue, often finding misleading or erroneous results. In this paper, we propose a simple yet effective tool for initializing the expectation–maximization algorithm in the mixture modeling setting. The idea is based on model averaging and proves to be efficient in detecting correct solutions even in those cases when competitors perform poorly. The utility of the proposed methodology is shown through comprehensive simulation study and applied to a well-known classification dataset with good results.  相似文献   

14.
Tag SNP selection is an important problem in genetic association studies. A class of algorithms to perform this task, among them a popular tool called Tagger, can be described as searching for a minimal vertex cover of a graph. In this article this approach is contrasted with a recently introduced clustering algorithm based on the graph theoretical concept of dominant sets. To compare the performance of both procedures comprehensive simulation studies have been performed using SNP data from the ten ENCODE regions included in the HapMap project. Quantitative traits have been simulated from additive models with a single causative SNP. Simulation results suggest that clustering performs always at least as good as Tagger, while in more than a third of the considered instances substantial improvement can be observed. Additionally an extension of the clustering algorithm is described which can be used for larger genomic data sets.  相似文献   

15.
Stochastic blockmodels and variants thereof are among the most widely used approaches to community detection for social networks and relational data. A stochastic blockmodel partitions the nodes of a network into disjoint sets, called communities. The approach is inherently related to clustering with mixture models; and raises a similar model selection problem for the number of communities. The Bayesian information criterion (BIC) is a popular solution, however, for stochastic blockmodels, the conditional independence assumption given the communities of the endpoints among different edges is usually violated in practice. In this regard, we propose composite likelihood BIC (CL-BIC) to select the number of communities, and we show it is robust against possible misspecifications in the underlying stochastic blockmodel assumptions. We derive the requisite methodology and illustrate the approach using both simulated and real data. Supplementary materials containing the relevant computer code are available online.  相似文献   

16.
Unsupervised classification is a highly important task of machine learning methods. Although achieving great success in supervised classification, support vector machine (SVM) is much less utilized to classify unlabeled data points, which also induces many drawbacks including sensitive to nonlinear kernels and random initializations, high computational cost, unsuitable for imbalanced datasets. In this paper, to utilize the advantages of SVM and overcome the drawbacks of SVM-based clustering methods, we propose a completely new two-stage unsupervised classification method with no initialization: a new unsupervised kernel-free quadratic surface SVM (QSSVM) model is proposed to avoid selecting kernels and related kernel parameters, then a golden-section algorithm is designed to generate the appropriate classifier for balanced and imbalanced data. By studying certain properties of proposed model, a convergent decomposition algorithm is developed to implement this non-covex QSSVM model effectively and efficiently (in terms of computational cost). Numerical tests on artificial and public benchmark data indicate that the proposed unsupervised QSSVM method outperforms well-known clustering methods (including SVM-based and other state-of-the-art methods), particularly in terms of classification accuracy. Moreover, we extend and apply the proposed method to credit risk assessment by incorporating the T-test based feature weights. The promising numerical results on benchmark personal credit data and real-world corporate credit data strongly demonstrate the effectiveness, efficiency and interpretability of proposed method, as well as indicate its significant potential in certain real-world applications.  相似文献   

17.
The field of cluster analysis is primarily concerned with the partitioning of data points into different clusters so as to optimize a certain criterion. Rapid advances in technology have made it possible to address clustering problems via optimization theory. In this paper, we present a global optimization algorithm to solve the fuzzy clustering problem, where each data point is to be assigned to (possibly) several clusters, with a membership grade assigned to each data point that reflects the likelihood of the data point belonging to that cluster. The fuzzy clustering problem is formulated as a nonlinear program, for which a tight linear programming relaxation is constructed via the Reformulation-Linearization Technique (RLT) in concert with additional valid inequalities. This construct is embedded within a specialized branch-and-bound (B&B) algorithm to solve the problem to global optimality. Computational experience is reported using several standard data sets from the literature as well as using synthetically generated larger problem instances. The results validate the robustness of the proposed algorithmic procedure and exhibit its dominance over the popular fuzzy c-means algorithmic technique and the commercial global optimizer BARON.  相似文献   

18.
The field of cluster analysis is primarily concerned with the sorting of data points into different clusters so as to optimize a certain criterion. Rapid advances in technology have made it possible to address clustering problems via optimization theory. In this paper, we present a global optimization algorithm to solve the hard clustering problem, where each data point is to be assigned to exactly one cluster. The hard clustering problem is formulated as a nonlinear program, for which a tight linear programming relaxation is constructed via the Reformulation-Linearization Technique (RLT) in concert with additional valid inequalities that serve to defeat the inherent symmetry in the problem. This construct is embedded within a specialized branch-and-bound algorithm to solve the problem to global optimality. Pertinent implementation issues that can enhance the efficiency of the branch-and-bound algorithm are also discussed. Computational experience is reported using several standard data sets found in the literature as well as using synthetically generated larger problem instances. The results validate the robustness of the proposed algorithmic procedure and exhibit its dominance over the popular k-means clustering technique. Finally, a heuristic procedure to obtain a good quality solution at a relative ease of computational effort is also described.  相似文献   

19.
A clustering methodology based on biological visual models that imitates how humans visually cluster data by spatially associating patterns has been recently proposed. The method is based on Cellular Neural Networks and some resolution adjustments. The Cellular Neural Network rebuilds low-density areas while different resolutions find the best clustering option. The algorithm has demonstrated good performance compared to other clustering techniques. However, its main drawbacks correspond to its inability to operate with more than two-dimensional data sets and the computational time required for the resolution adjustment mechanism. This paper proposes a new version of this clustering methodology to solve such flaws. In the new approach, a pre-processing stage is incorporated featuring a Self-Organization Map that maps complex high-dimensional relations into a reduced lattice yet preserving the topological organization of the initial data set. This reduced representation is employed as the two-dimensional data set for further processing. In the new version, the resolution adjustment process is also accelerated through the use of an optimization method that combines the Hill-Climbing and the Random Search techniques. By incorporating such mechanisms rather than evaluating all possible resolutions, the optimization strategy finds the best resolution for a clustering problem by using a limited number of iterations. The proposed approach has been evaluated, considering several two-dimensional and high-dimensional datasets. Experimental evidence exhibits that the proposed algorithm performs the clustering task over complex problems delivering a 46% faster on average than the original method. The approach is also compared to other popular clustering techniques reported in the literature. Computational experiments demonstrate competitive results in comparison to other algorithms in terms of accuracy and robustness.  相似文献   

20.
This paper addresses the problem of estimating a density, with either a compact support or a support bounded at only one end, exploiting a general and natural form of a finite mixture of distributions. Due to the importance of the concept of multimodality in the mixture framework, unimodal beta and gamma densities are used as mixture components, leading to a flexible modeling approach. Accordingly, a mode-based parameterization of the components is provided. A partitional clustering method, named $k$ -bumps, is also proposed; it is used as an ad hoc initialization strategy in the EM algorithm to obtain the maximum likelihood estimation of the mixture parameters. The performance of the $k$ -bumps algorithm as an initialization tool, in comparison to other common initialization strategies, is evaluated through some simulation experiments. Finally, two real applications are presented.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号