Similar Documents
20 similar documents found
1.
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.
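A minimal sketch of the incremental scheme using scikit-learn's GaussianMixture as a stand-in for the authors' model-based clustering (the model-selection and stopping details are simplified; the BIC stop, sample fraction, and quantile threshold here are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def incremental_gmm(X, n_init_components=3, sample_frac=0.1,
                    loglik_quantile=0.05, max_new=5, seed=0):
    """Fit a GMM to a sample, extend it to the full data with further EM,
    then add components initialized from poorly fit observations."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), int(sample_frac * len(X)), replace=False)]
    gmm = GaussianMixture(n_init_components, random_state=seed).fit(sample)
    # Extend to the full data: warm-start EM from the sample fit.
    gmm = GaussianMixture(n_init_components, means_init=gmm.means_,
                          weights_init=gmm.weights_, random_state=seed).fit(X)
    for _ in range(max_new):
        loglik = gmm.score_samples(X)
        poor = X[loglik < np.quantile(loglik, loglik_quantile)]
        means = np.vstack([gmm.means_, poor.mean(axis=0)])
        cand = GaussianMixture(gmm.n_components + 1, means_init=means,
                               random_state=seed).fit(X)
        if cand.bic(X) >= gmm.bic(X):   # stop when BIC no longer improves
            break
        gmm = cand
    return gmm
```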

2.
The use of a finite mixture of normal distributions in model-based clustering allows us to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and in general either achieved by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework, we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior, where the hyperparameters are carefully selected such that they are reflective of the cluster structure aimed at. In addition, this prior allows us to estimate the model using standard MCMC sampling methods. In combination with a post-processing approach which resolves the label switching issue and results in an identified model, our approach allows us to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semiparametric way using finite mixtures of normals and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark datasets. Supplementary materials for this article are available online.

3.
Exact global optimization of the clusterwise regression problem is challenging and there are currently no published feasible methods for performing this clustering optimally, even though it has been over thirty years since its original proposal. This work explores global optimization of the clusterwise regression problem using mathematical programming and related issues. A mixed logical-quadratic programming formulation with implication of constraints is presented and contrasted against a quadratic formulation based on the traditional big-M, which cannot guarantee optimality because the regression line coefficients, and thus errors, may be arbitrarily large. Clusterwise regression optimization times and solution optimality for two clusters are empirically tested on twenty real datasets and three series of synthetic datasets ranging from twenty to one hundred observations and from two to ten independent variables. Additionally, a few small real datasets are clustered into three lines.
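For reference, the traditional big-M formulation the abstract contrasts against can be written as follows (a textbook reconstruction, not the paper's exact logical-quadratic model):

```latex
\min_{\beta,\, z,\, e}\ \sum_{i=1}^{n} e_i^2
\quad \text{s.t.}\quad
-e_i - M(1 - z_{ik}) \le y_i - \mathbf{x}_i^{\top}\beta_k \le e_i + M(1 - z_{ik}),
\qquad
\sum_{k=1}^{K} z_{ik} = 1, \quad z_{ik} \in \{0,1\}.
```

Because nothing bounds the coefficients β_k themselves, no finite M is guaranteed to be valid, which is exactly why the abstract notes that the big-M variant cannot certify optimality.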

4.
Many clustering techniques are available for extracting meaningful information from real-world data, but both the quality of the resulting clusters and the running time of the algorithms matter in practice. This work argues that fuzzy clustering is well suited to finding meaningful information and appropriate groups in real-world datasets. In fuzzy clustering, the objective function governs both the resulting clusters and the computational steps of the algorithm. Research on fuzzy clustering algorithms therefore aims to minimize an objective function that typically comprises several computational parts: calculating cluster prototypes, computing the degree of membership of each object, and updating and stopping the algorithm. This paper introduces new effective fuzzy objective functions, with associated fuzzy parameters, that help reduce running time and yield strongly meaningful clusters in real-world datasets. It further derives new update rules for memberships and centers by minimizing the proposed objective functions, and reports experimental results that illustrate the effectiveness of the proposed methods.
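The abstract does not specify its new objective functions, but the classical fuzzy c-means objective they build on is J_m = Σ_i Σ_k u_ik^m ||x_i − v_k||², with the standard alternating updates sketched below (a generic FCM, not the authors' proposed variants):

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, tol=1e-5, max_iter=300, seed=0):
    """Classical FCM: alternate center and membership updates until
    the membership matrix stops changing."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)               # fuzzy memberships
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]    # cluster centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)))          # inverse-distance weights
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            break
        U = U_new
    return U, V
```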

5.
Convex clustering, a convex relaxation of k-means clustering and hierarchical clustering, has drawn recent attention since it nicely addresses the instability issue of traditional nonconvex clustering methods. Although its computational and statistical properties have been recently studied, the performance of convex clustering has not yet been investigated in the high-dimensional clustering scenario, where the data contain a large number of features and many of them carry no information about the clustering structure. In this article, we demonstrate that the performance of convex clustering can be distorted when uninformative features are included in the clustering. To overcome this, we introduce a new clustering method, referred to as Sparse Convex Clustering, to simultaneously cluster observations and conduct feature selection. The key idea is to formulate convex clustering in a form of regularization, with an adaptive group-lasso penalty term on cluster centers. To optimally balance the trade-off between cluster fitting and sparsity, a tuning criterion based on clustering stability is developed. Theoretically, we obtain a finite sample error bound for our estimator and further establish its variable selection consistency. The effectiveness of the proposed method is examined through a variety of numerical experiments and a real data application. Supplementary material for this article is available online.
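As I read the abstract, the regularized criterion has roughly the following shape (a reconstruction; the weights w_ij, u_j and the exact adaptive group-lasso form are assumptions). With center matrix A whose i-th row a_i is the center assigned to observation x_i:

```latex
\min_{A}\ \frac{1}{2}\sum_{i=1}^{n}\|x_i - a_i\|_2^2
\;+\; \gamma_1 \sum_{i<j} w_{ij}\,\|a_i - a_j\|_2
\;+\; \gamma_2 \sum_{j=1}^{p} u_j\,\|A_{\cdot j}\|_2 .
```

Fusing rows a_i = a_j merges observations into one cluster, while a column A_{·j} shrunk to zero removes feature j, which is how clustering and feature selection happen simultaneously.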

6.
Cluster analysis, the determination of natural subgroups in a data set, is an important statistical methodology used in many contexts. A major problem with the hierarchical clustering methods used today is the tendency for classification errors to occur when the empirical data depart from the ideal conditions of compact, isolated clusters. Many empirical data sets have structural imperfections that confound the identification of clusters. We use a Self-Organizing Map (SOM) neural network clustering methodology and demonstrate that it is superior to hierarchical clustering methods. The performance of the neural network and seven hierarchical clustering methods is tested on 252 data sets with various levels of imperfection, including data dispersion, outliers, irrelevant variables, and nonuniform cluster densities. The superior accuracy and robustness of the neural network can improve the effectiveness of decisions and research based on clustering messy empirical data.
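A minimal, self-contained SOM sketch in NumPy (grid size, learning-rate schedule, and neighborhood width are illustrative choices, not the parameters used in the study):

```python
import numpy as np

def train_som(X, rows=10, cols=10, n_iter=5000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a 2-D self-organizing map; returns the weight grid.
    Assumes features scaled to roughly [0, 1] to match the random init."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows, cols, X.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # Best-matching unit: the node whose weights are closest to x.
        d = np.linalg.norm(W - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        frac = t / n_iter
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 1e-3
        # Gaussian neighborhood pulls nodes near the BMU toward x.
        h = np.exp(-((grid - np.array(bmu)) ** 2).sum(-1) / (2 * sigma**2))
        W += lr * h[..., None] * (x - W)
    return W
```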

7.
Application of honey-bee mating optimization algorithm on clustering
Cluster analysis is an attractive data mining technique used in many fields. One popular class of clustering algorithms is center-based clustering, and k-means is a popular member of that class owing to its simplicity and speed on large datasets. However, k-means has two shortcomings: it depends on the initial state and converges to local optima, and global solutions of large problems cannot be found with a reasonable amount of computational effort. Many clustering studies have sought to overcome the local-optima problem. Over the last decade, modeling the behavior of social insects, such as ants and bees, for the purpose of search and problem solving has been the context of the emerging area of swarm intelligence. Honey bees are among the most closely studied social insects, and honey-bee mating may be considered a typical swarm-based approach to optimization, in which the search algorithm is inspired by the marriage process of real honey bees; it has also been used to model agent-based systems. In this paper, we propose an application of honey-bee mating optimization to clustering (HBMK-means). We compare HBMK-means with other clustering heuristics, such as GA, SA, TS, and ACO, by implementing them on several well-known datasets. Our findings show that the proposed algorithm performs better than the best of the compared algorithms.
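The abstract treats center-based clustering as an optimization problem; the fitness any such swarm method minimizes is the k-means distortion. A generic population-based search over center sets illustrates the setup (a plain mutate-and-select sketch, not the authors' HBMK-means mating operators):

```python
import numpy as np

def distortion(X, centers):
    """k-means objective: summed squared distance to the nearest center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def population_search(X, k=3, pop=20, n_gen=100, step=0.1, seed=0):
    """Keep the best center set (the 'queen'); mutate it to breed candidates."""
    rng = np.random.default_rng(seed)
    queen = X[rng.choice(len(X), k, replace=False)]
    best = distortion(X, queen)
    for _ in range(n_gen):
        broods = queen + step * rng.normal(size=(pop, k, X.shape[1]))
        fits = np.array([distortion(X, b) for b in broods])
        if fits.min() < best:
            best, queen = fits.min(), broods[fits.argmin()]
    return queen, best
```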

8.
In data stream environments, most conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive in a stream and these data points unfold with time. The problems of clustering time-evolving metric data and time-evolving categorical data have separately been well explored in recent years, but clustering mixed-type time-evolving data remains a challenging issue due to an awkward gap between the structures of metric and categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream, to dynamically cluster mixed-type time-evolving data. It comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects the drifting concept between the current sliding window and the previous sliding window; a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window based on the clustering result of the previous sliding window; and a visualization algorithm that analyses the relationship between the clusters at different timestamps and visualizes the evolving trends of the clusters. The efficacy of the proposed framework is shown by experiments on synthetic and real-world datasets.

9.
Clustering has been widely used to partition data into groups so that the degree of association is high among members of the same group and low among members of different groups. Though many effective and efficient clustering algorithms have been developed and deployed, most of them still lack an automatic or online way to decide the optimal number of clusters. In this paper, we define clustering gain as a measure of clustering optimality, based on the squared error sum as a clustering algorithm proceeds. When the measure is applied to a hierarchical clustering algorithm, an optimal number of clusters can be found. Our clustering measure shows good performance, producing intuitively reasonable clustering configurations in Euclidean space according to the evidence from experimental results. Furthermore, the measure can be utilized to estimate the desired number of clusters for partitional clustering methods as well. The clustering gain measure therefore provides a promising technique for achieving a higher level of quality for a wide range of clustering methods.
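The paper's exact gain formula is not given in the abstract; as a rough proxy under that caveat, one can track the within-cluster squared error sum while cutting a hierarchical clustering at successive k and take the elbow:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def sse(X, labels):
    """Within-cluster squared error sum."""
    return sum(((X[labels == c] - X[labels == c].mean(0)) ** 2).sum()
               for c in np.unique(labels))

def pick_k(X, k_max=10):
    """Proxy for the gain idea: cut a hierarchical clustering at
    k = 1..k_max and pick the elbow of the squared-error curve."""
    errs = [sse(X, AgglomerativeClustering(k).fit_predict(X))
            for k in range(1, k_max + 1)]
    gains = np.diff(errs, 2)            # curvature of the SSE curve
    return int(np.argmax(gains)) + 2    # middle point of the second difference
```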

10.
A k-means-type algorithm is proposed for efficiently clustering data constrained to lie on the surface of a p-dimensional unit sphere, or data that are mean-zero-unit-variance standardized observations such as those that occur when using Euclidean distance to cluster time series gene expression data using a correlation metric. We also provide methodology to initialize the algorithm and to estimate the number of clusters in the dataset. Results from a detailed series of experiments show excellent performance, even with very large datasets. The methodology is applied to the analysis of the mitotic cell division cycle of budding yeast dataset of Cho et al. [Molecular Cell (1998), 2, 65–73]. The entire dataset has not been analyzed previously, so our analysis provides an understanding for the complete set of genes acting in concert and differentially. We also use our methodology on the submitted abstracts of oral presentations made at the 2008 Joint Statistical Meetings (JSM) to identify similar topics. Our identified groups are both interpretable and distinct and the methodology provides a possible automated tool for efficient parallel scheduling of presentations at professional meetings.

The supplemental materials described in the article are available in the online supplements.
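A generic sketch of the spherical constraint the abstract describes: observations are projected to the unit sphere, assignment uses cosine similarity, and each mean is renormalized after updating (initialization and the authors' method for estimating the number of clusters are not reproduced here):

```python
import numpy as np

def spherical_kmeans(X, k=5, n_iter=100, seed=0):
    """k-means on the unit sphere: cosine-similarity assignment,
    centers renormalized after each mean update."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = (X @ C.T).argmax(axis=1)        # nearest center by cosine
        for j in range(k):
            if (labels == j).any():
                m = X[labels == j].mean(axis=0)
                C[j] = m / np.linalg.norm(m)     # project back to the sphere
    return labels, C
```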

11.
An appropriate distance is an essential ingredient in various real-world learning tasks. Distance metric learning studies metrics capable of reflecting the configuration of the data much better than commonly used defaults. We offer an algorithm for simultaneously learning a Mahalanobis-like distance and a K-means clustering, incorporating data rescaling into the clustering so that data separability grows iteratively in the rescaled space as it is sequentially clustered. At each step of the algorithm, a global optimization problem is solved to minimize the cluster distortions under the current cluster configuration. The obtained weight matrix can also be used as a cluster validation characteristic: the closeness of such matrices learned during a sampling process can indicate the clusters' readiness, i.e., estimate the true number of clusters. Numerical experiments performed on synthetic and real datasets verify the high reliability of the proposed method.
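A simplified reading of the alternation in code: cluster, estimate the pooled within-cluster scatter, and whiten by its inverse square root so the next round clusters in the rescaled space (an illustrative loop, not the authors' global optimization step):

```python
import numpy as np
from sklearn.cluster import KMeans

def rescale_and_cluster(X, k=3, n_rounds=5, seed=0):
    """Alternate k-means with a Mahalanobis-style rescaling by the
    pooled within-cluster scatter."""
    p = X.shape[1]
    W = np.eye(p)
    for _ in range(n_rounds):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X @ W)
        S = np.zeros((p, p))
        for c in np.unique(labels):
            Xc = X[labels == c]
            if len(Xc) > 1:
                S += np.cov(Xc.T) * len(Xc)
        S /= len(X)
        # Inverse square root of the scatter = whitening weight matrix.
        evals, evecs = np.linalg.eigh(S + 1e-8 * np.eye(p))
        W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return labels, W
```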

12.
We describe many vantage points on the Baire metric and its use in clustering data, or in preprocessing and structuring data to support search and retrieval operations. In some cases, we proceed directly to clusters without directly determining the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We also offer insights on the practical implications of the precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections.
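To make the "one pass" claim concrete: for numbers written in base 10, the Baire distance is 10^(-ℓ) with ℓ the length of the shared digit prefix, so bucketing by successive prefixes yields the hierarchy directly. A sketch, assuming values scaled to [0, 1):

```python
from collections import defaultdict

def baire_hierarchy(values, digits=4):
    """One pass over the data: group values by successive decimal-digit
    prefixes; each prefix length is one level of the hierarchy."""
    levels = []
    for p in range(1, digits + 1):
        buckets = defaultdict(list)
        for v in values:
            key = f"{v:.{digits}f}"[:p + 2]   # "0." plus the first p digits
            buckets[key].append(v)
        levels.append(dict(buckets))
    return levels
```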

13.
In this article, we present a randomized dynamic cluster algorithm for large data sets. It is based on the restricted random walk cluster algorithm by Schöll and Schöll-Paschinger that has given good results in past studies. We discuss different approaches for the clustering of dynamic data sets. In contrast to most of these methods, dynamic restricted random walk clustering is also efficient for a small percentage of changes in the data set and has the additional advantage that the updates asymptotically produce the same clusters as a reclustering with the static variant; there is thus no need for any reclustering ever. In addition, the method has a relatively low computational complexity which enables it to cluster large data sets.

14.
Cluster analysis is a popular technique in statistics and computer science with the objective of grouping similar observations into relatively distinct groups generally known as clusters. Semi-supervised clustering assumes that some additional information about group memberships is available. Under the most frequently considered scenario, labels are known for some portion of the data and unavailable for the rest of the observations. In this paper, we discuss a general type of semi-supervised clustering defined by so-called positive and negative constraints. Under positive constraints, some data points are required to belong to the same cluster; negative constraints, by contrast, specify that particular points must represent different data groups. We outline a general framework for semi-supervised clustering with constraints, naturally incorporating the additional information into the EM algorithm traditionally used in mixture modeling and model-based clustering. The developed methodology is illustrated on synthetic and classification datasets. A dendrochronology application is considered and thoroughly discussed.

15.
Multiblock component methods are applied to data sets for which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods in which one set of variables is explained by a set of predictor variables organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they provide suboptimal results when the observations actually come from different populations. The strategy presented in this article to palliate this problem is to use a technique such as clusterwise regression to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters with their own sets of regression coefficients; this combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion, by means of a sequential algorithm, ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing.

16.
Finite mixture regression models are useful for modeling the relationship between response and predictors arising from different subpopulations. In this article, we study high-dimensional predictors and a high-dimensional response, and propose two procedures to cluster observations according to the link between predictors and the response. To reduce the dimension, we propose to use the Lasso estimator, which takes into account the sparsity, and a maximum likelihood estimator penalized by the rank, which takes into account the matrix structure. To choose the number of components and the sparsity level, we construct a collection of models varying those two parameters, and we select a model from this collection with a non-asymptotic criterion. We extend these procedures to functional data, where predictors and responses are functions, using a wavelet-based approach. For each situation, we provide algorithms and apply and evaluate our methods on both simulated and real datasets to understand how they work in practice.

17.
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.

Supplemental materials with additional experiments for this article are available online.
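A minimal sketch of the reordered matrix-shading idea, using hierarchical-clustering leaf order as a stand-in seriation for both the between-cluster and within-cluster steps (the article's seriation algorithms are more sophisticated):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, leaves_list, fcluster

def dissimilarity_plot(X, k=3):
    """Shade the distance matrix, reordered by cluster and, within each
    cluster, by a seriation of its own submatrix."""
    D = squareform(pdist(X))
    Z = linkage(pdist(X), method="average")
    labels = fcluster(Z, k, criterion="maxclust")
    order = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) > 1:
            sub = linkage(pdist(X[idx]), method="average")
            idx = idx[leaves_list(sub)]       # within-cluster reordering
        order.extend(idx)
    plt.matshow(D[np.ix_(order, order)], cmap="gray")
    plt.show()
```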

18.
The primary model for cluster analysis is the latent class model. This model yields the mixture likelihood. Due to numerous local maxima, the success of the EM algorithm in maximizing the mixture likelihood depends on the initial starting point of the algorithm. In this article, good starting points for the EM algorithm are obtained by applying classification methods to randomly selected subsamples of the data. The performance of the resulting two-step algorithm, classification followed by EM, is compared to, and found superior to, the baseline algorithm of EM started from a random partition of the data. Though the algorithm is not complicated, comparing it to the baseline algorithm and assessing its performance with several classification methods is nontrivial. The strategy employed for comparing the algorithms is to identify canonical forms for the easiest and most difficult datasets to cluster within a large collection of cluster datasets and then to compare the performance of the two algorithms on these datasets. This has led to the discovery that, in the case of three homogeneous clusters, the most difficult datasets to cluster are those in which the clusters are arranged on a line and the easiest are those in which the clusters are arranged on an equilateral triangle. The performance of the two-step algorithm is assessed using several classification methods and is shown to be able to cluster large, difficult datasets consisting of three highly overlapping clusters arranged on a line with 10,000 observations and 8 variables.
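A compact sketch of the two-step start with scikit-learn, using k-means on the subsample as the classification step (the article evaluates several classification methods; the subsample fraction here is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def two_step_em(X, k=3, sub_frac=0.1, seed=0):
    """Classify a random subsample, then run EM on the full data
    starting from the resulting centers."""
    rng = np.random.default_rng(seed)
    sub = X[rng.choice(len(X), max(k, int(sub_frac * len(X))), replace=False)]
    centers = KMeans(n_clusters=k, n_init=10,
                     random_state=seed).fit(sub).cluster_centers_
    return GaussianMixture(k, means_init=centers, random_state=seed).fit(X)
```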

19.
We discuss the relation between classes and clusters in datasets with given classes. We examine the distribution of classes within the obtained clusters, using different clustering methods based on different techniques, and we also study the structure of the obtained clusters. One of the main conclusions of this research is that the notion of purity cannot always be used to evaluate the accuracy of clustering techniques.
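For reference, the purity score under discussion is usually computed as follows (assuming integer-coded class labels; this is the standard definition, whose limits the abstract argues for):

```python
import numpy as np

def purity(classes, clusters):
    """Fraction of points whose cluster's majority class matches theirs."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    total = sum(np.bincount(classes[clusters == c]).max()
                for c in np.unique(clusters))
    return total / len(classes)
```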

20.
Variational approximations have the potential to scale Bayesian computations to large datasets and highly parameterized models. Gaussian approximations are popular, but can be computationally burdensome when an unrestricted covariance matrix is employed and the dimension of the model parameter is high. To circumvent this problem, we consider a factor covariance structure as a parsimonious representation. General stochastic gradient ascent methods are described for efficient implementation, with gradient estimates obtained using the so-called “reparameterization trick.” The end result is a flexible and efficient approach to high-dimensional Gaussian variational approximation. We illustrate using robust P-spline regression and logistic regression models. For the latter, we consider eight real datasets, including datasets with many more covariates than observations, and another with mixed effects. In all cases, our variational method provides fast and accurate estimates. Supplementary material for this article is available online.
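A sketch of one reparameterized draw under the factor covariance structure Σ = BBᵀ + diag(d²) described in the abstract (names and shapes are mine; the surrounding stochastic-gradient-ascent loop is omitted):

```python
import numpy as np

def sample_factor_gaussian(mu, B, d, rng):
    """Reparameterized draw from N(mu, B B^T + diag(d^2)): the factor
    structure keeps the parameter count linear in the dimension, and the
    draw is differentiable in (mu, B, d)."""
    p, q = B.shape                       # p = parameter dim, q = factors
    eps_f, eps_d = rng.normal(size=q), rng.normal(size=p)
    return mu + B @ eps_f + d * eps_d
```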
