Similar documents
20 similar documents found.
1.
Among the areas of data and text mining employed today in OR, science, economics and technology, clustering serves as a preprocessing step in data analysis. An important component of clustering theory is determination of the true number of clusters, a problem that has not been satisfactorily solved. In our paper, this problem is addressed by the cluster stability approach. For several possible numbers of clusters, we estimate the stability of the partitions obtained from clustering of samples; partitions are considered consistent if their clusters are stable. Cluster validity is measured by the total number of edges in the clusters' minimal spanning trees that connect points from different samples; in effect, we use the Friedman and Rafsky two-sample test statistic. The homogeneity hypothesis of well-mingled samples within the clusters leads to an asymptotic normal distribution of this statistic. Resting on this fact, the standard score of the mentioned edge count is computed, and partition quality is represented by the worst cluster, corresponding to the minimal standard-score value. It is natural to expect that the true number of clusters can be characterized by the empirical distribution having the shortest left tail. The proposed methodology sequentially creates the described distribution and estimates its left asymmetry. Several numerical experiments demonstrate the ability of the approach to detect the true number of clusters.
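The edge-counting idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the null mean and standard deviation of the cross-sample edge count are estimated here by permuting the sample labels rather than taken from the closed-form Friedman–Rafsky asymptotics.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def cross_sample_edges(points, labels):
    """Count MST edges joining points from different samples.

    points : (n, d) array of one cluster's points (both samples pooled)
    labels : length-n array of 0/1 marking which sample each point came from
    Under the homogeneity hypothesis (well-mingled samples) this count is
    asymptotically normal.
    """
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist).tocoo()
    return int(np.sum(labels[mst.row] != labels[mst.col]))

def standard_score(points, labels, n_perm=200):
    """Standard score of the observed cross-edge count against null moments
    estimated by permuting the sample labels (a stand-in for the
    closed-form moments)."""
    rng = np.random.default_rng(0)
    observed = cross_sample_edges(points, labels)
    null = [cross_sample_edges(points, rng.permutation(labels))
            for _ in range(n_perm)]
    return (observed - np.mean(null)) / (np.std(null) + 1e-12)
```

Well-mingled samples give a score near zero; spatially separated samples give a strongly negative score, since the MST then contains far fewer cross-sample edges than expected.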

2.
Fixed effects models are very flexible because they do not make assumptions on the distribution of effects and can also be used if the heterogeneity component is correlated with explanatory variables. A disadvantage is the large number of effects that have to be estimated. A recursive partitioning (or tree based) method is proposed that identifies clusters of units that share the same effect. The approach reduces the number of parameters to be estimated and is useful in particular if one is interested in identifying clusters with the same effect on a response variable. It is shown that the method performs well and outperforms competitors like the finite mixture model in particular if the heterogeneity component is correlated with explanatory variables. In two applications the usefulness of the approach to identify clusters that share the same effect is illustrated. Supplementary materials for this article are available online.

3.
In this article, we propose a novel Bayesian nonparametric clustering algorithm based on a Dirichlet process mixture of Dirichlet distributions, which have been shown to be very flexible for modeling proportional data. The idea is to let the number of mixture components increase as new data to cluster arrive, in such a manner that the model selection problem (i.e. determination of the number of clusters) can be answered without recourse to classic selection criteria. Thus, the proposed model can be considered an infinite Dirichlet mixture model. An expectation propagation inference framework is developed to learn this model by obtaining a full posterior distribution on its parameters. Within this learning framework, the model complexity and all the involved parameters are evaluated simultaneously. To show the practical relevance and efficiency of our model, we perform a detailed analysis using extensive simulations based on both synthetic and real data. In particular, real data are drawn from three challenging applications, namely image categorization, anomaly intrusion detection and video summarization.

4.
The use of a finite mixture of normal distributions in model-based clustering allows us to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and in general either achieved by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework, we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior, where the hyperparameters are carefully selected such that they are reflective of the cluster structure aimed at. In addition, this prior allows us to estimate the model using standard MCMC sampling methods. In combination with a post-processing approach which resolves the label switching issue and results in an identified model, our approach allows us to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semiparametric way using finite mixtures of normals and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark datasets. Supplementary materials for this article are available online.

5.
In this paper we introduce a new method for the cluster analysis of longitudinal data, focusing on the determination of uncertainty levels for cluster memberships. The method uses the Dirichlet-t distribution, which exploits the robustness of the Student-t distribution within a Bayesian semi-parametric framework: together with robust clustering of subjects, it evaluates the uncertainty level of each subject's membership in its cluster. We let both the number of clusters and the uncertainty levels be unknown while fitting Dirichlet process mixture models. Two simulation studies demonstrate the proposed methodology, and the method is applied to cluster a real data set taken from gene expression studies.

6.
For several years, model-based clustering methods have successfully tackled many of the challenges faced by data analysts. However, as the scope of data analysis has evolved, some problems may be beyond the standard mixture model framework. One such problem arises when observations in a dataset come from overlapping clusters, whereby different clusters possess similar parameters for multiple variables. In this setting, mixed membership models, a soft clustering approach in which observations are not restricted to single cluster membership, have proved to be an effective tool. In this paper, a method for fitting mixed membership models to data generated by a member of an exponential family is outlined. The method is applied to count data obtained from an ultra running competition, and compared with a standard mixture model approach.

7.
In this article, we propose an improvement on the sequential updating and greedy search (SUGS) algorithm for fast fitting of Dirichlet process mixture models. The SUGS algorithm provides a means for very fast approximate Bayesian inference for mixture data which is particularly of use when datasets are so large that many standard Markov chain Monte Carlo (MCMC) algorithms cannot be applied efficiently, or take a prohibitively long time to converge. In particular, these ideas are used to initially interrogate the data, and to refine models such that one can potentially apply exact data analysis later on. SUGS relies upon sequentially allocating data to clusters and proceeding with an update of the posterior on the subsequent allocations and parameters which assumes this allocation is correct. Our modification softens this approach, by providing a probability distribution over allocations, with a similar computational cost; this approach has an interpretation as a variational Bayes procedure and hence we term it variational SUGS (VSUGS). It is shown in simulated examples that VSUGS can outperform, in terms of density estimation and classification, a version of the SUGS algorithm in many scenarios. In addition, we present a data analysis for flow cytometry data, and SNP data via a three-class Dirichlet process mixture model, illustrating the apparent improvement over the original SUGS algorithm.

8.
Poisson mixtures are usually used to describe overdispersed data. Finite Poisson mixtures are used in many practical situations where often it is of interest to determine the number of components in the mixture. Identifying how many components comprise a mixture remains a difficult problem. The likelihood ratio test (LRT) is a general statistical procedure to use. Unfortunately, a number of specific problems arise and the classical theory fails to hold. In this paper a new procedure is proposed that is based on testing whether a new component can be added to a finite Poisson mixture which eventually leads to the number of components in the mixture. It is a sequential testing procedure based on the well known LRT that utilises a resampling technique to construct the distribution of the test statistic. The application of the procedure to real data reveals some interesting features of the distribution of the test statistic.
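One step of such a sequential procedure can be sketched as an EM fit plus a parametric-bootstrap LRT. This is a hypothetical sketch, not the paper's procedure: `fit_poisson_mix` and `bootstrap_lrt_pvalue` are invented names, and the bootstrap size and EM settings are illustrative.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

def fit_poisson_mix(x, k, n_iter=100, n_restart=3):
    """EM for a k-component Poisson mixture; best of a few random starts.
    Returns (weights, rates, log-likelihood)."""
    best = None
    for seed in range(n_restart):
        rng = np.random.default_rng(seed)
        w = np.full(k, 1.0 / k)
        lam = rng.choice(x, size=k) + rng.uniform(0.1, 1.0, size=k)
        for _ in range(n_iter):
            # E-step: log responsibilities, normalized per observation
            logr = np.log(w) + poisson.logpmf(x[:, None], lam)
            logr -= logsumexp(logr, axis=1, keepdims=True)
            r = np.exp(logr)
            # M-step: update weights and rates
            w = np.maximum(r.mean(axis=0), 1e-12)
            lam = (r * x[:, None]).sum(axis=0) / np.maximum(r.sum(axis=0), 1e-12)
        ll = logsumexp(np.log(w) + poisson.logpmf(x[:, None], lam), axis=1).sum()
        if best is None or ll > best[2]:
            best = (w, lam, ll)
    return best

def bootstrap_lrt_pvalue(x, k, n_boot=30, seed=1):
    """Test H0 'k components' against 'k+1' components: the null
    distribution of the LRT statistic is built by refitting both models
    on parametric-bootstrap samples drawn from the fitted H0 model."""
    w, lam, ll0 = fit_poisson_mix(x, k)
    lrt_obs = 2 * (fit_poisson_mix(x, k + 1)[2] - ll0)
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(n_boot):
        comp = rng.choice(k, size=len(x), p=w / w.sum())
        xb = rng.poisson(lam[comp])
        llb0 = fit_poisson_mix(xb, k)[2]
        exceed += 2 * (fit_poisson_mix(xb, k + 1)[2] - llb0) >= lrt_obs
    return exceed / n_boot
```

Repeating the test for k = 1, 2, … until the null is no longer rejected yields an estimate of the number of components.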

9.

In model-based clustering, mixture models are used to group data points into clusters. A useful concept, introduced for Gaussian mixtures by Malsiner Walli et al. (Stat Comput 26:303–324, 2016), is that of sparse finite mixtures, where the prior distribution on the weight distribution of a mixture with K components is chosen in such a way that a priori the number of clusters in the data is random and is allowed to be smaller than K with high probability. The number of clusters is then inferred a posteriori from the data. The present paper makes the following contributions in the context of sparse finite mixture modelling. First, it is illustrated that the concept of sparse finite mixtures is very generic and easily extended to cluster various types of non-Gaussian data, in particular discrete data and continuous multivariate data arising from non-Gaussian clusters. Second, sparse finite mixtures are compared to Dirichlet process mixtures with respect to their ability to identify the number of clusters. For both model classes, a random hyperprior is considered for the parameters determining the weight distribution. By suitable matching of these priors, it is shown that the choice of this hyperprior is far more influential on the cluster solution than whether a sparse finite mixture or a Dirichlet process mixture is taken into consideration.


10.
In model-based clustering, the density of each cluster is usually assumed to be a certain basic parametric distribution, for example, the normal distribution. In practice, it is often difficult to decide which parametric distribution is suitable to characterize a cluster, especially for multivariate data. Moreover, the densities of individual clusters may be multimodal themselves, and therefore cannot be accurately modeled by basic parametric distributions. This article explores a clustering approach that models each cluster by a mixture of normals. The resulting overall model is a multilayer mixture of normals. Algorithms to estimate the model and perform clustering are developed based on the classification maximum likelihood (CML) and mixture maximum likelihood (MML) criteria. BIC and ICL-BIC are examined for choosing the number of normal components per cluster. Experiments on both simulated and real data are presented.
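The BIC versus ICL-BIC comparison for choosing the number of normal components within a cluster can be sketched as follows; this is a minimal illustration using scikit-learn's `GaussianMixture`, with the common approximation ICL-BIC = BIC + 2 × (entropy of the soft assignments). The function names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl_bic(gmm, x):
    """ICL-BIC = BIC plus twice the entropy of the soft assignments
    (lower is better); the entropy term penalizes poorly separated
    components."""
    r = gmm.predict_proba(x)
    entropy = -np.sum(r * np.log(np.clip(r, 1e-12, None)))
    return gmm.bic(x) + 2 * entropy

def components_per_cluster(x, max_k=4):
    """Pick the number of normal components for one cluster's points
    by minimizing ICL-BIC over k = 1..max_k."""
    scores = [(k, icl_bic(GaussianMixture(k, n_init=3, random_state=0).fit(x), x))
              for k in range(1, max_k + 1)]
    return min(scores, key=lambda t: t[1])[0]
```

A unimodal cluster should select one component, while a clearly bimodal cluster should select two.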

11.
Mixture models have become one of the most popular techniques in data analysis. Because they rest on an explicit mathematical model, they usually produce more accurate results than traditional methods of cluster analysis, and the key factor is the number of sub-populations (components) in the mixture model, which determines the final result of the analysis. The expectation-maximization (EM) algorithm, an iterative algorithm for maximum likelihood estimation from incomplete data or data with missing values, is commonly used for parameter estimation in mixture models and more widely in machine learning and clustering. Researchers often use AIC and BIC to determine the number of components, but in practice these two methods are not stable and may even produce wrong results. To address this problem, this paper proposes a new method that uses a scree plot of the likelihood function to determine the number of components in a mixture model. Experimental results show that under most ideal conditions the number of components determined by this method agrees with that determined by AIC and BIC, while on typical real data or under non-ideal conditions the scree-plot method can give more reliable results. The new method is then applied to parameter estimation on the Yellowstone (Old Faithful) geyser eruption data.

12.
Fixed point clustering is a new stochastic approach to cluster analysis. The definition of a single fixed point cluster (FPC) is based on a simple parametric model, but there is no parametric assumption for the whole dataset as opposed to mixture modeling and other approaches. An FPC is defined as a data subset that is exactly the set of non-outliers with respect to its own parameter estimators. This paper concentrates upon the theoretical foundation of FPC analysis as a method for clusterwise linear regression, i.e., the single clusters are modeled as linear regressions with normal errors. In this setup, fixed point clustering is based on an iteratively reweighted estimation with zero weight for all outliers. FPCs are non-hierarchical, but they may overlap and include each other. A specification of the number of clusters is not needed. Consistency results are given for certain mixture models of interest in cluster analysis. Convergence of a fixed point algorithm is shown. Application to a real dataset shows that fixed point clustering can highlight some other interesting features of datasets compared to maximum likelihood methods in the presence of deviations from the usual assumptions of model based cluster analysis.

13.
Cluster analysis of genome-wide expression data from DNA microarray hybridization studies is a useful tool for identifying biologically relevant gene groupings (DeRisi et al. 1997; Weiler et al. 1997). It is hence important to apply a rigorous yet intuitive clustering algorithm to uncover these genomic relationships. In this study, we describe a novel clustering algorithm framework based on a variant of the Generalized Benders Decomposition, denoted as the Global Optimum Search (Floudas et al. 1989; Floudas 1995), which includes a procedure to determine the optimal number of clusters to be used. The approach involves a pre-clustering of data points to define an initial number of clusters and the iterative solution of a Linear Programming problem (the primal problem) and a Mixed-Integer Linear Programming problem (the master problem), that are derived from a Mixed Integer Nonlinear Programming problem formulation. Badly placed data points are removed to form new clusters, thus ensuring tight groupings amongst the data points and incrementing the number of clusters until the optimum number is reached. We apply the proposed clustering algorithm to experimental DNA microarray data centered on the Ras signaling pathway in the yeast Saccharomyces cerevisiae and compare the results to those obtained with some commonly used clustering algorithms. Our algorithm compares favorably against these algorithms in the aspects of intra-cluster similarity and inter-cluster dissimilarity, often considered two key tenets of clustering. Furthermore, our algorithm can predict the optimal number of clusters, and the biological coherence of the predicted clusters is analyzed through gene ontology.

14.
Many data clustering algorithms exist, but they cannot adequately handle an unknown number of clusters or arbitrary cluster shapes; their performance depends mainly on the choice of algorithm parameters. Our approach to data clustering does not require such a parameter choice; it can be treated as a natural adaptation to the existing structure of distances between data points. The outlier factor introduced by the author specifies a degree of being an outlier for each data point. The notion is based on the difference between the frequency distribution of interpoint distances in a given dataset and the corresponding distribution for uniformly distributed points. Data clusters are then determined by maximizing the outlier factor function: the data points are divided into clusters according to the attraction regions of local optima. An experimental evaluation shows that the proposed method can identify complex cluster shapes. Key advantages of the approach are good clustering properties for datasets with a comparatively large amount of noise (additional data points), and the absence of critical parameters whose choice determines the quality of the results.

15.
With high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a sound method for analyzing such data that performs clustering and variable selection simultaneously. The method is inspired by the plaid model and may be seen as a multiplicative mixture model that allows for overlapping clustering. Unlike conventional clustering, within this model an observation may be explained by several clusters, a characteristic that makes it especially suitable for gene expression data. Parameter estimation is performed with the Monte Carlo expectation maximization algorithm and importance sampling. Using extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to gene expression data for kidney renal cell carcinoma taken from The Cancer Genome Atlas validates some previously identified cancer biomarkers.

16.
This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type have the peculiarity that differences among clusters may be caused by changes in level as well as in shape: different clusters may be separated in different subregions, and there may be no single subregion in which all clusters are separated. In each division step, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the largest number of clusters can be separated; the number of clusters is estimated via the gap statistic. Functions are assigned to the new clusters by combining the k-means algorithm with functional boxplots, which identify functions that have been incorrectly classified because of atypical local behavior. The DivClusFD method provides the number of clusters, the classification of the observed functions into clusters, and guidelines for interpreting the clusters. A simulation study using synthetic data and tests on real data sets indicate that the method is able to classify functions accurately.
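The gap statistic used above compares the log within-cluster dispersion on the data with its expectation under a uniform reference distribution. A minimal multivariate sketch (not the DivClusFD functional version) with k-means and a simplified "largest gap" selection rule, rather than the original one-standard-error rule, could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(x, k_max=6, n_ref=10, seed=0):
    """Gap(k) = E_ref[log W_k] - log W_k(data), where W_k is the
    within-cluster sum of squares and the reference points are drawn
    uniformly over the data's bounding box."""
    rng = np.random.default_rng(seed)
    lo, hi = x.min(axis=0), x.max(axis=0)

    def log_wk(data, k):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        return np.log(km.inertia_)

    gaps = []
    for k in range(1, k_max + 1):
        ref = np.mean([log_wk(rng.uniform(lo, hi, size=x.shape), k)
                       for _ in range(n_ref)])
        gaps.append(ref - log_wk(x, k))
    return np.array(gaps)
```

With the simplified rule, the estimated number of clusters is `argmax(gaps) + 1`; the original procedure instead picks the smallest k whose gap is within one standard error of the next gap.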

17.
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.

18.
19.
A new classification method
Building on attribute clustering networks, this paper proposes a heap nearest-neighbor classification method. By adding supervised information to unsupervised attribute clustering, the number of heaps can be selected adaptively. The number of neighbors examined for each sample depends on the size of the heap it belongs to, so the number of neighbors examined is not the same for every sample. The method can be applied to classification problems with high-dimensional, small-sample data. We apply it to cancer identification from gene expression profiles; the results show that classification performance is substantially improved.

20.
A complex sequence of tests on components and the system is a part of many manufacturing processes. Statistical imperfect test and repair models can be used to derive the properties of such test sequences but require model parameters to be specified. We describe a technique for estimating such parameters from typical data that are available from past testing. A Gaussian mixture model is used to illustrate the approach and as a model that can represent the wide variety of statistical properties of test data, including outliers, multimodality and skewness. Model fitting was carried out using a Bayesian approach, implemented by MCMC. Copyright © 2011 John Wiley & Sons, Ltd.
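The Bayesian MCMC fitting mentioned above can be illustrated with a toy Gibbs sampler. This is a hypothetical sketch, not the paper's model: univariate data, known unit observation variance, a N(0, 10²) prior on component means, and a symmetric Dirichlet prior on the weights.

```python
import numpy as np

def gibbs_gmm(x, k=2, n_iter=500, seed=0):
    """Minimal Gibbs sampler for a univariate Gaussian mixture with known
    unit variance: alternately samples allocations, weights (Dirichlet)
    and component means (conjugate normal). Returns the posterior mean of
    the sorted component means (sorting handles label switching)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k)
    w = np.full(k, 1.0 / k)
    keep = []
    for it in range(n_iter):
        # sample allocations given weights and means
        logp = np.log(w) - 0.5 * (x[:, None] - mu) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (rng.random(len(x))[:, None] > np.cumsum(p, axis=1)).sum(axis=1)
        # sample weights ~ Dirichlet(1 + counts)
        counts = np.bincount(z, minlength=k)
        w = rng.dirichlet(1 + counts)
        # sample means from the conjugate normal posterior
        for j in range(k):
            xj = x[z == j]
            prec = 1 / 100 + len(xj)      # prior precision + data precision
            mu[j] = rng.normal(xj.sum() / prec, np.sqrt(1 / prec))
        if it >= n_iter // 2:              # keep the second half as posterior draws
            keep.append(np.sort(mu.copy()))
    return np.mean(keep, axis=0)
```

A full treatment would also sample component variances and address outliers and skewness, as the abstract's test-data setting requires.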
