Similar literature
Found 20 similar documents; search took 15 ms
1.
The mixture approach to clustering is an important technique in cluster analysis. A mixture of multivariate multinomial distributions is usually used to analyze categorical data under the latent class model. Parameter estimation is an important step in fitting a mixture distribution. Described here are four approaches to estimating the parameters of a mixture of multivariate multinomial distributions. The first approach is an extended maximum likelihood (ML) method. The second is based on the well-known expectation maximization (EM) algorithm. The third is the classification maximum likelihood (CML) algorithm. In this paper, we propose a new approach using the so-called fuzzy class model and then create the fuzzy classification maximum likelihood (FCML) approach for categorical data. The accuracy, robustness and effectiveness of these four types of algorithms for estimating the parameters of multivariate binomial mixtures are compared using real empirical data and samples drawn from two-class multivariate binomial mixtures. The results show that the proposed FCML algorithm offers better accuracy, robustness and effectiveness. Overall, the FCML algorithm is superior to the ML, EM and CML algorithms. Thus, we recommend FCML as another good tool for estimating the parameters of multivariate multinomial mixture models.
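
As a rough illustration of the EM approach mentioned in this abstract, the sketch below fits a mixture of independent Bernoulli variables, the binary special case of the latent class model. It is not the authors' implementation and does not cover the ML, CML or FCML variants; the function name `em_bernoulli_mixture`, the random initialization and the clamping constant are assumptions made for this example.

```python
import math
import random


def em_bernoulli_mixture(data, k, n_iter=50, seed=0):
    """EM for a k-class mixture of independent Bernoulli variables.

    data: list of equal-length lists of 0/1 values.
    Returns (class weights, per-class success probabilities,
    log-likelihood at the start of each iteration).
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    weights = [1.0 / k] * k
    # Hypothetical initialization: random probabilities away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    loglik_trace = []

    for _ in range(n_iter):
        # E-step: posterior class probabilities for every observation.
        resp = []
        loglik = 0.0
        for x in data:
            joint = []
            for c in range(k):
                p = weights[c]
                for j in range(d):
                    p *= probs[c][j] if x[j] else 1.0 - probs[c][j]
                joint.append(p)
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        loglik_trace.append(loglik)

        # M-step: responsibility-weighted relative frequencies.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / n
            for j in range(d):
                s = sum(r[c] * x[j] for r, x in zip(resp, data))
                # Clamp so that no probability reaches exactly 0 or 1.
                probs[c][j] = min(max(s / nc, 1e-6), 1.0 - 1e-6)

    return weights, probs, loglik_trace
```

The returned log-likelihood trace should be non-decreasing, which is a convenient sanity check when experimenting with the sketch.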

2.
Advances in Data Analysis and Classification - Statisticians are already aware that any task (exploration, prediction) involving a modeling process is largely dependent on the measurement units for...

3.
Complex data, such as those where each statistical unit under study is described not by a single observation (or vector variable) but by a unit-specific sample of several or even many observations, are becoming increasingly common. Reducing these sample data to summary statistics, like the average or the median, implies that most of the inherent information (about variability, skewness or multi-modality) gets lost. Full information is preserved only if each unit is described by a whole distribution. This new kind of data, a.k.a. “distribution-valued data”, requires the development of adequate statistical methods. This paper presents a method to group a set of probability density functions (pdfs) into homogeneous clusters, given that the pdfs have to be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally thought of as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a proper mixture model on the space of pdfs. The issue of model building is challenging here because of the infinite dimensionality and the non-Euclidean geometry of the domain space. By adopting a wavelet-based representation for the elements in the space, the task is accomplished by using mixture models for hyper-spherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.

4.
This work develops a general procedure for clustering functional data which adapts the high-dimensional data clustering (HDDC) method, originally proposed in the multivariate context. The resulting clustering method, called funHDDC, is based on a functional latent mixture model which fits the functional data in group-specific functional subspaces. By constraining model parameters within and between groups, a family of parsimonious models is obtained that can be fitted to a variety of situations. An estimation procedure based on the EM algorithm is proposed for determining both the model parameters and the group-specific functional subspaces. Experiments on real-world datasets show that the proposed approach performs better than or similarly to classical two-step clustering methods while providing useful interpretations of the groups and avoiding the difficult choice of the discretization technique. In particular, funHDDC appears to always outperform HDDC applied to spline coefficients.

5.
We present a dual-view mixture model to cluster users based on their features and latent behavioral functions. Every component of the mixture model represents a probability density over a feature view for observed user attributes and a behavior view for latent behavioral functions that are indirectly observed through user actions or behaviors. Our task is to infer the groups of users as well as their latent behavioral functions. We also propose a non-parametric version based on a Dirichlet Process to automatically infer the number of clusters. We test the properties and performance of the model on a synthetic dataset that represents the participation of users in the threads of an online forum. Experiments show that dual-view models outperform single-view ones when one of the views lacks information.

6.
This study deals with ordinal data in the performance analysis framework and provides a weight-restricted DEA model to obtain the preference score of each unit under assessment. The obtained scores are used to rank the DMUs. Furthermore, to decrease the complexity of the provided model, the number of constraints is reduced through linear transformations.

7.
In Bayesian analysis of the multidimensional scaling model with an MCMC algorithm, we encounter the indeterminacy of rotation, reflection and translation of the parameter matrix of interest. This type of indeterminacy may be seen in other multivariate latent variable models as well. In this paper, we propose to address this indeterminacy problem with a novel, offline post-processing method that is easily implemented using easy-to-use Markov chain Monte Carlo (MCMC) software. Specifically, we propose a post-processing method based on the generalized extended Procrustes analysis to address this problem. The proposed method is compared with four existing methods for dealing with indeterminacy through analyses of artificial as well as real datasets. The proposed method achieved at least as good a performance as the best existing method. The benefit of the offline processing approach in the era of easy-to-use MCMC software is discussed.

8.
Mixture model-based clustering, usually applied to multidimensional data, has become a popular approach in many data analysis problems, both for its good statistical properties and for the simplicity of implementation of the Expectation-Maximization (EM) algorithm. Within the context of a railway application, this paper introduces a novel mixture model for dealing with time series that are subject to changes in regime. The proposed approach, called ClustSeg, consists in modeling each cluster by a regression model in which the polynomial coefficients vary according to a discrete hidden process. In particular, this approach makes use of logistic functions to model the (smooth or abrupt) transitions between regimes. The model parameters are estimated by the maximum likelihood method solved by an EM algorithm. This approach can also be regarded as a clustering approach which operates by finding groups of time series having common changes in regime. In addition to providing a time series partition, it therefore provides a time series segmentation. The problem of selecting the optimal numbers of clusters and segments is solved by means of the Bayesian Information Criterion. The ClustSeg approach is shown to be efficient using a variety of simulated time series and real-world time series of electrical power consumption from rail switching operations.

9.
This paper studies case deletion diagnostics for multilevel models. Using subset deletion, diagnostic measures for identifying influential units at any level are developed for both fixed and random parameters. Two approximate update formulae are derived. The first formula uses one-step approximation, while the second formula also includes the impact of estimating the random parameter. Two examples are used to illustrate the methodology developed.

10.
Given a row-stochastic matrix describing pairwise similarities between data objects, spectral clustering makes use of the eigenvectors of this matrix to perform dimensionality reduction for clustering in fewer dimensions. One example from this class of algorithms is the Robust Perron Cluster Analysis (PCCA+), which delivers a fuzzy clustering. Originally developed for clustering the state space of Markov chains, the method became popular as a versatile tool for general data classification problems. The robustness of PCCA+, however, cannot be explained by previous perturbation results, because the matrices in typical applications do not comply with the two main requirements: reversibility and near decomposability. We therefore demonstrate in this paper that PCCA+ always delivers an optimal fuzzy clustering for nearly uncoupled, not necessarily reversible, Markov chains with transition states.

11.
Variable selection is an important problem for cluster analysis of high-dimensional data. It is also a difficult one. The difficulty originates not only from the lack of class information but also from the fact that high-dimensional data are often multifaceted and can be meaningfully clustered in multiple ways. In such a case the effort to find one subset of attributes that presumably gives the “best” clustering may be misguided. It makes more sense to identify various facets of a data set (each being based on a subset of attributes), cluster the data along each one, and present the results to the domain experts for appraisal and selection. In this paper, we propose a generalization of Gaussian mixture models and demonstrate its ability to automatically identify natural facets of data and cluster data along each of those facets simultaneously. We present empirical results to show that facet determination usually leads to better clustering results than variable selection.

12.
The problem of aggregating a set of ordinal rankings of n alternatives has given rise to a number of consensus models. Among the most common of these models are those due to Borda and Kendall, which amount to using average ranks, and the ℓ1 and ℓ2 distance models. A common criticism of these approaches is their use of ordinal rank position numbers directly as the values of being ranked at those levels. This paper presents a general framework for associating value or worth with ordinal ranks, and develops models for deriving a consensus based on this framework. It is shown that the ℓp distance models using this framework are equivalent to the conventional ordinal models for any p ⩾ 1. This observation can be seen as a form of validation of the practice of using ordinal data in a manner for which it was presumably not designed. In particular, it establishes the robustness of the simple Borda, Kendall and median ranking models.
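
The average-rank idea behind the Borda and Kendall models can be sketched in a few lines. This is only an illustration of that one idea, not the value-based framework the paper proposes; the function name `average_rank_consensus`, the dict-based input format and the alphabetical tie-break are assumptions made for this example.

```python
def average_rank_consensus(rankings):
    """Order alternatives by their average rank position.

    rankings: list of dicts mapping each alternative to its rank
    position (1 = best); every voter is assumed to rank every alternative.
    Returns (consensus order, average rank per alternative).
    """
    alternatives = list(rankings[0])
    avg = {a: sum(r[a] for r in rankings) / len(rankings) for a in alternatives}
    # Lower average rank is better; ties are broken by name (an assumption).
    order = sorted(alternatives, key=lambda a: (avg[a], a))
    return order, avg
```

For instance, three voters ranking the alternatives a, b and c as (1, 2, 3), (2, 1, 3) and (1, 3, 2) give a the lowest average rank, so a heads the consensus order.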

13.
14.
This paper examines the problem of aggregating ordinal preferences on a set of alternatives into a consensus. This problem has been the subject of study for more than two centuries and many procedures have been developed to create a compromise or consensus. We examine a variety of structures for preference specification, and in each case review the related models for deriving a consensus. Two classes of consensus models are discussed, namely ad hoc methods, evolving primarily from parliamentary settings over the past 200 years, and distance or axiomatic-based methods. We demonstrate the levels of complexity of the various distance-based models by presenting the related mathematical programming formulations for them. We also present conditions for equivalence, that is, for yielding the same consensus ranking for some of the methods. Finally, we discuss various extensions of the basic ordinal ranking structures, paying specific attention to partial ranking, voting member weighted consensus, ranking with intensity of preference, and rank correlation methods, as alternative approaches to deriving a consensus. Suggestions for future research directions are given.

15.
Model-based clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation of model-based clustering, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also considered in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.

16.

A novel criterion for estimating a latent partition of the observed groups based on the output of a hierarchical model is presented. It is based on a loss function combining the Gini income inequality ratio and the predictability index of Goodman and Kruskal in order to achieve maximum heterogeneity of random effects across groups and maximum homogeneity of predicted probabilities inside estimated clusters. The index is compared with alternative approaches in a simulation study and applied in a case study concerning the role of hospital-level variables in the decision to perform a cesarean section.


17.
Data in social and behavioral sciences are often hierarchically organized. Multilevel statistical methodology was developed to analyze such data. Most of the procedures for analyzing multilevel data are derived from maximum likelihood based on the normal distribution assumption. Standard errors for parameter estimates in these procedures are obtained from the corresponding information matrix. Because practical data typically contain heterogeneous marginal skewnesses and kurtoses, this paper studies how nonnormally distributed data affect the standard errors of parameter estimates in a two-level structural equation model. Specifically, we study how skewness and kurtosis in one level affect standard errors of parameter estimates within its level and outside its level. We also show that, parallel to asymptotic robustness theory in conventional factor analysis, conditions exist for asymptotic robustness of standard errors in a multilevel factor analysis model.

18.
The local clustering coefficients of preferential attachment models are analyzed. Previously, a general approach to preferential attachment was proposed (the PA-class was introduced); it was shown that the degree distribution in all models of the PA-class obeys a power law. The global clustering coefficient was also analyzed, and a lower bound for the mean local clustering coefficient was found. In this paper, new results are obtained by analyzing the local clustering coefficients of models of the PA-class. Namely, the behavior of the mean value C_2(n, d) of local clustering over vertices of degree d is studied.
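
To make the quantity in this abstract concrete, the sketch below computes the mean local clustering coefficient over vertices of each degree d on a finite graph. It assumes a simple undirected graph stored as a dict of neighbor sets and assigns coefficient 0 to vertices of degree below 2; both are conventions chosen for this example, and the function names are hypothetical. It does not generate graphs from the PA-class.

```python
from collections import defaultdict


def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are themselves adjacent."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention assumed here for degree 0 or 1
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))


def mean_clustering_by_degree(adj):
    """Map each degree d to the mean local clustering of degree-d vertices."""
    by_degree = defaultdict(list)
    for v in adj:
        by_degree[len(adj[v])].append(local_clustering(adj, v))
    return {d: sum(cs) / len(cs) for d, cs in by_degree.items()}
```

On a triangle with one extra pendant vertex, the two degree-2 vertices have coefficient 1, the degree-3 vertex has 1/3, and the pendant vertex has 0.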

19.
In CUB models the uncertainty of choice is explicitly modelled as a Combination of discrete Uniform and shifted Binomial random variables. The basic concept of modelling the response as a mixture of a deliberate choice of a response category and an uncertainty component, represented by a uniform distribution on the response categories, is extended to a much wider class of models. The deliberate choice can in particular be determined by classical ordinal response models such as the cumulative and adjacent-categories models. Then one obtains the traditional and flexible models as special cases when the uncertainty component is irrelevant. It is shown that the effect of explanatory variables is underestimated if the uncertainty component is neglected in a cumulative type mixture model. Visualization tools for the effects of variables are proposed and the modelling strategies are evaluated by use of real data sets. It is demonstrated that the extended class of models frequently yields better fit than classical ordinal response models without an uncertainty component.
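
The basic CUB mixture described at the start of this abstract can be written out directly. The sketch below uses a parametrization that is common in the CUB literature, with mixing weight `pi` on the shifted binomial and parameter `xi`; the abstract itself does not state a parametrization, so treat the exact form and the function name `cub_pmf` as assumptions. The extended model class proposed in the paper is not covered.

```python
from math import comb


def cub_pmf(m, pi, xi):
    """Probabilities of response categories 1..m under a basic CUB model.

    P(R = r) = pi * C(m-1, r-1) * (1 - xi)**(r-1) * xi**(m-r) + (1 - pi) / m
    """
    shifted_binomial = [
        comb(m - 1, r - 1) * (1 - xi) ** (r - 1) * xi ** (m - r)
        for r in range(1, m + 1)
    ]
    # Mix the deliberate-choice component with the discrete uniform.
    return [pi * b + (1 - pi) / m for b in shifted_binomial]
```

Setting `pi` to 0 leaves only the uniform uncertainty component, while `pi` equal to 1 removes it and leaves the shifted binomial alone.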

20.
