Similar documents (20 results)
1.
A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and prostate cancer patients, on whom mixed data have been recorded.

2.
The objective of this paper is to explore different modeling strategies to generate high-dimensional Bernoulli vectors. We discuss the multivariate Bernoulli (MB) distribution, probe its properties and examine three models for generating random vectors. A latent multivariate normal model whose bivariate distributions are approximated with Plackett distributions with univariate normal distributions is presented. A conditional mean model is examined where the conditional probability of success depends on previous history of successes. A mixture of beta distributions is also presented that expresses the probability of the MB vector as a product of correlated binary random variables. Each method has a domain of effectiveness. The latent model offers unpatterned correlation structures while the conditional mean and the mixture model provide computational feasibility for high-dimensional generation of MB vectors.
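
The latent-normal idea in this abstract can be illustrated in two dimensions. The sketch below is not the paper's method (it uses no Plackett approximation); it simply thresholds a correlated pair of standard normals. The function name `correlated_bernoulli_pairs` and its parameters are hypothetical, and `rho` is the correlation of the latent normals, which is generally larger in magnitude than the correlation of the resulting binary pair.

```python
import math
import random
from statistics import NormalDist


def correlated_bernoulli_pairs(p1, p2, rho, n, seed=0):
    """Draw n correlated Bernoulli pairs by thresholding a latent bivariate normal.

    p1, p2: marginal success probabilities (each strictly between 0 and 1).
    rho: correlation of the latent normals, not of the binary outcomes.
    """
    rng = random.Random(seed)
    # A standard normal falls below inv_cdf(p) with probability p.
    t1 = NormalDist().inv_cdf(p1)
    t2 = NormalDist().inv_cdf(p2)
    pairs = []
    for _ in range(n):
        g1 = rng.gauss(0.0, 1.0)
        g2 = rng.gauss(0.0, 1.0)
        z1 = g1
        # Mix g1 into z2 so that corr(z1, z2) = rho while var(z2) stays 1.
        z2 = rho * g1 + math.sqrt(1.0 - rho * rho) * g2
        pairs.append((int(z1 < t1), int(z2 < t2)))
    return pairs
```

For example, `correlated_bernoulli_pairs(0.3, 0.6, 0.5, 1000)` returns 1000 pairs whose marginal success rates are close to 0.3 and 0.6 and whose joint success rate exceeds the independent value 0.18.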

3.
All multivariate random variables with finite variances are univariate functions of uncorrelated random variables, and if the multivariate distribution is absolutely continuous, then these univariate functions are piecewise linear. They can be independent of the correlations in the Gaussian case.

4.
This article considers a graphical model for ordinal variables, where it is assumed that the data are generated by discretizing the marginal distributions of a latent multivariate Gaussian distribution. The relationships between these ordinal variables are then described by the underlying Gaussian graphical model and can be inferred by estimating the corresponding concentration matrix. Direct estimation of the model is computationally expensive, but an approximate EM-like algorithm is developed to provide an accurate estimate of the parameters at a fraction of the computational cost. Numerical evidence based on simulation studies shows the strong performance of the algorithm, which is also illustrated on datasets on movie ratings and an educational survey.

5.
Latent or unobserved phenomena pose a significant difficulty in data analysis as they induce complicated and confounding dependencies among a collection of observed variables. Factor analysis is a prominent multivariate statistical modeling approach that addresses this challenge by identifying the effects of (a small number of) latent variables on a set of observed variables. However, the latent variables in a factor model are purely mathematical objects that are derived from the observed phenomena, and they do not have any interpretation associated with them. A natural approach for attributing semantic information to the latent variables in a factor model is to obtain measurements of some additional plausibly useful covariates that may be related to the original set of observed variables, and to associate these auxiliary covariates with the latent variables. In this paper, we describe a systematic approach for identifying such associations. Our method is based on solving computationally tractable convex optimization problems, and it can be viewed as a generalization of the minimum-trace factor analysis procedure for fitting factor models via convex optimization. We analyze the theoretical consistency of our approach in a high-dimensional setting as well as its utility in practice via experimental demonstrations with real data.

6.
Block clustering aims to reveal homogeneous block structures in a data table. Among the different approaches to block clustering, we consider here a model-based method: the Gaussian latent block model for continuous data, which is an extension of the Gaussian mixture model for one-way clustering. For a given data table, several candidate models are usually examined, which differ, for example, in the number of clusters. Model selection then becomes a critical issue. To this end, we develop a criterion based on an approximation of the integrated classification likelihood for the Gaussian latent block model, and propose a Bayesian information criterion-like variant following the same pattern. We also propose a non-asymptotic exact criterion, thus circumventing the controversial definition of the asymptotic regime arising from the dual nature of the rows and columns in co-clustering. The experimental results show steady performance of these criteria for medium to large data tables.

7.
This article proposes a probability model for k-dimensional ordinal outcomes, that is, it considers inference for data recorded in k-dimensional contingency tables with ordinal factors. The proposed approach is based on full posterior inference, assuming a flexible underlying prior probability model for the contingency table cell probabilities. We use a variation of the traditional multivariate probit model, with latent scores that determine the observed data. In our model, a mixture of normals prior replaces the usual single multivariate normal model for the latent variables. By augmenting the prior model to a mixture of normals we generalize inference in two important ways. First, we allow for varying local dependence structure across the contingency table. Second, inference in ordinal multivariate probit models is plagued by problems related to the choice and resampling of cutoffs defined for these latent variables. We show how the proposed mixture model approach entirely removes these problems. We illustrate the methodology with two examples, one simulated dataset and one dataset of interrater agreement.

8.
Advances in Data Analysis and Classification - Finite mixtures of (multivariate) Gaussian distributions have broad utility, including their usage for model-based clustering. There is increasing...

9.
A clustering method is presented for analysing multivariate binary data with missing values. When not all values are observed, Govaert [3] has studied the relations between clustering methods and statistical models. The author has shown how the identification of a mixture of Bernoulli distributions with the same parameter for all clusters and for all variables corresponds to a clustering criterion which uses L1 distance characterizing the MNDBIN method (Marchetti [8]). He first generalized this model by selecting parameters which can depend on variables and finally by selecting parameters which can depend both on variables and on clusters. We use the previous models to derive a clustering method adapted to missing data. This method optimizes a criterion by a standard iterative partitioning algorithm which removes the necessity either to ignore objects or to substitute the missing data. We study several versions of this algorithm and, finally, a brief account is given of the application of this method to some simulated data.

10.
Univariate or multivariate ordinal responses are often assumed to arise from a latent continuous parametric distribution, with covariate effects that enter linearly. We introduce a Bayesian nonparametric modeling approach for univariate and multivariate ordinal regression, which is based on mixture modeling for the joint distribution of latent responses and covariates. The modeling framework enables highly flexible inference for ordinal regression relationships, avoiding assumptions of linearity or additivity in the covariate effects. In standard parametric ordinal regression models, computational challenges arise from identifiability constraints and estimation of parameters requiring nonstandard inferential techniques. A key feature of the nonparametric model is that it achieves inferential flexibility, while avoiding these difficulties. In particular, we establish full support of the nonparametric mixture model under fixed cut-off points that relate through discretization the latent continuous responses with the ordinal responses. The practical utility of the modeling approach is illustrated through application to two datasets from econometrics, an example involving regression relationships for ozone concentration, and a multirater agreement problem. Supplementary materials with technical details on theoretical results and on computation are available online.

11.
Supervised clustering of variables   (Total citations: 1; self-citations: 0; citations by others: 1)
In predictive modelling, highly correlated predictors lead to unstable models that are often difficult to interpret. The selection of features, or the use of latent components that reduce the complexity among correlated observed variables, are common strategies. Our objective with the new procedure that we advocate here is to achieve both purposes: to highlight the group structure among the variables and to identify the most relevant groups of variables for prediction. The proposed procedure is an iterative adaptation of a method developed for the clustering of variables around latent variables (CLV). Modification of the standard CLV algorithm leads to a supervised procedure, in the sense that the variable to be predicted plays an active role in the clustering. The latent variables associated with the groups of variables, selected for their “proximity” to the variable to be predicted and their “internal homogeneity”, are progressively added in a predictive model. The features of the methodology are illustrated based on a simulation study and a real-world application.

12.
We propose a new model for cluster analysis in a Bayesian nonparametric framework. Our model combines two ingredients, species sampling mixture models of Gaussian distributions on one hand, and a deterministic clustering procedure (DBSCAN) on the other. Here, two observations from the underlying species sampling mixture model share the same cluster if the distance between the densities corresponding to their latent parameters is smaller than a threshold; this yields a random partition which is coarser than the one induced by the species sampling mixture. Since this procedure depends on the value of the threshold, we suggest a strategy to fix it. In addition, we discuss implementation and applications of the model; comparison with more standard clustering algorithms will be given as well. Supplementary materials for the article are available online.

13.
We propose a Bayesian approach for inference in the multivariate probit model, taking into account the association structure between binary observations. We model the association through the correlation matrix of the latent Gaussian variables. Conditional independence is imposed by setting some off-diagonal elements of the inverse correlation matrix to zero and this sparsity structure is modeled using a decomposable graphical model. We propose an efficient Markov chain Monte Carlo algorithm relying on a parameter expansion scheme to sample from the resulting posterior distribution. This algorithm updates the correlation matrix within a simple Gibbs sampling framework and allows us to infer the correlation structure from the data, generalizing methods used for inference in decomposable Gaussian graphical models to multivariate binary observations. We demonstrate the performance of this model and of the Markov chain Monte Carlo algorithm on simulated and real datasets. This article has online supplementary materials.

14.
This article proposes an algorithm for generating over-dispersed and under-dispersed binomial variates with specified mean and variance. The over-dispersed/under-dispersed distributions are derived from correlated binary variables with an underlying continuous multivariate distribution. Different multivariate distributions or different correlation matrices result in different over-dispersed (or under-dispersed) distributions. The over-dispersed binomial distributions that are generated from three different correlation matrices of a multivariate normal are compared with the beta-binomial distribution for various mean and over-dispersion parameters by quantile-quantile (Q-Q) plots. The two distributions appear to be similar. The under-dispersed binomial distribution is simulated to model an example data set that exhibits under-dispersed binomial variation.
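
A rough illustration of how correlated binary variables produce over-dispersion: the sketch below sums binary indicators obtained by thresholding equicorrelated normals built from one shared factor. It is an assumption-laden toy, not the article's algorithm. It does not solve for a specified variance, it covers only non-negative latent correlation (so only over-dispersion), and the name `overdispersed_binomial` and its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def overdispersed_binomial(n_trials, p, rho, size, seed=0):
    """Draw `size` counts, each the sum of n_trials positively correlated Bernoulli(p) indicators.

    rho in [0, 1) is the common correlation of the latent normals.
    rho = 0 gives an ordinary binomial; rho > 0 inflates the variance.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(p)  # P(Z < threshold) = p for standard normal Z
    counts = []
    for _ in range(size):
        shared = rng.gauss(0.0, 1.0)  # factor common to all trials in this draw
        successes = 0
        for _trial in range(n_trials):
            noise = rng.gauss(0.0, 1.0)
            # Unit-variance latent score with pairwise correlation rho across trials.
            z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise
            successes += int(z < threshold)
        counts.append(successes)
    return counts
```

The mean of the counts stays near `n_trials * p`, while their variance exceeds the binomial value `n_trials * p * (1 - p)` whenever `rho > 0`.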

15.
This paper proposes a new methodology to model uncertainties associated with functional random variables. This methodology makes it possible to deal simultaneously with several dependent functional variables and to address the specific case where these variables are linked to a vectorial variable, called a covariate. In this case, the proposed uncertainty modelling methodology has two objectives: to retain both the most important features of the functional variables and the features that are most correlated with the covariate. This methodology is composed of two steps. First, the functional variables are decomposed on a functional basis. To deal simultaneously with several dependent functional variables, a Simultaneous Partial Least Squares algorithm is proposed to estimate this basis. Second, the joint probability density function of the coefficients selected in the decomposition is modelled by a Gaussian mixture model. A new sparse method based on a Lasso penalization algorithm is proposed to estimate the Gaussian mixture model parameters and reduce their number. Several criteria are introduced to assess the methodology's performance: its ability to approximate the probability distribution of the functional variables, their dependence structure, and the features that explain the covariate. Finally, the whole methodology is applied to a simulated example and to a nuclear reliability test case.

16.
A mixture approach to clustering is an important technique in cluster analysis. A mixture of multivariate multinomial distributions is usually used to analyze categorical data with a latent class model. Parameter estimation is an important step for a mixture distribution. Described here are four approaches to estimating the parameters of a mixture of multivariate multinomial distributions. The first approach is an extended maximum likelihood (ML) method. The second approach is based on the well-known expectation maximization (EM) algorithm. The third approach is the classification maximum likelihood (CML) algorithm. In this paper, we propose a new approach using the so-called fuzzy class model and then create the fuzzy classification maximum likelihood (FCML) approach for categorical data. The accuracy, robustness and effectiveness of these four types of algorithms for estimating the parameters of multivariate binomial mixtures are compared using real empirical data and samples drawn from the multivariate binomial mixtures of two classes. The results show that the proposed FCML algorithm presents better accuracy, robustness and effectiveness. Overall, the FCML algorithm is superior to the ML, EM and CML algorithms. Thus, we recommend FCML as another good tool for estimating the parameters of mixture multivariate multinomial models.
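
Of the four estimation approaches named in this abstract, the EM algorithm is the easiest to sketch. The code below is a minimal illustration for the binary special case (each variable has two categories, with conditional independence inside each latent class); it is not the paper's implementation and does not cover ML, CML or FCML. The function name `em_bernoulli_mixture`, the random initialisation range and the iteration count are all assumptions.

```python
import math
import random


def em_bernoulli_mixture(data, k, iters=50, seed=0):
    """Fit a k-class latent class model to binary rows by EM.

    Returns (weights, theta, history): class weights, per-class success
    probabilities for each variable, and the log-likelihood per iteration.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    weights = [1.0 / k] * k
    # Start away from 0 and 1 so every row has positive likelihood.
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    history = []
    for _ in range(iters):
        # E-step: posterior class probabilities (responsibilities) for each row.
        resp = []
        loglik = 0.0
        for x in data:
            joint = [
                weights[c]
                * math.prod(theta[c][j] if x[j] else 1.0 - theta[c][j] for j in range(d))
                for c in range(k)
            ]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([v / total for v in joint])
        history.append(loglik)
        # M-step: responsibility-weighted proportions.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / n
            if nc > 0.0:  # leave theta unchanged for an empty class
                theta[c] = [
                    sum(r[c] * x[j] for r, x in zip(resp, data)) / nc for j in range(d)
                ]
    return weights, theta, history
```

A useful sanity check on any EM implementation is that the recorded log-likelihood never decreases from one iteration to the next.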

17.
A common practice in customer satisfaction analysis is to administer surveys where subjects are asked to express opinions on a number of statements, or satisfaction scales, by use of ordered categorical responses. Motivated by this application, we propose a pseudo‐likelihood approach to estimate the dependence structure among multivariate categorical variables. As is commonly done in this area, we assume that the responses are related to latent continuous variables that are truncated to induce categorical responses. A Gaussian likelihood is assumed for the latent variables, leading to the so‐called ordered probit model. Because the calculation of the exact likelihood is computationally demanding, we adopt an approximate solution based on pairwise likelihood. To assess the performance of the approach, simulation studies are conducted comparing the proposed method with standard likelihood methods. A parametric bootstrap approach to evaluate the variance of the maximum pairwise likelihood estimator is proposed and discussed. An application to a customer satisfaction survey is performed, showing the effectiveness of the approach in the presence of covariates and under other generalizations of the model. Copyright © 2015 John Wiley & Sons, Ltd.
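
The ordered probit construction mentioned in this abstract can be made concrete for a single variable: a latent normal score is cut at fixed thresholds, and each ordered category receives the probability mass between two adjacent thresholds. The sketch below assumes unit latent variance and a hypothetical function name `ordered_probit_probs`; the pairwise-likelihood estimation itself is not shown.

```python
from statistics import NormalDist


def ordered_probit_probs(mu, cutoffs):
    """Category probabilities when a latent N(mu, 1) score is cut at sorted cutoffs.

    With m cutoffs there are m + 1 ordered categories.
    """
    std_normal = NormalDist()
    # Pad with 0 and 1 so the first and last categories cover the two tails.
    cdf = [0.0] + [std_normal.cdf(c - mu) for c in cutoffs] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]
```

With `mu = 0` and cutoffs `[-1, 0, 1]` the four category probabilities are symmetric; raising `mu` shifts mass toward the higher categories.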

18.
This paper presents a finite mixture of multivariate betas as a new model-based clustering method tailored to applications where the feature space is constrained to the unit hypercube. The mixture component densities are taken to be conditionally independent, univariate unimodal beta densities (from the subclass of reparameterized beta densities given by Bagnato and Punzo in Comput Stat 28(4):10.1007/s00180-012-367-4, 2013). The EM algorithm used to fit this mixture is discussed in detail, and results from both this beta mixture model and the more standard Gaussian model-based clustering are presented for simulated skill mastery data from a common cognitive diagnosis model and for real data from the Assistment System online mathematics tutor (Feng et al. in J User Model User Adap Inter 19(3):243–266, 2009). The multivariate beta mixture appears to outperform the standard Gaussian model-based clustering approach, as would be expected on the constrained space. Fewer components are selected (by BIC-ICL) in the beta mixture than in the Gaussian mixture, and the resulting clusters seem more reasonable and interpretable.

19.
We study the exact distribution of linear combinations of order statistics of arbitrary (absolutely continuous) dependent random variables. In particular, we examine the case where the random variables have a joint elliptically contoured distribution and the case where the random variables are exchangeable. We investigate also the particular L-statistics that simply yield a set of order statistics, and study their joint distribution. We present the application of our results to genetic selection problems, design of cellular phone receivers, and visual acuity. We give illustrative examples based on the multivariate normal and multivariate Student t distributions.

20.
When clustering multivariate observations that follow a mixture model of Gaussian distributions, projections of the observations onto a linear subspace of lower dimensionality, called the discriminant space (DS), quite frequently contain all the statistical information about the cluster structure of the model. In this case, reducing the data dimensionality substantially simplifies the solution of various classification problems. In this paper, attention is devoted to statistical testing of hypotheses about the DS and its dimension. The characterization of the DS and methods for its identification are also briefly discussed.
