首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 562 毫秒
1.
Abstract

The primary model for cluster analysis is the latent class model. This model yields the mixture likelihood. Due to numerous local maxima, the success of the EM algorithm in maximizing the mixture likelihood depends on the initial starting point of the algorithm. In this article, good starting points for the EM algorithm are obtained by applying classification methods to randomly selected subsamples of the data. The performance of the resulting two-step algorithm, classification followed by EM, is compared to, and found superior to, the baseline algorithm of EM started from a random partition of the data. Though the algorithm is not complicated, comparing it to the baseline algorithm and assessing its performance with several classification methods is nontrivial. The strategy employed for comparing the algorithms is to identify canonical forms for the easiest and most difficult datasets to cluster within a large collection of cluster datasets and then to compare the performance of the two algorithms on these datasets. This has led to the discovery that, in the case of three homogeneous clusters, the most difficult datasets to cluster are those in which the clusters are arranged on a line and the easiest are those in which the clusters are arranged on an equilateral triangle. The performance of the two-step algorithm is assessed using several classification methods and is shown to be able to cluster large, difficult datasets consisting of three highly overlapping clusters arranged on a line with 10,000 observations and 8 variables.  相似文献   

2.
A mixture approach to clustering is an important technique in cluster analysis. A mixture of multivariate multinomial distributions is usually used to analyze categorical data with latent class model. The parameter estimation is an important step for a mixture distribution. Described here are four approaches to estimating the parameters of a mixture of multivariate multinomial distributions. The first approach is an extended maximum likelihood (ML) method. The second approach is based on the well-known expectation maximization (EM) algorithm. The third approach is the classification maximum likelihood (CML) algorithm. In this paper, we propose a new approach using the so-called fuzzy class model and then create the fuzzy classification maximum likelihood (FCML) approach for categorical data. The accuracy, robustness and effectiveness of these four types of algorithms for estimating the parameters of multivariate binomial mixtures are compared using real empirical data and samples drawn from the multivariate binomial mixtures of two classes. The results show that the proposed FCML algorithm presents better accuracy, robustness and effectiveness. Overall, the FCML algorithm has the superiority over the ML, EM and CML algorithms. Thus, we recommend FCML as another good tool for estimating the parameters of mixture multivariate multinomial models.  相似文献   

3.
In model-based cluster analysis, the expectation-maximization (EM) algorithm has a number of desirable properties, but in some situations, this algorithm can be slow to converge. Some variants are proposed to speed-up EM in reducing the time spent in the E-step, in the case of Gaussian mixture. The main aims of such methods is first to speed-up convergence of EM, and second to yield same results (or not so far) than EM itself. In this paper, we compare these methods from categorical data, with the latent class model, and we propose a new variant that sustains better results on synthetic and real data sets, in terms of convergence speed-up and number of misclassified objects.  相似文献   

4.
Joint latent class modeling of disease prevalence and high-dimensional semicontinuous biomarker data has been proposed to study the relationship between diseases and their related biomarkers. However, statistical inference of the joint latent class modeling approach has proved very challenging due to its computational complexity in seeking maximum likelihood estimates. In this article, we propose a series of composite likelihoods for maximum composite likelihood estimation, as well as an enhanced Monte Carlo expectation–maximization (MCEM) algorithm for maximum likelihood estimation, in the context of joint latent class models. Theoretically, the maximum composite likelihood estimates are consistent and asymptotically normal. Numerically, we have shown that, as compared to the MCEM algorithm that maximizes the full likelihood, not only the composite likelihood approach that is coupled with the quasi-Newton method can substantially reduce the computational complexity and duration, but it can simultaneously retain comparative estimation efficiency.  相似文献   

5.
Surveillance to detect cancer recurrence is an important part of care for cancer survivors.In this paper we discuss the design of optimal strategies for early detection of disease recurrence based on each patient’s distinct biomarker trajectory and periodically updated risk estimated in the setting of a prospective cohort study.We adopt a latent class joint model which considers a longitudinal biomarker process and an event process jointly,to address heterogeneity of patients and disease,to discover distinct biomarker trajectory patterns,to classify patients into different risk groups,and to predict the risk of disease recurrence.The model is used to develop a monitoring strategy that dynamically modifies the monitoring intervals according to patients’ current risk derived from periodically updated biomarker measurements and other indicators of disease spread.The optimal biomarker assessment time is derived using a utility function.We develop an algorithm to apply the proposed strategy to monitoring of new patients after initial treatment.We illustrate the models and the derivation of the optimal strategy using simulated data from monitoring prostate cancer recurrence over a 5-year period.  相似文献   

6.
偏t正态分布是分析尖峰,厚尾数据的重要统计工具之一.研究提出了偏t正态数据下混合线性联合位置与尺度模型,通过EM算法和Newton-Raphson方法研究了该模型参数的极大似然估计.并通过随机模拟试验验证了所提出方法的有效性.最后,结合实际数据验证了该模型和方法具有实用性和可行性.  相似文献   

7.
In this article, we propose an unbiased estimating equation approach for a two-component mixture model with correlated response data. We adapt the mixture-of-experts model and a generalized linear model for component distribution and mixing proportion, respectively. The new approach only requires marginal distributions of both component densities and latent variables. We use serial correlations from subjects’ subgroup memberships, which improves estimation efficiency and classification accuracy, and show that estimation consistency does not depend on the choice of the working correlation matrix. The proposed estimating equation is solved by an expectation-estimating-equation (EEE) algorithm. In the E-step of the EEE algorithm, we propose a joint imputation based on the conditional linear property for the multivariate Bernoulli distribution. In addition, we establish asymptotic properties for the proposed estimators and the convergence property using the EEE algorithm. Our method is compared to an existing competitive mixture model approach in both simulation studies and an election data application. Supplementary materials for this article are available online.  相似文献   

8.
This paper examines the analysis of an extended finite mixture of factor analyzers (MFA) where both the continuous latent variable (common factor) and the categorical latent variable (component label) are assumed to be influenced by the effects of fixed observed covariates. A polytomous logistic regression model is used to link the categorical latent variable to its corresponding covariate, while a traditional linear model with normal noise is used to model the effect of the covariate on the continuous latent variable. The proposed model turns out be in various ways an extension of many existing related models, and as such offers the potential to address some of the issues not fully handled by those previous models. A detailed derivation of an EM algorithm is proposed for parameter estimation, and latent variable estimates are obtained as by-products of the overall estimation procedure.  相似文献   

9.
本文将半参数线性混合效应模型推广应用到一类具有零膨胀的纵向数据或集群数据的研究中,提出了一类新的半参数混合效应模型,然后利用广义交叉核实法选取光滑参数,通过最大惩罚似然函数方法与EM算法给出了模型参数部分与非参数部分的估计方法,最后,通过模拟和实例说明了本文方法的有效性.  相似文献   

10.
In many longitudinal studies,observation times as well as censoring times may be correlated with longitudinal responses.This paper considers a multiplicative random effects model for the longitudinal response where these correlations may exist and a joint modeling approach is proposed via a shared latent variable.For inference about regression parameters,estimating equation approaches are developed and asymptotic properties of the proposed estimators are established.The finite sample behavior of the methods is examined through simulation studies and an application to a data set from a bladder cancer study is provided for illustration.  相似文献   

11.
We propose a parsimonious extension of the classical latent class model to cluster categorical data by relaxing the conditional independence assumption. Under this new mixture model, named conditional modes model (CMM), variables are grouped into conditionally independent blocks. Each block follows a parsimonious multinomial distribution where the few free parameters model the probabilities of the most likely levels, while the remaining probability mass is uniformly spread over the other levels of the block. Thus, when the conditional independence assumption holds, this model defines parsimonious versions of the standard latent class model. Moreover, when this assumption is violated, the proposed model brings out the main intra-class dependencies between variables, summarizing thus each class with relatively few characteristic levels. The model selection is carried out by an hybrid MCMC algorithm that does not require preliminary parameter estimation. Then, the maximum likelihood estimation is performed via an EM algorithm only for the best model. The model properties are illustrated on simulated data and on three real data sets by using the associated R package CoModes. The results show that this model allows to reduce biases involved by the conditional independence assumption while providing meaningful parameters.  相似文献   

12.
Latent trait models such as item response theory (IRT) hypothesize a functional relationship between an unobservable, or latent, variable and an observable outcome variable. In educational measurement, a discrete item response is usually the observable outcome variable, and the latent variable is associated with an examinee’s trait level (e.g., skill, proficiency). The link between the two variables is called an item response function. This function, defined by a set of item parameters, models the probability of observing a given item response, conditional on a specific trait level. Typically in a measurement setting, neither the item parameters nor the trait levels are known, and so must be estimated from the pattern of observed item responses. Although a maximum likelihood approach can be taken in estimating these parameters, it usually cannot be employed directly. Instead, a method of marginal maximum likelihood (MML) is utilized, via the expectation-maximization (EM) algorithm. Alternating between an expectation (E) step and a maximization (M) step, the EM algorithm assures that the marginal log likelihood function will not decrease after each EM cycle, and will converge to a local maximum. Interestingly, the negative of this marginal log likelihood function is equal to the relative entropy, or Kullback-Leibler divergence, between the conditional distribution of the latent variables given the observable variables and the joint likelihood of the latent and observable variables. With an unconstrained optimization for the M-step proposed here, the EM algorithm as minimization of Kullback-Leibler divergence admits the convergence results due to Csiszár and Tusnády (Statistics & Decisions, 1:205–237, 1984), a consequence of the binomial likelihood common to latent trait models with dichotomous response variables. For this unconstrained optimization, the EM algorithm converges to a global maximum of the marginal log likelihood function, yielding an information bound that permits a fixed point of reference against which models may be tested. A likelihood ratio test between marginal log likelihood functions obtained through constrained and unconstrained M-steps is provided as a means for testing models against this bound. Empirical examples demonstrate the approach.  相似文献   

13.
A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.  相似文献   

14.
Latent tree models were proposed as a class of models for unsupervised learning, and have been applied to various problems such as clustering and density estimation. In this paper, we study the usefulness of latent tree models in another paradigm, namely supervised learning. We propose a novel generative classifier called latent tree classifier (LTC). An LTC represents each class-conditional distribution of attributes using a latent tree model, and uses Bayes rule to make prediction. Latent tree models can capture complex relationship among attributes. Therefore, LTC is able to approximate the true distribution behind data well and thus achieves good classification accuracy. We present an algorithm for learning LTC and empirically evaluate it on an extensive collection of UCI data. The results show that LTC compares favorably to the state-of-the-art in terms of classification accuracy. We also demonstrate that LTC can reveal underlying concepts and discover interesting subgroups within each class.  相似文献   

15.
基于改进的Cholesky分解,研究分析了纵向数据下半参数联合均值协方差模型的贝叶斯估计和贝叶斯统计诊断,其中非参数部分采用B样条逼近.主要通过应用Gibbs抽样和Metropolis-Hastings算法相结合的混合算法获得模型中未知参数的贝叶斯估计和贝叶斯数据删除影响诊断统计量.并利用诊断统计量的大小来识别数据的异常点.模拟研究和实例分析都表明提出的贝叶斯估计和诊断方法是可行有效的.  相似文献   

16.
We describe an extension of the hidden Markov model in which the manifest process conditionally follows a partition model. The assumption of local independence for the manifest random variable is thus relaxed to arbitrary dependence. The proposed class generalizes different existing models for discrete and continuous time series, and allows for the finest trading off between bias and variance. The models are fit through an EM algorithm, with the usual recursions for hidden Markov models extended at no additional computational cost.  相似文献   

17.
Semiparametric transformation models provide a class of flexible models for regression analysis of failure time data. Several authors have discussed them under different situations when covariates are timeindependent (Chen et al., 2002; Cheng et al., 1995; Fine et al., 1998). In this paper, we consider fitting these models to right-censored data when covariates are time-dependent longitudinal variables and, furthermore, may suffer measurement errors. For estimation, we investigate the maximum likelihood approach, and an EM algorithm is developed. Simulation results show that the proposed method is appropriate for practical application, and an illustrative example is provided.  相似文献   

18.
The cluster-weighted model (CWM) is a mixture model with random covariates that allows for flexible clustering/classification and distribution estimation of a random vector composed of a response variable and a set of covariates. Within this class of models, the generalized linear exponential CWM is here introduced especially for modeling bivariate data of mixed-type. Its natural counterpart in the family of latent class models is also defined. Maximum likelihood parameter estimates are derived using the expectation-maximization algorithm and some computational issues are detailed. Through Monte Carlo experiments, the classification performance of the proposed model is compared with other mixture-based approaches, consistency of the estimators of the regression coefficients is evaluated, and several likelihood-based information criteria are compared for selecting the number of mixture components. An application to real data is also finally considered.  相似文献   

19.
For semiparametric survival models with interval-censored data and a cure fraction, it is often difficult to derive nonparametric maximum likelihood estimation due to the challenge in maximizing the complex likelihood function. In this article, we propose a computationally efficient EM algorithm, facilitated by a gamma-Poisson data augmentation, for maximum likelihood estimation in a class of generalized odds rate mixture cure (GORMC) models with interval-censored data. The gamma-Poisson data augmentation greatly simplifies the EM estimation and enhances the convergence speed of the EM algorithm. The empirical properties of the proposed method are examined through extensive simulation studies and compared with numerical maximum likelihood estimates. An R package “GORCure” is developed to implement the proposed method and its use is illustrated by an application to the Aerobic Center Longitudinal Study dataset. Supplementary material for this article is available online.  相似文献   

20.
A cured model is a useful approach for analysing failure time data in which some subjects could eventually experience and others never experience the event of interest. All subjects in the test belong to one of the two groups: the susceptible group and the non-susceptible group. There has been considerable progress in the development of semi-parametric models for regression analysis of time-to-event data. However, most of the current work focuses on right-censored data, especially when the population contains a non-ignorable cured subgroup. In this paper, we propose a semi-parametric cure model for current status data. In general, treatments are developed to both increase the patients' chances of being cured and prolong the survival time among non-cured patients. A logistic regression model is proposed for whether the subject is in the susceptible group. An accelerated failure time regression model is proposed for the event time when the subject is in the non-susceptible group. An EM algorithm is used to maximize the log-likelihood of the observed data. Simulation results show that the proposed method can get efficient estimations.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号