Similar Documents
20 similar documents retrieved.
1.
The mixture approach to clustering is an important technique in cluster analysis. A mixture of multivariate multinomial distributions is commonly used to analyze categorical data with a latent class model, and parameter estimation is a key step in fitting such a mixture. Described here are four approaches to estimating the parameters of a mixture of multivariate multinomial distributions. The first is an extended maximum likelihood (ML) method; the second is based on the well-known expectation-maximization (EM) algorithm; the third is the classification maximum likelihood (CML) algorithm. In this paper, we propose a new approach based on the so-called fuzzy class model, yielding the fuzzy classification maximum likelihood (FCML) approach for categorical data. The accuracy, robustness and effectiveness of the four algorithms in estimating the parameters of multivariate binomial mixtures are compared using real empirical data and samples drawn from two-class multivariate binomial mixtures. The results show that the proposed FCML algorithm attains better accuracy, robustness and effectiveness, and is overall superior to the ML, EM and CML algorithms. We therefore recommend FCML as another good tool for estimating the parameters of multivariate multinomial mixture models.
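A minimal Python/NumPy sketch of the EM baseline for the binary special case discussed above (a two-class multivariate Bernoulli mixture). It illustrates the standard EM approach only; the proposed FCML variant is not reproduced here.

    import numpy as np

    def em_bernoulli_mixture(X, K=2, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(K, 1.0 / K)                  # mixing proportions
        theta = rng.uniform(0.25, 0.75, (K, d))   # per-class success probabilities
        for _ in range(n_iter):
            # E-step: responsibilities from log densities (numerically stable)
            log_p = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                     + np.log(pi))
            log_p -= log_p.max(axis=1, keepdims=True)
            r = np.exp(log_p)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: update weights and success probabilities
            nk = r.sum(axis=0)
            pi = nk / n
            theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
        return pi, theta, r

    # toy data: two latent classes with distinct response profiles
    rng = np.random.default_rng(1)
    Z = rng.integers(0, 2, 300)
    X = (rng.random((300, 5)) < np.where(Z[:, None] == 0, 0.8, 0.2)).astype(float)
    pi, theta, r = em_bernoulli_mixture(X)
    print(pi, theta.round(2))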

2.
Finite mixture models are used in many applications to fit data from heterogeneous populations. The expectation-maximization (EM) algorithm is the most popular method for estimating the parameters of a finite mixture model; a Bayesian approach is another option. However, the EM algorithm often converges to a local maximum and is sensitive to the choice of starting point, while in the Bayesian approach the Markov chain Monte Carlo (MCMC) sampler can become trapped in a local mode and have difficulty moving to another. Hence, in this paper we propose a new method that addresses this limitation of EM, so that EM can estimate the parameters at the global maximum region, and we develop a more effective Bayesian approach in which the MCMC chain moves between modes of the mixture model more easily. Our approach combines simulated annealing (SA) with adaptive rejection Metropolis sampling (ARMS). Although SA is a well-known approach for detecting distinct modes, its limitation is the difficulty of choosing a sequence of proper proposal distributions for a target distribution. Since ARMS uses a piecewise linear envelope function as its proposal distribution, we incorporate ARMS into SA so that we can start from a more suitable proposal distribution and detect separate modes. As a result, we can find the global maximum region and estimate the parameters there. We refer to this approach as ARMS annealing. Combining ARMS annealing with the EM algorithm and with the Bayesian approach yields two procedures: an EM-ARMS annealing algorithm and a Bayesian-ARMS annealing approach. Simulations show that the two approaches are comparable to each other and both outperform the EM algorithm alone and the Bayesian approach alone: they locate the global maximum region well and estimate the parameters within it. We demonstrate the advantage of our approaches on a mixture of two Poisson regression models, used to analyze survey data on the number of charitable donations.
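The ARMS annealing algorithm itself is involved; as a minimal sketch of the underlying annealing idea, the deterministic-annealing variant of EM below tempers the responsibilities of a two-component Poisson mixture with a cooling schedule, which helps EM escape poor local maxima. This is an illustration only, not the paper's ARMS annealing procedure.

    import numpy as np
    from scipy.stats import poisson

    def da_em_poisson(y, K=2, temps=(8.0, 4.0, 2.0, 1.0), iters=50, seed=0):
        rng = np.random.default_rng(seed)
        lam = rng.uniform(y.mean() * 0.5, y.mean() * 1.5, K)  # Poisson rates
        pi = np.full(K, 1.0 / K)
        for T in temps:                       # cool toward the untempered model
            for _ in range(iters):
                log_p = poisson.logpmf(y[:, None], lam) + np.log(pi)
                log_p /= T                    # flatten responsibilities at high T
                log_p -= log_p.max(axis=1, keepdims=True)
                r = np.exp(log_p)
                r /= r.sum(axis=1, keepdims=True)
                nk = r.sum(axis=0)
                pi = nk / len(y)
                lam = (r * y[:, None]).sum(axis=0) / nk
        return pi, lam

    y = np.concatenate([np.random.default_rng(2).poisson(2, 200),
                        np.random.default_rng(3).poisson(9, 100)])
    print(da_em_poisson(y))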

3.
Mixtures of linear mixed models (MLMMs) are useful for clustering grouped data and can be estimated by likelihood maximization through the Expectation-Maximization algorithm. A suitable number of components is then conventionally determined by comparing different mixture models using penalized log-likelihood criteria such as the Bayesian information criterion. We propose fitting MLMMs with variational methods, which can perform parameter estimation and model selection simultaneously. We describe a variational approximation for MLMMs where the variational lower bound is in closed form, allowing fast evaluation, and we develop a novel variational greedy algorithm for model selection and learning of the mixture components. This approach handles algorithm initialization and returns a plausible number of mixture components automatically. In cases of weak identifiability of certain model parameters, we use hierarchical centering to reparameterize the model and show empirically that there is a gain in efficiency in variational algorithms similar to that in Markov chain Monte Carlo (MCMC) algorithms. Related to this, we prove that the approximate rate of convergence of variational algorithms by Gaussian approximation is equal to that of the corresponding Gibbs sampler, which suggests that reparameterizations can lead to improved convergence in variational algorithms just as in MCMC algorithms. Supplementary materials for the article are available online.
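The paper's variational greedy algorithm for MLMMs is not publicly packaged, but the same idea — variational inference that estimates parameters and effectively selects the number of components in one run — can be illustrated with scikit-learn's BayesianGaussianMixture on a plain Gaussian mixture:

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-3, 1, (150, 2)), rng.normal(3, 1, (150, 2))])

    bgm = BayesianGaussianMixture(
        n_components=10,                                  # deliberately too many
        weight_concentration_prior_type="dirichlet_process",
        max_iter=500, random_state=0).fit(X)
    # Components with negligible weight are effectively pruned:
    print(np.round(bgm.weights_, 3))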

4.
Mixture models have become one of the most popular techniques in data analysis; because they rest on an explicit mathematical model, they usually produce more accurate results than traditional clustering methods. The key factor in a mixture model is the number of subpopulations, which determines the final result of the analysis. The expectation-maximization (EM) algorithm — an iterative method for computing maximum likelihood estimates from incomplete data or data with missing values — is commonly used for parameter estimation in mixture models, as well as in machine learning and clustering more broadly. Researchers typically use AIC or BIC to determine the number of subpopulations, but in practice these two criteria can be unstable and may even give wrong answers. To address this problem, this paper proposes a new method that determines the number of subpopulations in a mixture model from a scree plot of the likelihood function. Experiments show that, in most well-behaved settings, the scree-plot method selects the same number of subpopulations as AIC and BIC, while on typical real data, or when conditions are less than ideal, it gives more reliable results. We then apply the new method to parameter estimation for the Old Faithful geyser data from Yellowstone National Park.
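A minimal sketch of the scree-plot idea: fit Gaussian mixtures over a range of component counts, plot the maximized log-likelihood against K, and look for the "elbow", alongside AIC/BIC for comparison. Synthetic bimodal data stand in here for the Old Faithful measurements.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.concatenate([rng.normal(2.0, 0.3, 150),
                        rng.normal(4.5, 0.4, 150)]).reshape(-1, 1)

    ks = range(1, 8)
    ll, aic, bic = [], [], []
    for k in ks:
        gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        ll.append(gm.score(X) * len(X))   # total log-likelihood
        aic.append(gm.aic(X))
        bic.append(gm.bic(X))

    plt.plot(ks, ll, "o-", label="log-likelihood (scree)")
    plt.xlabel("number of components K")
    plt.legend()
    plt.show()
    print("BIC-chosen K:", ks[int(np.argmin(bic))])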

5.
For several years, model-based clustering methods have successfully tackled many of the challenges faced by data analysts. However, as the scope of data analysis has evolved, some problems lie beyond the standard mixture model framework. One such problem arises when observations in a dataset come from overlapping clusters, so that different clusters possess similar parameters for multiple variables. In this setting, mixed membership models — a soft clustering approach in which observations are not restricted to single cluster membership — have proved to be an effective tool. In this paper, a method for fitting mixed membership models to data generated by a member of an exponential family is outlined. The method is applied to count data obtained from an ultra running competition and compared with a standard mixture model approach.

6.
To integrate economic considerations into management decisions in ecosystem frameworks, we need to build models that capture observed system dynamics and incorporate existing knowledge of ecosystems, while at the same time accommodating economic analysis. The main constraint for models to serve in economic analysis is dimensionality. In addition, to apply in long-term management analysis, models should be stable in terms of adjustments to new observations. We use the ensemble Kalman filter to fit relatively simple models to ecosystem or foodweb data and estimate parameters that are stable over the observed variability in the data. The filter also provides a lower bound on the noise terms that a stochastic analysis requires. In this paper, we apply the filter to model the main interactions in the Barents Sea ecosystem. In a comparison, our method outperforms a regression-based approach.
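A minimal sketch of the generic ensemble Kalman filter analysis step with perturbed observations — the building block such a fit relies on; the Barents Sea foodweb model itself is not reproduced here.

    import numpy as np

    def enkf_update(E, y, H, R, rng):
        """E: (n_state, n_ens) forecast ensemble; y: observation vector;
        H: observation operator matrix; R: observation error covariance."""
        n_ens = E.shape[1]
        A = E - E.mean(axis=1, keepdims=True)            # ensemble anomalies
        P_HT = A @ (H @ A).T / (n_ens - 1)               # cov(x, Hx)
        S = (H @ A) @ (H @ A).T / (n_ens - 1) + R        # innovation covariance
        K = P_HT @ np.linalg.inv(S)                      # Kalman gain
        Y = y[:, None] + rng.multivariate_normal(
            np.zeros(len(y)), R, n_ens).T                # perturbed observations
        return E + K @ (Y - H @ E)                       # analysis ensemble

    rng = np.random.default_rng(0)
    E = rng.normal(1.0, 0.5, (2, 50))          # 2 state variables, 50 members
    H = np.eye(2)
    R = 0.1 * np.eye(2)
    print(enkf_update(E, np.array([1.2, 0.8]), H, R, rng).mean(axis=1))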

7.
Cluster analysis is an unsupervised learning technique for partitioning objects into several clusters. Assuming that noisy objects are present, we propose a soft clustering method which assigns objects that differ significantly from noise into one of a specified number of clusters, controlling decision errors through multiple testing. The parameters of the Gaussian mixture model are estimated with the EM algorithm. Using the estimated probability density function, we formulate a multiple hypothesis test for the clustering problem, and the positive false discovery rate (pFDR) is calculated as our decision error. The proposed procedure simultaneously classifies objects as significant data or as noise according to the specified target pFDR level. When applied to real and artificial data sets, it controlled the target pFDR reasonably well while offering satisfactory clustering performance.
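A minimal sketch of this style of decision rule, under the simplifying assumption that the widest fitted component plays the role of noise: each object's posterior noise probability acts as a local false discovery rate, and objects are declared significant only while the running pFDR estimate stays below the target level. This is an illustration, not the paper's exact procedure.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.concatenate([rng.normal(0, 3, 300),       # diffuse noise
                        rng.normal(8, 0.5, 100)]).reshape(-1, 1)

    gm = GaussianMixture(n_components=2, random_state=0).fit(X)
    noise_k = int(np.argmax(gm.covariances_.ravel()))  # widest component = noise
    local_fdr = gm.predict_proba(X)[:, noise_k]

    order = np.argsort(local_fdr)                      # most signal-like first
    pfdr_est = np.cumsum(local_fdr[order]) / np.arange(1, len(X) + 1)
    target = 0.05
    n_signif = int(np.sum(pfdr_est <= target))
    print(f"{n_signif} objects declared significant at pFDR <= {target}")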

8.
In this article, we propose an improvement on the sequential updating and greedy search (SUGS) algorithm for fast fitting of Dirichlet process mixture models. The SUGS algorithm provides very fast approximate Bayesian inference for mixture data, which is particularly useful when datasets are so large that many standard Markov chain Monte Carlo (MCMC) algorithms cannot be applied efficiently, or take a prohibitively long time to converge. In particular, these ideas are used to initially interrogate the data and to refine models, so that exact data analysis can potentially be applied later on. SUGS relies on sequentially allocating data to clusters and then updating the posterior on the subsequent allocations and parameters under the assumption that each allocation is correct. Our modification softens this approach by providing a probability distribution over allocations, at a similar computational cost; this approach can be interpreted as a variational Bayes procedure, and hence we term it variational SUGS (VSUGS). Simulated examples show that VSUGS can outperform a version of the SUGS algorithm in terms of density estimation and classification in many scenarios. In addition, we present an analysis of flow cytometry data, and of SNP data via a three-class Dirichlet process mixture model, illustrating the apparent improvement over the original SUGS algorithm.
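A minimal sketch contrasting hard (SUGS-style) and softened (VSUGS-style) sequential allocation, for a 1-D Dirichlet process mixture of Gaussians with known variance and a conjugate normal prior on the means. Only the allocation step is illustrated; the full posterior refinement is not reproduced.

    import numpy as np
    from scipy.stats import norm

    def sequential_allocate(y, alpha=1.0, sigma2=1.0, tau2=25.0, soft=True):
        n_k, s_k = [], []                        # fractional counts and sums
        for yi in y:
            dens = []
            for nk, sk in zip(n_k, s_k):         # predictive for each cluster
                v = 1.0 / (1.0 / tau2 + nk / sigma2)   # posterior variance of mean
                m = v * sk / sigma2                    # posterior mean
                dens.append(nk * norm.pdf(yi, m, np.sqrt(sigma2 + v)))
            dens.append(alpha * norm.pdf(yi, 0.0, np.sqrt(sigma2 + tau2)))
            q = np.array(dens) / np.sum(dens)
            if not soft:                         # SUGS: greedy, winner takes all
                q = (q == q.max()).astype(float)
            n_k.append(0.0)                      # slot for a possible new cluster
            s_k.append(0.0)
            for k, qk in enumerate(q):           # VSUGS: spread fractional mass
                n_k[k] += qk
                s_k[k] += qk * yi
        return [round(nk, 1) for nk in n_k if nk > 0.5]

    y = np.concatenate([np.random.default_rng(0).normal(-4, 1, 100),
                        np.random.default_rng(1).normal(4, 1, 100)])
    print("cluster sizes (soft):", sequential_allocate(y))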

9.
This paper develops a Bayesian approach to analyzing quantile regression models for censored dynamic panel data. We employ a likelihood-based approach using the asymmetric Laplace error distribution and introduce lagged observed responses into the conditional quantile function. We also deal with the initial conditions problem in dynamic panel data models by introducing correlated random effects into the model. For posterior inference, we propose a Gibbs sampling algorithm based on a location-scale mixture representation of the asymmetric Laplace distribution. This mixture representation yields fully tractable conditional posterior densities and considerably simplifies existing estimation procedures for quantile regression models. In addition, we explain how the proposed Gibbs sampler can be used for calculating the marginal likelihood and for modal estimation. Our approach is illustrated with real data on medical expenditures.
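A minimal sketch of the location-scale mixture representation the Gibbs sampler exploits: with z ~ Exp(1) and u ~ N(0, 1), the variable mu + theta*z + psi*sqrt(z)*u follows an asymmetric Laplace law whose p-th quantile equals mu, where theta = (1-2p)/(p(1-p)) and psi^2 = 2/(p(1-p)). The simulation below checks this quantile property empirically.

    import numpy as np

    p, mu, n = 0.25, 1.5, 200_000
    theta = (1 - 2 * p) / (p * (1 - p))
    psi = np.sqrt(2 / (p * (1 - p)))

    rng = np.random.default_rng(0)
    z = rng.exponential(1.0, n)      # exponential mixing variable
    u = rng.normal(0.0, 1.0, n)      # standard normal
    y = mu + theta * z + psi * np.sqrt(z) * u

    print(np.mean(y <= mu))          # should be close to p = 0.25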

10.
In many applications, covariates can be naturally grouped. For example, in gene expression data analysis, genes belonging to the same pathway might be viewed as a group. This paper studies the variable selection problem for censored survival data in the additive hazards model when covariates are grouped. A hierarchical regularization method is proposed to simultaneously estimate parameters and select important variables at both the group level and the within-group level. For situations in which the number of parameters tends to ∞ as the sample size increases, we establish an oracle property and an asymptotic normality property of the proposed estimators. Numerical results indicate that the hierarchically penalized method performs better than some existing methods such as the lasso, smoothly clipped absolute deviation (SCAD), and the adaptive lasso.
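A minimal sketch of the group-level shrinkage at the heart of such penalties: the proximal operator of the group lasso shrinks each coefficient group toward zero as a block, removing whole groups (e.g., whole pathways) whose joint signal is weak. The paper's hierarchical penalty additionally selects within groups; only the group-level operation is shown here.

    import numpy as np

    def group_soft_threshold(beta, groups, lam):
        """Block soft-thresholding: beta_g <- max(0, 1 - lam/||beta_g||) * beta_g."""
        out = np.zeros_like(beta)
        for g in set(groups):
            idx = np.flatnonzero(np.asarray(groups) == g)
            norm_g = np.linalg.norm(beta[idx])
            if norm_g > lam:
                out[idx] = (1 - lam / norm_g) * beta[idx]
        return out

    beta = np.array([0.9, -0.8, 0.05, -0.03, 1.5])
    groups = [0, 0, 1, 1, 2]          # three groups of covariates
    print(group_soft_threshold(beta, groups, lam=0.3))  # group 1 is zeroed out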

11.
The hybrid bootstrap uses resampling ideas to extend the duality approach to interval estimation for a parameter of interest in the presence of nuisance parameters. The confidence region constructed by the hybrid bootstrap may perform much better than the parametric bootstrap region in situations where the data provide substantial information about the nuisance parameter but limited information about the parameter of interest. We apply this method to estimate the location of quantitative trait loci (QTL) in the interval mapping model. The conditional distribution of quantitative traits, given flanking genetic marker genotypes, is often assumed to be a mixture of two phenotype distributions. The mixing proportions in the model represent the recombination rate between a genetic marker and the QTL and thus provide information about the unknown location of the QTL. Since recombination events are unlikely, we have less information about the location of the QTL than about the other parameters. This observation makes a hybrid approach to interval estimation for QTL appealing, especially since the necessary distribution theory, which is often a challenge for mixture models, can be handled by bootstrap simulation.

12.
Empirical-likelihood-based inference for the parameters in a partially linear single-index model with randomly censored data is investigated. We introduce an estimated empirical likelihood for the parameters using a synthetic data approach and show that its limiting distribution is a mixture of central chi-squared distributions. To overcome this difficulty we propose an adjusted empirical likelihood that achieves the standard χ2 limit. Furthermore, since the index has norm 1, we use this constraint to reduce the dimension of the parameters, which improves the accuracy of the confidence regions. A simulation study is carried out to compare the finite-sample properties of the method with those of an existing method, and an application to a real data set is illustrated.

13.
We analyze the reliability of NASA composite pressure vessels using a new Bayesian semiparametric model. The data set consists of lifetimes of pressure vessels, wrapped with a Kevlar fiber and grouped by spool, subject to different stress levels; 10% of the data are right censored. The model we consider is a regression on the log-scale for the lifetimes, with fixed (stress) and random (spool) effects. The prior on the spool parameters is nonparametric: they are a sample from a normalized generalized gamma process, which encompasses the well-known Dirichlet process. The nonparametric prior is intended to robustify inferences against misspecification of a parametric prior. This choice of likelihood and prior yields a new Bayesian model in reliability analysis. Via a Bayesian hierarchical approach, it is easy to analyze the reliability of the Kevlar fiber by predicting quantiles of the failure time when a new spool is selected at random from the population of spools. Moreover, for comparative purposes, we review the most interesting frequentist and Bayesian models that have analyzed this data set. Our credibility intervals for the quantiles of interest for a new random spool are narrower than those derived in the previous Bayesian parametric literature, although the predictive goodness-of-fit performances are similar. Finally, as an original feature of our model, the discreteness of the random-effects distribution allows us to cluster the spools into three distinct groups.

14.
We propose a parsimonious extension of the classical latent class model to cluster categorical data by relaxing the conditional independence assumption. Under this new mixture model, named the conditional modes model (CMM), variables are grouped into conditionally independent blocks. Each block follows a parsimonious multinomial distribution in which a few free parameters model the probabilities of the most likely levels, while the remaining probability mass is spread uniformly over the other levels of the block. Thus, when the conditional independence assumption holds, this model defines parsimonious versions of the standard latent class model; when the assumption is violated, the proposed model brings out the main intra-class dependencies between variables, summarizing each class with relatively few characteristic levels. Model selection is carried out by a hybrid MCMC algorithm that does not require preliminary parameter estimation; maximum likelihood estimation is then performed via an EM algorithm only for the best model. The model properties are illustrated on simulated data and on three real data sets using the associated R package CoModes. The results show that this model reduces the biases induced by the conditional independence assumption while providing meaningful parameters.
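A minimal sketch of the parsimonious multinomial block described above: a few "mode" levels carry free probabilities, and the remaining mass is spread uniformly over all other levels of the block.

    import numpy as np

    def modes_pmf(n_levels, mode_levels, mode_probs):
        """pmf with free probabilities on the mode levels and a uniform
        remainder on the other n_levels - len(mode_levels) levels."""
        pmf = np.full(n_levels, (1.0 - sum(mode_probs)) /
                      (n_levels - len(mode_levels)))
        pmf[list(mode_levels)] = mode_probs
        return pmf

    # 8 levels, two characteristic modes holding 85% of the mass:
    print(modes_pmf(8, mode_levels=[2, 5], mode_probs=[0.6, 0.25]))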

15.
For clustering objects, we often collect not only continuous variables but binary attributes as well. This paper proposes a model-based clustering approach with mixed binary and continuous variables in which each binary attribute is generated by a latent continuous variable dichotomized at a suitable threshold value, and the scores of the latent variables are estimated from the binary data. In economics, such variables are called utility functions, and the assumption is that the binary attributes (the presence or absence of a public service or utility) are determined by low and high values of these functions. In genetics, the latent response is interpreted as the "liability" to develop a qualitative trait or phenotype. The estimated scores of the latent variables, together with the observed continuous ones, make it possible to use a multivariate Gaussian mixture model for clustering instead of a mixture of discrete and continuous distributions. After describing the method, this paper presents results on both simulated and real data and compares the performance of the multivariate Gaussian mixture model with that of a mixture of joint multivariate and multinomial distributions. Results show that the former model outperforms the mixture model for variables with different scales, both in terms of classification error rate and in the reproduction of the cluster means.
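A minimal sketch of the latent-variable view described above: a binary attribute is generated by thresholding a latent continuous score, and clustering then operates on the latent scores together with the observed continuous variables via a Gaussian mixture. For simplicity the true latent scores are used here; the paper estimates them from the binary data.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    cls = rng.integers(0, 2, 400)                       # true cluster labels
    latent = rng.normal(np.where(cls == 0, -1.0, 1.0))  # latent "utility" scores
    binary = (latent > 0.0).astype(int)                 # observed dichotomization
    cont = rng.normal(np.where(cls == 0, 0.0, 3.0))     # observed continuous var

    X = np.column_stack([latent, cont])  # latent scores replace the raw 0/1s
    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
    print("agreement:", max(np.mean(labels == cls), np.mean(labels != cls)))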

16.
This paper proposes a Metropolis-Hastings algorithm, based on Markov chain Monte Carlo sampling, to estimate the parameters of the Abe-Ley distribution, a recently proposed Weibull-Sine-Skewed-von Mises mixture model for bivariate circular-linear data. The current literature estimates the parameters of these mixture models with the expectation-maximization method, but we show that this has shortcomings for the considered mixture model. First, standard expectation-maximization does not guarantee convergence to a global optimum, because the likelihood is multi-modal, a consequence of the high dimensionality of the mixture likelihood. Second, since expectation-maximization provides only point estimates of the parameters, the uncertainties of the estimates (e.g., confidence intervals) are not directly available and require extra calculations. We propose a Metropolis-Hastings based algorithm that avoids both shortcomings: Metropolis-Hastings approximates the complete posterior distribution by sampling from the joint posterior of the mixture parameters, which enables direct inference about uncertainty and multi-modality. In developing the algorithm, we tackle various challenges including convergence speed, label switching, and selecting the optimal number of mixture components. We then (i) verify the effectiveness of the proposed algorithm on sample datasets with known true parameters, (ii) validate the methodology on an environmental dataset (a traditional application domain of Abe-Ley mixtures, where measurements are a function of direction), and (iii) demonstrate the usefulness of the approach in an application domain where the circular measurement is periodic in time.
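A minimal random-walk Metropolis-Hastings sketch for the posterior of the two component means of a Gaussian mixture (known weights and variance, flat priors). It is not the paper's sampler for the Abe-Ley model, but it shows how MH yields a full posterior sample — and hence direct uncertainty assessment — rather than a point estimate.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])

    def log_post(mu):   # flat prior => posterior proportional to likelihood
        return np.sum(np.log(0.5 * norm.pdf(y[:, None], mu, 1.0).sum(axis=1)))

    mu = np.array([0.0, 1.0])
    samples, lp = [], log_post(mu)
    for _ in range(5000):
        prop = mu + rng.normal(0, 0.15, 2)           # random-walk proposal
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:      # accept/reject step
            mu, lp = prop, lp_prop
        samples.append(mu.copy())

    samples = np.array(samples)[1000:]               # drop burn-in
    print("posterior means:", samples.mean(axis=0).round(2))
    print("posterior sds:  ", samples.std(axis=0).round(2))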

17.
The paper introduces a methodology for visualizing, on a dimension-reduced subspace, the classification structure and the geometric characteristics induced by an estimated Gaussian mixture model for discriminant analysis. In particular, we consider mixtures of mixture models with varying parametrizations, which allow for parsimonious models. The approach extends existing work on reducing dimensionality for model-based clustering with Gaussian mixtures. Information on the dimension-reduction subspace is provided by the variation in class locations and, depending on the estimated mixture model, by the variation in class dispersions. Projections along the estimated directions provide summary plots that help visualize the structure of the classes and their characteristics. A suitable modification of the method recovers the most discriminant directions, i.e., those showing maximal separation among classes. The approach is illustrated using simulated and real data.
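The paper's subspace is built from variation in class locations and dispersions under a fitted Gaussian mixture; as a rough classical stand-in using class locations only, Fisher's linear discriminant directions can be computed and plotted with scikit-learn. A minimal sketch on the iris data:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)
    Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
    plt.scatter(Z[:, 0], Z[:, 1], c=y)   # classes along max-separation axes
    plt.xlabel("LD1")
    plt.ylabel("LD2")
    plt.show()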

18.
About Earthquake Forecasting by Markov Renewal Processes
We propose and validate a new method for the evaluation of seismic hazard. In particular, our aim is to model large earthquakes consistently with the underlying geophysics. We therefore propose a non-Poisson model, which takes occurrence history into account, improved with some physical constraints. Among the prevalent non-Poisson models, we chose the Markov renewal process, which is expected to capture the main characteristics while remaining simple to analyze. However, owing to the physical constraints we introduce, our process differs significantly from others already presented in the literature. A mixture of exponential and Weibull distributions is proposed for the waiting times, and its parameters are estimated by maximum likelihood. We validated our model using data on high-severity earthquakes that occurred in Turkey during the 20th century. Our results show good agreement with the real events.
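A minimal sketch of fitting the exponential + Weibull mixture for waiting times by numerical maximum likelihood, on synthetic data (the Turkish earthquake catalogue itself is not reproduced here).

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import expon, weibull_min

    rng = np.random.default_rng(0)
    t = np.concatenate([rng.exponential(2.0, 300),
                        rng.weibull(2.5, 200) * 10.0])   # mixed waiting times

    def neg_loglik(params):
        w = 1.0 / (1.0 + np.exp(-params[0]))             # mixing weight in (0,1)
        lam, shape, scale = np.exp(params[1:])           # positive parameters
        dens = (w * expon.pdf(t, scale=lam)
                + (1 - w) * weibull_min.pdf(t, shape, scale=scale))
        return -np.sum(np.log(dens + 1e-300))

    res = minimize(neg_loglik, x0=[0.0, 0.0, 0.0, 1.0], method="Nelder-Mead",
                   options={"maxiter": 5000})
    w = 1.0 / (1.0 + np.exp(-res.x[0]))
    print("weight:", round(w, 2),
          "| exp scale, weibull (shape, scale):", np.exp(res.x[1:]).round(2))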

19.
This article proposes a probability model for k-dimensional ordinal outcomes, that is, it considers inference for data recorded in k-dimensional contingency tables with ordinal factors. The proposed approach is based on full posterior inference, assuming a flexible underlying prior probability model for the contingency table cell probabilities. We use a variation of the traditional multivariate probit model, with latent scores that determine the observed data. In our model, a mixture of normals prior replaces the usual single multivariate normal model for the latent variables. By augmenting the prior model to a mixture of normals we generalize inference in two important ways. First, we allow for varying local dependence structure across the contingency table. Second, inference in ordinal multivariate probit models is plagued by problems related to the choice and resampling of cutoffs defined for these latent variables. We show how the proposed mixture model approach entirely removes these problems. We illustrate the methodology with two examples: a simulated dataset and a dataset of interrater agreement.

20.
Tweedie's compound Poisson model is a popular method for modeling data with probability mass at zero and a nonnegative, highly right-skewed distribution. Motivated by the wide applications of the Tweedie model in fields such as actuarial science, we investigate the grouped elastic net method for the Tweedie model in the context of the generalized linear model. To efficiently compute the estimated coefficients, we devise a two-layer algorithm that embeds the blockwise majorization descent method into an iteratively reweighted least squares strategy. Integrated with the strong rule, the proposed algorithm is implemented in an easy-to-use R package, HDtweedie, and is shown to compute the whole solution path very efficiently. Simulations are conducted to study the variable selection and model fitting performance of various lasso methods for the Tweedie model. The modeling applications in risk segmentation of insurance business are illustrated by analysis of an auto insurance claim dataset. Supplementary materials for this article are available online.
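R users can fit the paper's grouped elastic net directly with the HDtweedie package. As a rough Python counterpart (ridge-penalized rather than grouped elastic net), scikit-learn's TweedieRegressor fits the same compound Poisson GLM for zero-inflated, right-skewed responses; the data below are synthetic stand-ins for claim amounts.

    import numpy as np
    from sklearn.linear_model import TweedieRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    # synthetic claim amounts: many exact zeros, skewed positive values
    mean = np.exp(0.5 * X[:, 0] - 0.3 * X[:, 1])
    y = rng.poisson(mean) * rng.gamma(2.0, 1.0, 500)

    # power in (1, 2) selects the compound Poisson-gamma family
    model = TweedieRegressor(power=1.5, alpha=0.1, link="log").fit(X, y)
    print(model.coef_.round(2))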
