Similar Articles
20 similar articles found
1.
Latent class analysis (LCA) for categorical data is a model-based clustering and classification technique applied in a wide range of fields including the social sciences, machine learning, psychiatry, public health, and epidemiology. Its central assumption is conditional independence of the indicators given the latent class, i.e. “local independence”; violations can appear as model misfit, often leading LCA practitioners to increase the number of classes. However, when not all of the local dependence is of substantive scientific interest, this leaves two options, both problematic: modeling uninterpretable classes, or retaining a lower number of substantive classes but incurring bias in the final results and classifications of interest because of the remaining assumption violations. This paper suggests an alternative procedure, applicable when the number of substantive classes is known in advance or when substantive interest is otherwise well defined. In such cases, I suggest modeling substantive local dependencies as additional discrete latent variables, while absorbing nuisance dependencies in additional parameters. An example application, estimating misclassification and turnover rates in the decision to vote in elections among 9510 Dutch residents, demonstrates the advantages of this procedure relative to increasing the number of classes.
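To make the central assumption concrete: under local independence, the probability of a response pattern factorizes over the indicators within each class. A minimal Python sketch, with hypothetical parameter values (not taken from the paper):

```python
import numpy as np

# Hypothetical 2-class LCA for 4 binary indicators.
class_weights = np.array([0.6, 0.4])           # P(class = c)
item_probs = np.array([                        # P(y_j = 1 | class = c), rows = classes
    [0.9, 0.8, 0.7, 0.6],
    [0.2, 0.3, 0.1, 0.4],
])

def pattern_probability(y):
    """P(y) under local independence: sum_c P(c) * prod_j P(y_j | c)."""
    cond = item_probs ** y * (1 - item_probs) ** (1 - y)   # per-item, per-class
    return float(class_weights @ cond.prod(axis=1))

def posterior_class(y):
    """P(class = c | y), the quantity used for classification."""
    cond = (item_probs ** y * (1 - item_probs) ** (1 - y)).prod(axis=1)
    joint = class_weights * cond
    return joint / joint.sum()

y = np.array([1, 1, 0, 1])
print(pattern_probability(y), posterior_class(y))
```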

2.
Binary data latent class analysis is a form of model-based clustering applied in a wide range of fields. A central assumption of this model is that of conditional independence of responses given latent class membership, often referred to as the “local independence” assumption. The results of latent class analysis may be severely biased when this crucial assumption is violated; investigating the degree to which bivariate relationships between observed variables fit this hypothesis therefore provides vital information. This article evaluates three methods of doing so. The first is the commonly applied method of referring the so-called “bivariate residuals” to a Chi-square distribution. We also introduce two alternative methods that are novel to the investigation of local dependence in latent class analysis: bootstrapping the bivariate residuals, and the asymptotic score test or “modification index”. Our Monte Carlo simulation indicates that the latter two methods perform adequately, while the first method does not perform as intended.  相似文献   
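For readers unfamiliar with the quantity being tested, a bivariate residual compares the observed two-way table for a pair of items with the table implied by the fitted latent class model. A hedged sketch, assuming fitted class weights and item probabilities are already available (the exact form of the statistic and its reference distribution vary across implementations):

```python
import numpy as np

def expected_pair_table(class_weights, p_a, p_b):
    """Model-implied 2x2 table for items a and b under local independence.
    p_a[c], p_b[c] are the fitted P(item = 1 | class = c)."""
    table = np.zeros((2, 2))
    for c, w in enumerate(class_weights):
        pa = np.array([1 - p_a[c], p_a[c]])
        pb = np.array([1 - p_b[c], p_b[c]])
        table += w * np.outer(pa, pb)
    return table

def bivariate_residual(observed_counts, class_weights, p_a, p_b):
    """Pearson-type discrepancy between observed and model-expected 2x2 tables."""
    n = observed_counts.sum()
    expected = n * expected_pair_table(class_weights, p_a, p_b)
    return float(((observed_counts - expected) ** 2 / expected).sum())

# Hypothetical fitted values and an observed cross-tabulation of two items.
obs = np.array([[300.0, 120.0], [80.0, 500.0]])
stat = bivariate_residual(obs, np.array([0.5, 0.5]), [0.9, 0.2], [0.8, 0.3])
print(stat)
```

The bootstrap approach evaluated in the article would, roughly, refit the model to data simulated from the fitted parameters and recompute such statistics to obtain an empirical reference distribution, rather than referring them to a Chi-square distribution.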

3.
In multivariate categorical data, models based on conditional independence assumptions, such as latent class models, offer efficient estimation of complex dependencies. However, Bayesian versions of latent structure models for categorical data typically do not appropriately handle impossible combinations of variables, also known as structural zeros. Allowing nonzero probability for impossible combinations results in inaccurate estimates of joint and conditional probabilities, even for feasible combinations. We present an approach for estimating posterior distributions in Bayesian latent structure models with potentially many structural zeros. The basic idea is to treat the observed data as a truncated sample from an augmented dataset, thereby allowing us to exploit the conditional independence assumptions for computational expediency. As part of the approach, we develop an algorithm for collapsing a large set of structural zero combinations into a much smaller set of disjoint marginal conditions, which speeds up computation. We apply the approach to sample from a semiparametric version of the latent class model with structural zeros in the context of a key issue faced by national statistical agencies seeking to disseminate confidential data to the public: estimating the number of records in a sample that are unique in the population on a set of publicly available categorical variables. The latent class model offers remarkably accurate estimates of population uniqueness, even in the presence of a large number of structural zeros.  相似文献   

4.

We propose a novel extension of nonparametric multivariate finite mixture models by dropping the standard conditional independence assumption and incorporating the independent component analysis (ICA) structure instead. This innovation extends nonparametric mixture model estimation methods to situations in which conditional independence, a necessary assumption for the unique identifiability of the parameters in such models, is clearly violated. We formulate an objective function in terms of penalized smoothed Kullback–Leibler distance and introduce the nonlinear smoothed majorization-minimization independent component analysis algorithm for optimizing this function and estimating the model parameters. Our algorithm does not require any labeled observations a priori; it may be used for fully unsupervised clustering problems in a multivariate setting. We have implemented a practical version of this algorithm, which utilizes the FastICA algorithm, in the R package icamix. We illustrate this new methodology using several applications in unsupervised learning and image processing.


5.
Block coordinate update (BCU) methods enjoy low per-update computational complexity because each update touches only one or a few block variables among a possibly large number of blocks. They are also easily parallelized and have therefore been particularly popular for problems involving large-scale datasets and/or many variables. In this paper, we propose a primal–dual BCU method for solving linearly constrained convex programs with multi-block variables. The method is an accelerated version of a primal–dual algorithm proposed by the authors, which applies randomization in selecting the block variables to update and establishes an \(O(1/t)\) convergence rate under a convexity assumption. We show that the rate can be accelerated to \(O(1/t^2)\) if the objective is strongly convex. In addition, if one block variable is independent of the others in the objective, we show that the algorithm can be modified to achieve a linear rate of convergence. Numerical experiments show that the accelerated method performs stably with a single set of parameters, whereas the original method needs its parameters tuned for each dataset to achieve a comparable level of performance.
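As background only, the block-update idea itself (not the paper's primal–dual scheme) can be illustrated by randomized block coordinate descent on an unconstrained quadratic, where each iteration exactly minimizes over one randomly chosen block while holding the others fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_bcd_quadratic(A, b, blocks, iters=200):
    """Minimize 0.5*x'Ax - b'x by updating one randomly chosen block per iteration."""
    x = np.zeros(len(b))
    for _ in range(iters):
        idx = blocks[rng.integers(len(blocks))]
        rest = np.setdiff1d(np.arange(len(b)), idx)
        # Exact minimization over the chosen block, with the remaining blocks fixed.
        rhs = b[idx] - A[np.ix_(idx, rest)] @ x[rest]
        x[idx] = np.linalg.solve(A[np.ix_(idx, idx)], rhs)
    return x

# Hypothetical positive-definite problem split into three blocks of two coordinates.
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.standard_normal(6)
blocks = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
x_bcd = randomized_bcd_quadratic(A, b, blocks)
print(np.max(np.abs(A @ x_bcd - b)))   # near zero at the unconstrained minimizer
```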

6.
Pair-copula Bayesian networks (PCBNs) are a novel class of multivariate statistical models, which combine the distributional flexibility of pair-copula constructions (PCCs) with the parsimony of conditional independence models associated with directed acyclic graphs (DAGs). We are the first to provide generic algorithms for random sampling and likelihood inference in arbitrary PCBNs, as well as for selecting orderings of the parents of the vertices in the underlying graphs. Model selection of the DAG is facilitated using a version of the well-known PC algorithm that is based on a novel test for conditional independence of random variables tailored to the PCC framework. A simulation study shows the PC algorithm’s high aptitude for structure estimation in non-Gaussian PCBNs. The proposed methods are finally applied to modeling financial return data. Supplementary materials for this article are available online.

7.
In this article, novel joint semiparametric spline-based modeling of conditional mean and volatility of financial time series is proposed and evaluated on daily stock return data. The modeling includes functions of lagged response variables and time as predictors. The latter can be viewed as a proxy for omitted economic variables contributing to the underlying dynamics. The conditional mean model is additive. The conditional volatility model is multiplicative and linearized with a logarithmic transformation. In addition, a cube-root power transformation is employed to symmetrize the lagged response variables. Using cubic splines, the model can be written as a multiple linear regression, thereby allowing predictions to be obtained in a simple manner. As outliers are often present in financial data, reliable estimation of the model parameters is achieved by trimmed least-square (TLS) estimation for which a reasonable amount of trimming is suggested. To obtain a parsimonious specification of the model, a new model selection criterion corresponding to TLS is derived. Moreover, the (three-parameter) generalized gamma distribution is identified as suitable for the absolute multiplicative errors and shown to work well for predictions and also for the calculation of quantiles, which is important to determine the value at risk. All model choices are motivated by a detailed analysis of IBM, HP, and SAP daily returns. The prediction performance is compared to the classical generalized autoregressive conditional heteroskedasticity (GARCH) and asymmetric power GARCH (APGARCH) models as well as to a nonstationary time-trend volatility model. The results suggest that the proposed model may possess a high predictive power for future conditional volatility. Supplementary materials for this article are available online.  相似文献   

8.

A simple matrix formula is given for the observed information matrix when the EM algorithm is applied to categorical data with missing values. The formula requires only the design matrices, a matrix linking the complete and incomplete data, and a few simple derivatives. It can be easily programmed using a computer language with operators for matrix multiplication, element-by-element multiplication and division, matrix concatenation, and creation of diagonal and block diagonal arrays. The formula is applicable whenever the incomplete data can be expressed as a linear function of the complete data, such as when the observed counts represent the sum of latent classes, a supplemental margin, or the number censored. In addition, the formula applies to a wide variety of models for categorical data, including those with linear, logistic, and log-linear components. Examples include a linear model for genetics, a log-linear model for two variables and nonignorable nonresponse, the product of a log-linear model for two variables and a logit model for nonignorable nonresponse, a latent class model for the results of two diagnostic tests, and a product of linear models under double sampling.  相似文献   

9.
We compare different selection criteria for choosing the number of latent states of a multivariate latent Markov model for longitudinal data. This model is based on an underlying Markov chain to represent the evolution of a latent characteristic of a group of individuals over time. The response variables observed at different occasions are then assumed to be conditionally independent given this chain. Maximum likelihood estimation of the model is carried out through an Expectation–Maximization algorithm based on the forward–backward recursions well known in the hidden Markov literature for time series. The selection criteria we consider are based on penalized versions of the maximum log-likelihood or on the posterior probabilities of belonging to each latent state, that is, the conditional probability of the latent state given the observed data. Among the latter criteria, we propose an entropy measure tailored to latent Markov models. We report the results of a Monte Carlo simulation study aimed at comparing the performance of these state-selection criteria across a wide set of model specifications.
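As a point of reference, criteria based on posterior probabilities reward models whose state assignments are sharp. A hedged sketch of one standard normalized-entropy quantity (the measure proposed in the article is tailored to latent Markov models and may differ in detail):

```python
import numpy as np

def normalized_entropy(posteriors):
    """Average entropy of posterior state probabilities, scaled to [0, 1].

    posteriors: array of shape (n_units, n_states) with rows summing to 1,
    e.g. P(state k | observed data) for each subject-occasion.
    Values near 0 indicate sharp, well-separated states."""
    p = np.clip(posteriors, 1e-12, 1.0)
    avg_entropy = -(p * np.log(p)).sum(axis=1).mean()
    return avg_entropy / np.log(p.shape[1])

# Hypothetical posteriors for 4 units and 3 latent states.
post = np.array([
    [0.95, 0.03, 0.02],
    [0.10, 0.85, 0.05],
    [0.02, 0.08, 0.90],
    [0.40, 0.35, 0.25],
])
print(normalized_entropy(post))
```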

10.
A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.  相似文献   

11.
This work develops a general procedure for clustering functional data by adapting high-dimensional data clustering (HDDC), a clustering method originally proposed in the multivariate context. The resulting method, called funHDDC, is based on a functional latent mixture model which fits the functional data in group-specific functional subspaces. By constraining model parameters within and between groups, a family of parsimonious models is obtained that can be fitted to a variety of situations. An estimation procedure based on the EM algorithm is proposed for determining both the model parameters and the group-specific functional subspaces. Experiments on real-world datasets show that the proposed approach performs better than or similarly to classical two-step clustering methods, while providing useful interpretations of the groups and avoiding the awkward choice of a discretization technique. In particular, funHDDC appears to consistently outperform HDDC applied to spline coefficients.

12.
Bayesian networks (BNs) have attained widespread use in data analysis and decision making. Well-studied topics include efficient inference, evidence propagation, parameter learning from data for complete and incomplete data scenarios, expert elicitation for calibrating BN probabilities, and structure learning. It is common for the researcher to assume the structure of the BN or to glean the structure from expert elicitation or domain knowledge. In this scenario, the model may be calibrated through learning the parameters from relevant data. There is a lack of work on model diagnostics for fitted BNs; this is the contribution of this article. We key on the definition of (conditional) independence to develop a graphical diagnostic that indicates whether the conditional independence assumptions imposed, when one assumes the structure of the BN, are supported by the data. We develop the approach theoretically and describe a Monte Carlo method to generate uncertainty measures for the consistency of the data with conditional independence assumptions under the model structure. We describe how this theoretical information and the data are presented in a graphical diagnostic tool. We demonstrate the approach through data simulated from BNs under different conditional independence assumptions. We also apply the diagnostic to a real-world dataset. The results presented in this article show that this approach is most feasible for smaller BNs—this is not peculiar to the proposed diagnostic graphic, but rather is related to the general difficulty of combining large BNs with data in any manner (such as through parameter estimation). It is the authors’ hope that this article helps highlight the need for more research into BN model diagnostics. This article has supplementary materials online.  相似文献   

13.
Model estimation is an important research topic in machine learning, and model estimation for dynamic data is the foundation of system identification and system control. For the problem of identifying AR time series models, it is shown that, for a given order, the least-squares estimator of the AR model parameters is in essence also a moment estimator. Based on the structural risk minimization principle, and by trading off goodness of fit against model complexity, an AR model estimation algorithm based on sparse-structure iteration is proposed, and a rule for selecting the optimal regularization parameter based on generalized ridge estimation is discussed. Numerical results show that the method identifies AR models effectively and parsimoniously, with a clear improvement over the moment estimation method.
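The least-squares estimator discussed above amounts to regressing each observation on its own p lagged values. A hedged Python sketch with simulated data (the sparse iterative refinement and the generalized-ridge regularization rule from the abstract are not reproduced here):

```python
import numpy as np

def ar_least_squares(x, p):
    """Least-squares fit of an AR(p) model: x[t] = a1*x[t-1] + ... + ap*x[t-p] + e[t]."""
    n = len(x)
    # Row t of the design matrix holds the p most recent lagged values of x[t].
    X = np.column_stack([x[p - k - 1:n - k - 1] for k in range(p)])
    y = x[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

# Simulate a hypothetical AR(2) series and recover its coefficients.
rng = np.random.default_rng(1)
true_a = np.array([0.6, -0.3])
x = np.zeros(2000)
for t in range(2, len(x)):
    x[t] = true_a @ x[t - 2:t][::-1] + rng.standard_normal()
print(ar_least_squares(x, 2))   # should be close to [0.6, -0.3]
```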

14.
Purpose. Data from international educational assessments conducted in many countries are mostly analyzed using item response theory. The assumption that all items behave the same in all countries is often not tenable. The variability of item parameters across countries can be taken into account by assuming that the item parameters are random effects (De Jong et al. in J. Consum. Res. 34:260–278, 2007; De Jong and Steenkamp in Psychometrika 75:3–32, 2010). However, the complex latent structure of such a model, with latent variables both at the item and the person level, renders maximum likelihood estimation computationally challenging. We describe a variational estimation technique that consists of approximating the likelihood function by a computationally tractable lower bound.
Methods. A mean field approximation to the posterior distribution of the latent variables was used. The update equations were derived for the specific case of discrete random effects and implemented in a Maximization–Maximization algorithm (Neal and Hinton in M.I. Jordan (Ed.) Learning in Graphical Models, Kluwer Academic, Dordrecht, pp. 355–368, 1998). Parameter recovery was investigated in a simulation study. The method was also applied to the Progress in International Reading Study of 2006.
Results. The model parameters were recovered well under all conditions of the simulation study. In the application, the estimated variances of the random item effects showed a high positive correlation with traditional measures for the lack of item invariance across groups.
Conclusions. The mean field approximation and variational methods in general offer a computationally tractable alternative to exact maximum likelihood estimation.

15.
In this article, we propose an unbiased estimating equation approach for a two-component mixture model with correlated response data. We adapt the mixture-of-experts model and a generalized linear model for component distribution and mixing proportion, respectively. The new approach only requires marginal distributions of both component densities and latent variables. We use serial correlations from subjects’ subgroup memberships, which improves estimation efficiency and classification accuracy, and show that estimation consistency does not depend on the choice of the working correlation matrix. The proposed estimating equation is solved by an expectation-estimating-equation (EEE) algorithm. In the E-step of the EEE algorithm, we propose a joint imputation based on the conditional linear property for the multivariate Bernoulli distribution. In addition, we establish asymptotic properties for the proposed estimators and the convergence property using the EEE algorithm. Our method is compared to an existing competitive mixture model approach in both simulation studies and an election data application. Supplementary materials for this article are available online.  相似文献   

16.
Latent trait models such as item response theory (IRT) hypothesize a functional relationship between an unobservable, or latent, variable and an observable outcome variable. In educational measurement, a discrete item response is usually the observable outcome variable, and the latent variable is associated with an examinee’s trait level (e.g., skill, proficiency). The link between the two variables is called an item response function. This function, defined by a set of item parameters, models the probability of observing a given item response, conditional on a specific trait level. Typically in a measurement setting, neither the item parameters nor the trait levels are known, and so must be estimated from the pattern of observed item responses. Although a maximum likelihood approach can be taken in estimating these parameters, it usually cannot be employed directly. Instead, a method of marginal maximum likelihood (MML) is utilized, via the expectation-maximization (EM) algorithm. Alternating between an expectation (E) step and a maximization (M) step, the EM algorithm assures that the marginal log likelihood function will not decrease after each EM cycle, and will converge to a local maximum. Interestingly, the negative of this marginal log likelihood function is equal to the relative entropy, or Kullback-Leibler divergence, between the conditional distribution of the latent variables given the observable variables and the joint likelihood of the latent and observable variables. With an unconstrained optimization for the M-step proposed here, the EM algorithm as minimization of Kullback-Leibler divergence admits the convergence results due to Csiszár and Tusnády (Statistics & Decisions, 1:205–237, 1984), a consequence of the binomial likelihood common to latent trait models with dichotomous response variables. For this unconstrained optimization, the EM algorithm converges to a global maximum of the marginal log likelihood function, yielding an information bound that permits a fixed point of reference against which models may be tested. A likelihood ratio test between marginal log likelihood functions obtained through constrained and unconstrained M-steps is provided as a means for testing models against this bound. Empirical examples demonstrate the approach.  相似文献   
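The marginal likelihood maximized by MML integrates the latent trait out of the conditional response probabilities. A hedged sketch for a two-parameter logistic (2PL) item response function with a standard normal trait distribution approximated on a quadrature grid; the item parameters below are hypothetical:

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: P(correct | theta) with discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def marginal_likelihood(responses, a, b, n_points=61):
    """P(response pattern) with the latent trait integrated out numerically,
    assuming a standard normal trait distribution."""
    theta = np.linspace(-4, 4, n_points)
    weights = np.exp(-0.5 * theta ** 2)
    weights /= weights.sum()                     # normalized quadrature weights
    p = irf_2pl(theta[:, None], a, b)            # shape (n_points, n_items)
    cond = np.prod(p ** responses * (1 - p) ** (1 - responses), axis=1)
    return float(weights @ cond)

# Hypothetical 3-item test and one examinee's response pattern.
a = np.array([1.2, 0.8, 1.5])    # discriminations
b = np.array([-0.5, 0.0, 1.0])   # difficulties
print(marginal_likelihood(np.array([1, 1, 0]), a, b))
```

The EM algorithm described in the abstract alternates between computing the posterior weight of each quadrature point given the observed responses (E-step) and updating the item parameters to increase this marginal likelihood (M-step).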

17.
Logistic regression techniques can be used to restrict the conditional probabilities of a Bayesian network for discrete variables. More specifically, each variable of the network can be modeled through a logistic regression model, in which the parents of the variable define the covariates. When all main effects and interactions between the parent variables are incorporated as covariates, the conditional probabilities are estimated without restrictions, as in a traditional Bayesian network. By incorporating interaction terms up to a specific order only, the number of parameters can be drastically reduced. Furthermore, ordered logistic regression can be used when the categories of a variable are ordered, resulting in even more parsimonious models. Parameters are estimated by a modified junction tree algorithm. The approach is illustrated with the Alarm network.  相似文献   
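The parameter reduction described here can be sketched directly: the child's conditional probability table is replaced by a logistic regression on its parents, with interaction terms retained only up to a chosen order. A hedged illustration using scikit-learn; the variable names and data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical binary parent variables and a binary child node.
parents = rng.integers(0, 2, size=(500, 3))
logit = -1.0 + 1.5 * parents[:, 0] - 1.0 * parents[:, 1] + 0.5 * parents[:, 2]
child = rng.random(500) < 1 / (1 + np.exp(-logit))

# Main effects plus pairwise interactions only (order 2); including order-3 terms
# would recover the unrestricted conditional probability table for three binary parents.
design = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X = design.fit_transform(parents)
model = LogisticRegression().fit(X, child)

# Restricted conditional probability of the child for every parent configuration.
configs = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
print(model.predict_proba(design.transform(configs))[:, 1])
```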

18.
It is natural to assume that a missing-data mechanism depends on latent variables in the analysis of incomplete data in latent variate modeling because latent variables are error-free and represent key notions investigated by applied researchers. Unfortunately, the missing-data mechanism is then not missing at random (NMAR). In this article, a new estimation method is proposed, which leads to consistent and asymptotically normal estimators for all parameters in a linear latent variate model, where the missing mechanism depends on the latent variables and no concrete functional form for the missing-data mechanism is used in estimation. The method to be proposed is a type of multi-sample analysis with or without mean structures, and hence, it is easy to implement. Complete-case analysis is shown to produce consistent estimators for some important parameters in the model.  相似文献   

19.
This paper considers a semi-parametric mixed model for longitudinal counts under the assumption that, conditional on a common random effect over time, the repeated count responses of an individual follow a Poisson AR(1) (auto-regressive of order 1) non-stationary correlation structure. A step-by-step estimation approach is developed which provides consistent estimators for the non-parametric function, the regression parameters, the variance of the random effects, and the auto-correlation structure of the model. Proofs of the consistency of the estimators, along with their convergence rates, are derived. A simulation study first examines the effect on parameter estimation of ignoring the non-parametric function, and then carries out an overall estimation study in which the non-parametric function is estimated as well.

20.
The pricing of insurance policies requires estimates of the total loss. The traditional compound model imposes an independence assumption on the number of claims and their individual sizes. Bivariate models, which model both variables jointly, eliminate this assumption. A regression approach allows policy holder characteristics and product features to be included in the model. This article presents a bivariate model that uses joint random effects across both response variables to induce dependence effects. Bayesian posterior estimation is done using Markov Chain Monte Carlo (MCMC) methods. A real data example demonstrates that our proposed model exhibits better fitting and forecasting capabilities than existing models.  相似文献   
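For context, the traditional compound model computes the total loss as a random sum of claim sizes that are independent of the claim count, so E[S] = E[N]E[X]. A hedged Monte Carlo sketch of that baseline with hypothetical distributions and parameters; the article's bivariate model instead links counts and sizes through shared random effects:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_total_loss(n_policies, lam=0.3, log_mu=9.0, log_sigma=0.8):
    """Compound model: claim count ~ Poisson(lam), claim sizes ~ lognormal(log_mu, log_sigma),
    with counts and sizes independent (the assumption the bivariate model removes)."""
    counts = rng.poisson(lam, size=n_policies)
    totals = np.array([
        rng.lognormal(log_mu, log_sigma, size=n).sum() for n in counts
    ])
    return totals

totals = simulate_total_loss(100_000)
analytic = 0.3 * np.exp(9.0 + 0.8 ** 2 / 2)    # E[N] * E[X] for the lognormal claim size
print(totals.mean(), analytic)                  # Monte Carlo estimate vs closed form
```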
