Similar literature
 20 similar documents found (search time: 265 ms)
1.
While graphical models for continuous data (Gaussian graphical models) and discrete data (Ising models) have been extensively studied, there is little work on graphical models for datasets with both continuous and discrete variables (mixed data), which are common in many scientific applications. We propose a novel graphical model for mixed data, which is simple enough to be suitable for high-dimensional data, yet flexible enough to represent all possible graph structures. We develop a computationally efficient regression-based algorithm for fitting the model by focusing on the conditional log-likelihood of each variable given the rest. The parameters have a natural group structure, and sparsity in the fitted graph is attained by incorporating a group lasso penalty, approximated by a weighted lasso penalty for computational efficiency. We demonstrate the effectiveness of our method through an extensive simulation study and apply it to a music annotation dataset (CAL500), obtaining a sparse and interpretable graphical model relating the continuous features of the audio signal to binary variables such as genre, emotions, and usage associated with particular songs. While we focus on binary discrete variables for the main presentation, we also show that the proposed methodology can be easily extended to general discrete variables.

2.
Probabilistic Decision Graphs (PDGs) are probabilistic graphical models that represent a factorisation of a discrete joint probability distribution using a “decision graph”-like structure over local marginal parameters. The structure of a PDG enables the model to capture some context specific independence relations that are not representable in the structure of more commonly used graphical models such as Bayesian networks and Markov networks. This sometimes makes operations in PDGs more efficient than in alternative models. PDGs have previously been defined only in the discrete case, assuming a multinomial joint distribution over the variables in the model. We extend PDGs to incorporate continuous variables, by assuming a Conditional Gaussian (CG) joint distribution. We also show how inference can be carried out in an efficient way.

3.
This paper reviews estimation problems with missing, or hidden, data. We formulate this problem in the context of Markov models and consider two interrelated issues, namely, the estimation of a state given measured data and model parameters, and the estimation of model parameters given the measured data alone. We also consider situations where the measured data is, itself, incomplete in some sense. We deal with various combinations of discrete and continuous states and observations.

4.
This study provides operational guidance for building naïve Bayes Bayesian network (BN) models for bankruptcy prediction. First, we suggest a heuristic method that guides the selection of bankruptcy predictors. Based on the correlations and partial correlations among variables, the method aims at eliminating redundant and less relevant variables. A naïve Bayes model is developed using the proposed heuristic method and is found to perform well based on a 10-fold validation analysis. The developed naïve Bayes model consists of eight first-order variables, six of which are continuous. We also provide guidance on building a cascaded model by selecting second-order variables to compensate for missing values of first-order variables. Second, we analyze whether the number of states into which the six continuous variables are discretized has an impact on the model’s performance. Our results show that the model’s performance is best when the number of states for discretization is either two or three. Starting from four states, the performance starts to deteriorate, probably due to over-fitting. Third, we test whether modeling continuous variables with continuous distributions instead of discretizing them can improve the model’s performance. Our findings suggest that it does not. One possible reason is that the continuous distributions tested in the study do not represent well the underlying distributions of the empirical data. Finally, the results of this study could also be applicable to business decision-making contexts other than bankruptcy prediction.

5.
We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme involves a novel symmetric use of the group-lasso norm and follows naturally from a particular parameterization of the model. Supplementary materials for this article are available online.

6.
This paper develops two copula models for fitting insurance claim numbers with excess zeros and time-dependence. The joint distribution of the claims in two successive periods is modeled by a copula with discrete or continuous marginal distributions. The first model fits two successive claims by a bivariate copula with discrete marginal distributions. In the second model, a copula is used to model the random effects of the conjoint numbers of successive claims with continuous marginal distributions. The zero-inflation phenomenon is taken into account in both copula models. Maximum likelihood is applied to estimate the parameters of the discrete copula model. A two-step procedure is proposed to estimate the parameters in the second model: the first step estimates the marginals, and the second step estimates the unobserved random effect variables and the copula parameter. Simulations are performed to assess the proposed models and methodologies.

7.
Most current implementations of multiple imputation (MI) assume that data are missing at random (MAR), but this assumption is generally untestable. We performed analyses to test the effects of auxiliary variables on MI when the data are missing not at random (MNAR) using simulated data and real data. In the analyses we varied (a) the correlation, (b) the level of missing data, (c) the pattern of missing data, and (d) sample size. Results showed that MI performed adequately without auxiliary variables, but that auxiliary variables had a modest impact on bias in the real data and improved efficiency in both data sets. The results of this study suggest that, counter to the concern about the violation of the MAR assumption, MI appears to be quite robust to missing data that are MNAR in analytic situations such as the ones presented here. Further, results can be made even better via the use of auxiliary variables, particularly when efficiency is a primary concern.

8.
The effects of quantized data upon parameter estimation are investigated by re-examining a variety of simple and complicated risk models previously studied by the author. In spite of this unifying theme, no general principles arise, except for demonstrating that estimation in models with two or more parameters can lead to unpredictable results, with or without the introduction of discrete data. In fact, certain common actuarial models are shown always to have poor estimation properties, even using substantial amounts of continuous data. The paper concludes with a plea for the redevelopment of classical models that are continuous in nature, rather than perpetuating the current discrete multi-parameter models, whose estimation properties are poor, since modern technology now permits inexpensive capture of all kinds of continuous data.

9.
A clustering method is presented for analysing multivariate binary data with missing values. When not all values are observed, Govaert [3] has studied the relations between clustering methods and statistical models. The author has shown how the identification of a mixture of Bernoulli distributions with the same parameter for all clusters and for all variables corresponds to a clustering criterion which uses L1 distance characterizing the MNDBIN method (Marchetti [8]). He first generalized this model by selecting parameters which can depend on variables and finally by selecting parameters which can depend both on variables and on clusters. We use the previous models to derive a clustering method adapted to missing data. This method optimizes a criterion by a standard iterative partitioning algorithm which removes the necessity either to ignore objects or to substitute the missing data. We study several versions of this algorithm and, finally, a brief account is given of the application of this method to some simulated data.

10.
Much work has focused on developing exact tests for the analysis of discrete data using log linear or logistic regression models. A parametric model is tested for a dataset by conditioning on the value of a sufficient statistic and determining the probability of obtaining another dataset as extreme or more extreme relative to the general model, where extremeness is determined by the value of a test statistic such as the chi-square or the log-likelihood ratio. Exact determination of these probabilities can be infeasible for high dimensional problems, and asymptotic approximations to them are often inaccurate when there are small data entries and/or there are many nuisance parameters. In these cases Monte Carlo methods can be used to estimate exact probabilities by randomly generating datasets (tables) that match the sufficient statistic of the original table. However, naive Monte Carlo methods produce tables that are usually far from matching the sufficient statistic. The Markov chain Monte Carlo method used in this work (the regression/attraction approach) uses attraction to concentrate the distribution around the set of tables that match the sufficient statistic, and uses regression to take advantage of information in tables that “almost” match. It is also more general than others in that it does not require the sufficient statistic to be linear, and it can be adapted to problems involving continuous variables. The method is applied to several high dimensional settings including four-way tables with a model of no four-way interaction, and a table of continuous data based on beta distributions. It is powerful enough to deal with the difficult problem of four-way tables and flexible enough to handle continuous data with a nonlinear sufficient statistic.

11.
In the general insurance modeling literature, there has been a lot of work based on univariate zero-truncated models, but little has been done in the multivariate zero-truncation cases, for instance a line of insurance business with various classes of policies. There are three types of zero-truncation in the multivariate setting: only records with all zeros are missing, zero counts for one or some classes are missing, or zeros are completely missing for all classes. In this paper, we focus on the first case, the so-called Type I zero-truncation, and a new multivariate zero-truncated hurdle model is developed to study it. The key idea of developing such a model is to identify a stochastic representation for the underlying random variables, which enables us to use the EM algorithm to simplify the estimation procedure. This model is used to analyze a health insurance claims dataset that contains claim counts from different categories of claims without common zero observations.

12.
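This paper treats the economic system as a kind of birth-death process. Drawing on the successful experience of population control theory and of forest systems, it studies the critical-value problem of the economic system. First, based on practical models for economic analysis and forecasting, a control model of the economic system is built at the macro level. The continuous model is convenient for theoretical study, and the discrete model is convenient for computer simulation. Then, on the basis of this control model, an expression is sought for the minimum asset accumulation rate required for sustained development of the national economy. The theoretical value obtained in this paper will help us understand the economic system more deeply.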

13.
The binomial software reliability growth model (SRGM) contains most existing SRGMs proposed in earlier work as special cases, and can describe every software failure-occurrence pattern in continuous time. In this paper, we propose generalized binomial SRGMs in both continuous and discrete time, based on the idea of cumulative Bernoulli trials. It is shown that the proposed models give some new unusual discrete models as well as the well-known continuous SRGMs. Through numerical examples with actual software failure data, two estimation methods for model parameters with grouped data are provided, and the predictive model performance is examined quantitatively.

14.
The present paper deals with the identification and maximum likelihood estimation of systems of linear stochastic differential equations using panel data. So we only have a sample of discrete observations over time of the relevant variables for each individual. A popular approach in the social sciences advocates the estimation of the “exact discrete model” after a reparameterization with LISREL or similar programs for structural equations models. The “exact discrete model” corresponds to the continuous time model in the sense that observations at equidistant points in time that are generated by the latter system also satisfy the former. In the LISREL approach the reparameterized discrete time model is estimated first without taking into account the nonlinear mapping from the continuous to the discrete time parameters. In a second step, using the inverse mapping, the fundamental system parameters of the continuous time system in which we are interested are inferred. However, some severe problems arise with this “indirect approach”. First, an identification problem may arise in multiple equation systems, since the matrix exponential function defining some of the new parameters is in general not one‐to‐one, and hence the inverse mapping mentioned above does not exist. Second, usually some sort of approximation of the time paths of the exogenous variables is necessary before the structural parameters of the system can be estimated with discrete data. Two simple approximation methods are discussed. In both approximation methods the resulting new discrete time parameters are connected in a complicated way. So estimating the reparameterized discrete model by OLS without restrictions does not yield maximum likelihood estimates of the desired continuous time parameters as claimed by some authors. Third, a further limitation of estimating the reparameterized model with programs for structural equations models is that even simple restrictions on the original fundamental parameters of the continuous time system cannot be dealt with. This issue is also discussed in some detail. For these reasons the “indirect method” cannot be recommended. In many cases the approach leads to misleading inferences. We strongly advocate the direct estimation of the continuous time parameters. This approach is more involved, because the exact discrete model is nonlinear in the original parameters. A computer program by Hermann Singer that provides appropriate maximum likelihood estimates is described.

15.
When missing data are either missing completely at random (MCAR) or missing at random (MAR), the maximum likelihood (ML) estimation procedure preserves many of its properties. However, in any statistical modeling, the distribution specification for the likelihood function is at best only an approximation to the real world. In particular, since the normal-distribution-based ML is typically applied to data with heterogeneous marginal skewness and kurtosis, it is necessary to know whether such a practice still generates consistent parameter estimates. When the manifest variables are linear combinations of independent random components and missing data are MAR, this paper shows that the normal-distribution-based MLE is consistent regardless of the distribution of the sample. Examples also show that the consistency of the MLE is not guaranteed for all nonnormally distributed samples. When the population follows a confirmatory factor model, and data are missing due to the magnitude of the factors, the MLE may not be consistent even when data are normally distributed. When data are missing due to the magnitude of measurement errors/uniqueness, MLEs for many of the covariance parameters related to the missing variables are still consistent. This paper also identifies and discusses the factors that affect the asymptotic biases of the MLE when data are not missing at random. In addition, the paper shows that, under certain data models and MAR mechanism, the MLE is asymptotically normally distributed and the asymptotic covariance matrix is consistently estimated by the commonly used sandwich-type covariance matrix. The results indicate that certain formulas and/or conclusions in the existing literature may not be entirely correct.

16.
A simple matrix formula is given for the observed information matrix when the EM algorithm is applied to categorical data with missing values. The formula requires only the design matrices, a matrix linking the complete and incomplete data, and a few simple derivatives. It can be easily programmed using a computer language with operators for matrix multiplication, element-by-element multiplication and division, matrix concatenation, and creation of diagonal and block diagonal arrays. The formula is applicable whenever the incomplete data can be expressed as a linear function of the complete data, such as when the observed counts represent the sum of latent classes, a supplemental margin, or the number censored. In addition, the formula applies to a wide variety of models for categorical data, including those with linear, logistic, and log-linear components. Examples include a linear model for genetics, a log-linear model for two variables and nonignorable nonresponse, the product of a log-linear model for two variables and a logit model for nonignorable nonresponse, a latent class model for the results of two diagnostic tests, and a product of linear models under double sampling.

17.
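Consider two linear models with incomplete sample data, in which the covariates are fully observed and the responses are missing at random. Random regression imputation is used to fill in the missing responses, yielding "complete" sample data for the two linear regression models. Under certain conditions, the limiting distribution of the log empirical likelihood ratio statistic for the difference between the quantiles of the two responses is shown to be a weighted chi-square distribution with one degree of freedom, and this result is used to construct empirical likelihood confidence intervals for the quantile difference. Simulation results show that the confidence intervals obtained under random imputation have high coverage accuracy.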

18.
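This paper studies two linear models with incomplete sample data, in which the covariates are fully observed and the responses are missing at random (MAR). Inverse probability weighted imputation is used to fill in the missing responses, yielding "complete" sample data for the two linear regression models, and on this basis a log empirical likelihood ratio statistic is constructed for the difference between the response quantiles. Unlike previous results, this paper proves that under certain conditions the limiting distribution of the statistic is standard, which reduces the error introduced by estimating the weight coefficients and leads to more accurate empirical likelihood confidence intervals for the quantile difference.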

19.
In this paper, we consider a general class of nonlinear mixed discrete programming problems. By introducing continuous variables to replace the discrete variables, the problem is first transformed into an equivalent nonlinear continuous optimization problem subject to original constraints and additional linear and quadratic constraints. Then, an exact penalty function is employed to construct a sequence of unconstrained optimization problems, each of which can be solved effectively by unconstrained optimization techniques, such as conjugate gradient or quasi-Newton methods. It is shown that any local optimal solution of the unconstrained optimization problem is a local optimal solution of the transformed nonlinear constrained continuous optimization problem when the penalty parameter is sufficiently large. Numerical experiments are carried out to test the efficiency of the proposed method.

20.
We propose a two-component graphical chain model, the discrete regression distribution, where a set of discrete random variables is modeled as a response to a set of categorical and continuous covariates. The proposed model is useful for modeling a set of discrete variables measured at multiple sites along with a set of continuous and/or discrete covariates. The proposed model allows for joint examination of the dependence structure of the discrete response and observed covariates and also accommodates site-to-site variability. We develop the graphical model properties and theoretical justifications of this model. Our model has several advantages over the traditional logistic normal model used to analyze similar compositional data, including site-specific random effect terms and the incorporation of discrete and continuous covariates.

