Similar Documents
20 similar documents were retrieved.
1.
The available methods for handling missing values in principal component analysis provide only point estimates of the parameters (axes and components) and of the missing values. To take into account the variability due to missing values, a multiple imputation method is proposed. First, a method to generate multiple imputed data sets from a principal component analysis model is defined. Then, two ways to visualize the uncertainty due to missing values in the principal component analysis results are described. The first is to project the imputed data sets onto a reference configuration as supplementary elements, to assess the stability of the individuals (respectively, of the variables). The second is to perform a principal component analysis on each imputed data set and to fit each resulting configuration onto the reference one with a Procrustes rotation. The latter strategy makes it possible to assess the variability of the principal component analysis parameters induced by the missing values. The methodology is then evaluated on a real data set.
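A minimal Python sketch of the imputation-plus-Procrustes idea described above, assuming an iterative-PCA fill-in as the imputation engine (the authors' exact multiple imputation procedure is not reproduced); `impute_pca`, the noise level, and the toy data are illustrative.

```python
# Hedged sketch: multiple imputed PCA fits compared to a reference
# configuration via Procrustes rotation. The iterative-PCA imputation is a
# simplified stand-in for the paper's MI procedure, not its exact algorithm.
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def impute_pca(X, n_comp=2, n_iter=50, noise=0.0):
    """Iteratively impute missing cells from a rank-n_comp PCA reconstruction."""
    X = X.copy()
    mask = np.isnan(X)
    X[mask] = np.nanmean(X, axis=0)[np.where(mask)[1]]  # start from column means
    for _ in range(n_iter):
        pca = PCA(n_components=n_comp).fit(X)
        X_hat = pca.inverse_transform(pca.transform(X))
        X[mask] = X_hat[mask] + noise * rng.standard_normal(mask.sum())
    return X

# Toy data with 10% missing cells; reference configuration from one fill-in.
X = rng.standard_normal((100, 5))
X[rng.random(X.shape) < 0.1] = np.nan
ref_scores = PCA(n_components=2).fit_transform(impute_pca(X))

# Several noisy imputations, each refit by PCA and aligned by Procrustes rotation.
for m in range(5):
    scores_m = PCA(n_components=2).fit_transform(impute_pca(X, noise=0.5))
    _, aligned, disparity = procrustes(ref_scores, scores_m)
    print(f"imputation {m}: Procrustes disparity = {disparity:.4f}")
```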

2.
Against the background of massive credit-reporting data, a shrinking nearest-neighbor imputation method is proposed to reduce the computational cost of imputing missing data. The method completes the imputation in three stages. In the first stage, inclusion probabilities are computed from the missing proportions of the samples and variables, and the data are shrunk by unequal-probability sampling. In the second stage, based on the distances between samples, the samples nearest to those with missing values are selected to form the training set. In the third stage, a random forest model is built for iterative imputation. Simulation studies on the Australian data set and on data sets from Chinese banks show that, while maintaining a given imputation accuracy, the shrinking nearest-neighbor method substantially reduces the amount of computation.
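A hedged sketch of the three-stage procedure described in this abstract; the inclusion probabilities, neighborhood size, and forest settings are illustrative choices, not the paper's exact algorithm.

```python
# Hedged sketch of the "shrinking nearest-neighbour" idea: thin the data with
# missingness-driven sampling, keep the neighbours of incomplete rows, then run
# iterative random-forest imputation on the reduced training set.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 6))
X[rng.random(X.shape) < 0.05] = np.nan

# Stage 1: inclusion probability tied to a row's missing fraction, so
# incomplete rows are always kept and complete rows are thinned.
miss_frac = np.isnan(X).mean(axis=1)
keep = (miss_frac > 0) | (rng.random(len(X)) < 0.3)
X_sub = X[keep]

# Stage 2: restrict the donor pool to neighbours of the incomplete rows
# (distances computed on mean-filled copies for simplicity).
filled = np.where(np.isnan(X_sub), np.nanmean(X_sub, axis=0), X_sub)
nn = NearestNeighbors(n_neighbors=20).fit(filled)
idx = np.unique(nn.kneighbors(filled[np.isnan(X_sub).any(axis=1)])[1])
X_train = X_sub[idx]

# Stage 3: iterative imputation with a random-forest regressor.
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                           max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X_train)
```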

3.
Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to specify, for each of J variables with missing values, a univariate conditional distribution given all other variables, and then to draw imputations by iterating over the J conditional distributions. Such fully conditional imputation strategies have the theoretical drawback that the conditional distributions may be incompatible. When the missingness pattern is monotone, a theoretically valid approach is to specify, for each variable with missing values, a conditional distribution given the variables with fewer or the same number of missing values and sequentially draw from these distributions. In this article, we propose the “multiple imputation by ordered monotone blocks” approach, which combines these two basic approaches by decomposing any missingness pattern into a collection of smaller “constructed” monotone missingness patterns, and iterating. We apply this strategy to impute the missing data in the AVRP interim data. Supplemental materials, including all source code and a synthetic example dataset, are available online.
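A hedged sketch of the monotone building block that the ordered-blocks method iterates over: variables are ordered by their number of missing values and each is drawn from a regression on the less-missing ones. The Bayesian linear model and toy data are illustrative assumptions, not the article's specification.

```python
# Hedged sketch of sequential imputation for a *monotone* missingness pattern.
# The ordered-blocks decomposition and iteration of the article are not
# reproduced here; only the monotone sequential draw is illustrated.
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({"x1": rng.standard_normal(n)})
df["x2"] = 0.8 * df["x1"] + rng.standard_normal(n)
df["x3"] = 0.5 * df["x2"] + rng.standard_normal(n)
df.loc[rng.random(n) < 0.2, "x3"] = np.nan                      # most missing
df.loc[df["x3"].isna() & (rng.random(n) < 0.5), "x2"] = np.nan  # nested -> monotone

order = df.isna().sum().sort_values().index  # fewest missing values first
for j, col in enumerate(order):
    if df[col].isna().any():
        preds = list(order[:j])              # variables with fewer missing values
        obs = df[col].notna()
        model = BayesianRidge().fit(df.loc[obs, preds], df.loc[obs, col])
        mu, sd = model.predict(df.loc[~obs, preds], return_std=True)
        # Draw (rather than plug in) so the fill-in is a proper imputation.
        df.loc[~obs, col] = mu + sd * rng.standard_normal((~obs).sum())
```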

4.
Summary. The main purpose of this paper is a comparison of several imputation methods within the simple additive model y = f(x) + ε, where the independent variable X is subject to values missing completely at random. Besides the well-known complete case analysis, mean imputation plus random noise, single imputation, and two kinds of nearest neighbor imputation are used. A short introduction to the model, the missing-data mechanism, the inference, the imputation methods, and their implementation is followed by the main focus, the simulation experiment. The methods are compared within the experiment on the basis of the sample mean squared error, estimated variances, and estimated biases of f(x) at the knots.
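A hedged sketch of the kind of simulation comparison described: X is made MCAR, and complete-case analysis, mean imputation plus random noise, and a nearest-neighbor fill-in are compared by the mean squared error of a kernel estimate of f at a set of knots. The smoother, bandwidth, and matching rule are illustrative, not the paper's settings.

```python
# Hedged simulation sketch: compare imputation strategies for a missing
# covariate in y = f(x) + eps by the MSE of a Nadaraya-Watson fit at knots.
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)

def fit_at_knots(x, y, knots, h=0.1):
    """Nadaraya-Watson estimate of f at the knots (Gaussian kernel)."""
    w = np.exp(-0.5 * ((knots[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

knots = np.linspace(0.1, 0.9, 9)
mse = {"complete case": [], "mean + noise": [], "1-NN": []}
for _ in range(200):
    x = rng.uniform(0, 1, 200)
    y = f(x) + 0.3 * rng.standard_normal(200)
    miss = rng.random(200) < 0.3                      # X missing completely at random
    x_cc, y_cc = x[~miss], y[~miss]
    # Mean imputation plus random noise for the missing X values.
    x_mean = np.where(miss, x[~miss].mean() + x[~miss].std() * rng.standard_normal(200), x)
    # Simple 1-NN donor matching on the (fully observed) response.
    donor = x[~miss][np.abs(y[miss, None] - y[~miss][None, :]).argmin(axis=1)]
    x_nn = x.copy()
    x_nn[miss] = donor
    mse["complete case"].append(np.mean((fit_at_knots(x_cc, y_cc, knots) - f(knots)) ** 2))
    mse["mean + noise"].append(np.mean((fit_at_knots(x_mean, y, knots) - f(knots)) ** 2))
    mse["1-NN"].append(np.mean((fit_at_knots(x_nn, y, knots) - f(knots)) ** 2))

for k, v in mse.items():
    print(f"{k:>14}: MSE = {np.mean(v):.4f}")
```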

5.
Application of the hedonic imputation method in compiling quality-adjusted price indices
When data are missing, imputation is a common way to infer the missing values. In compiling a price index, a product present in the base period may disappear from the market in the reporting period, or a new product may appear in the reporting period; both can be viewed as cases of missing data. At the same time, because product quality changes between periods, the compiled price index may contain a "quality-change bias." Hedonic imputation combines the hedonic method with imputation for missing data, handling the missing data while also correcting the quality-change bias in the price index. This paper discusses several possible forms of hedonic imputation and compares their characteristics. Using data on notebook computers in China, it also compiles hedonic imputation price indices and carries out the corresponding empirical analysis.
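A hedged sketch of hedonic single imputation for a price index, assuming a log-price hedonic regression and a Jevons-type index over the completed sample; the characteristics, synthetic data, and index formula are illustrative stand-ins for the notebook-computer application.

```python
# Hedged sketch: impute the reporting-period price of disappeared products
# from a hedonic regression on their characteristics, then compute a
# Jevons-type index over the completed (matched) sample.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 300
base = pd.DataFrame({"cpu_ghz": rng.uniform(1.5, 3.5, n),
                     "ram_gb": rng.choice([4, 8, 16], n)})
base["log_p0"] = 6 + 0.4 * base["cpu_ghz"] + 0.05 * base["ram_gb"] + 0.1 * rng.standard_normal(n)
base["log_p1"] = base["log_p0"] + 0.03 + 0.1 * rng.standard_normal(n)
base.loc[rng.random(n) < 0.25, "log_p1"] = np.nan   # products that disappear in period 1

# Hedonic regression in period 1 on the products still observed.
fit1 = smf.ols("log_p1 ~ cpu_ghz + ram_gb", data=base.dropna()).fit()

# Impute the period-1 log price of disappeared products from their characteristics.
miss = base["log_p1"].isna()
base.loc[miss, "log_p1"] = fit1.predict(base.loc[miss, ["cpu_ghz", "ram_gb"]])

# Jevons-type index over the completed sample.
index = np.exp((base["log_p1"] - base["log_p0"]).mean())
print(f"hedonic imputation price index: {index:.4f}")
```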

6.
In this paper, we investigate the model checking problem for a partial linear model when some responses are missing at random. Using imputation and marginal inverse probability weighted methods, two completed data sets are constructed. Based on these two completed data sets, we build two empirical-process-based tests for examining the adequacy of the partial linearity of the model. The asymptotic distributions of the test statistics under the null hypothesis and under local alternative hypotheses are obtained. A re-sampling approach is applied to approximate the null distributions of the test statistics. Simulation results show that the proposed tests work well and that both proposed methods have better finite-sample properties than the complete case (CC) analysis, which discards all subjects with missing data.
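A hedged sketch of the two "completed" responses that such tests are built on: an imputation fill-in and a marginal inverse-probability-weighted transform, with the selection probability estimated by logistic regression. The empirical-process test statistics themselves are not reproduced, and the working models are illustrative.

```python
# Hedged sketch: construct imputation-completed and IPW-completed responses
# when the response is missing at random given the covariates.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
n = 1000
x = rng.standard_normal((n, 2))
y = 1.0 + x[:, 0] + np.sin(x[:, 1]) + 0.5 * rng.standard_normal(n)
delta = rng.random(n) < 1 / (1 + np.exp(-(0.5 + x[:, 0])))   # MAR: depends on x only

# Imputation-completed response: observed y where available, fitted value otherwise.
reg = LinearRegression().fit(x[delta], y[delta])
y_imp = np.where(delta, y, reg.predict(x))

# IPW-completed response: delta * y / pi_hat(x), which has mean E[Y | x] under MAR.
pi_hat = LogisticRegression().fit(x, delta).predict_proba(x)[:, 1]
y_ipw = delta * y / pi_hat
```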

7.
In many applications, some covariates may be missing for various reasons. Regression quantiles can be either biased or under-powered when the missing data are ignored. Multiple imputation and EM-based augmentation approaches have been proposed to make full use of the data with missing covariates in quantile regression. Both methods, however, are computationally expensive. We propose a fast imputation algorithm (FI) to handle missing covariates in quantile regression, which extends fractional imputation from likelihood-based regressions. FI and the modified imputation algorithms (FIIPW and MIIPW) are compared with the existing MI and IPW approaches in simulation studies and applied to part of the National Collaborative Perinatal Project study.
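A hedged sketch in the spirit of fractional imputation for a missing covariate in quantile regression: each incomplete row is expanded into M draws from a working model, all with equal fractional weight (implemented here by replication), before the median regression is fit. This is an illustration, not the paper's FI/FIIPW/MIIPW algorithms.

```python
# Hedged sketch of a fractional-imputation style fill-in before median regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n, M = 500, 10
df = pd.DataFrame({"x1": rng.standard_normal(n)})
df["x2"] = 0.6 * df["x1"] + 0.8 * rng.standard_normal(n)
df["y"] = 1 + df["x1"] + 0.5 * df["x2"] + rng.standard_normal(n)
df.loc[rng.random(n) < 0.3, "x2"] = np.nan

# Working model for the missing covariate given the observed variables.
obs = df["x2"].notna()
fit_x2 = smf.ols("x2 ~ x1 + y", data=df[obs]).fit()
sigma = np.sqrt(fit_x2.scale)

rows = [df[obs]] * M          # complete rows replicated M times ...
for _ in range(M):            # ... so every subject carries the same total weight
    draw = df[~obs].copy()
    draw["x2"] = fit_x2.predict(draw) + sigma * rng.standard_normal((~obs).sum())
    rows.append(draw)
expanded = pd.concat(rows, ignore_index=True)

median_fit = smf.quantreg("y ~ x1 + x2", data=expanded).fit(q=0.5)
print(median_fit.params)
```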

8.
Consider two nonparametric populations whose sample data are incomplete. Fractional imputation is used to fill in the missing data, yielding "complete" sample data for the two populations, on the basis of which an empirical likelihood confidence interval for the difference between the quantiles of the two populations is constructed. Simulation results show that fractional imputation yields more accurate confidence intervals.

9.
Analysis of uncertainty is often neglected in the evaluation of complex systems models, such as computational models used in hydrology or ecology. Prediction uncertainty arises from a variety of sources, such as input error, calibration accuracy, parameter sensitivity and parameter uncertainty. In this study, various computational approaches were investigated for analysing the impact of parameter uncertainty on predictions of streamflow for a water-balance hydrological model used in eastern Australia. The parameters and associated equations which had greatest impact on model output were determined by combining differential error analysis and Monte Carlo simulation with stochastic and deterministic sensitivity analysis. This integrated approach aids in the identification of insignificant or redundant parameters and provides support for further simplifications in the mathematical structure underlying the model. Parameter uncertainty was represented by a probability distribution and simulation experiments revealed that the shape (skewness) of the distribution had a significant effect on model output uncertainty. More specifically, increasing negative skewness of the parameter distribution correlated with decreasing width of the model output confidence interval (i.e. resulting in less uncertainty). For skewed distributions, characterisation of uncertainty is more accurate using the confidence interval from the cumulative distribution rather than using variance. The analytic approach also identified the key parameters and the non-linear flux equation most influential in affecting model output uncertainty.
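A hedged sketch of the Monte Carlo component: a toy nonlinear flux equation is run with a parameter drawn from distributions of different skewness, and the width of the 95% output interval is compared. The model and the distributions are illustrative stand-ins for the water-balance model.

```python
# Hedged sketch: propagate a skewed parameter distribution through a simple
# nonlinear flux, q = k * s**b, and compare the resulting output intervals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
storage = 50.0

def streamflow(k, b=1.3, s=storage):
    return k * s ** b          # simple nonlinear flux equation

for name, skew_a in [("left-skewed", -5), ("symmetric", 0), ("right-skewed", 5)]:
    k = stats.skewnorm.rvs(a=skew_a, loc=0.5, scale=0.1, size=20000, random_state=rng)
    q = streamflow(k)
    lo, hi = np.percentile(q, [2.5, 97.5])
    print(f"{name:>12}: 95% interval width = {hi - lo:.2f}, variance = {q.var():.2f}")
```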

10.
This paper presents a decomposition of the posterior distribution of the covariance matrix of normal models under a family of prior distributions when missing data are ignorable and monotone. This decomposition is an extension of Bartlett's decomposition of the Wishart distribution to monotone missing data. It is not only theoretically interesting but also practically useful. First, with monotone missing data, it allows more efficient drawing of parameters from the posterior distribution than the factorized likelihood approach. Furthermore, with nonmonotone missing data, it allows for a very efficient monotone data augmentation algorithm and thereby multiple imputation of the missing data needed to create a monotone pattern.
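For reference, a hedged sketch of the classical Bartlett decomposition that the paper extends, used here to draw Wishart matrices; the monotone-missing-data extension itself is not reproduced, and the scale matrix and degrees of freedom are illustrative.

```python
# Hedged sketch of Bartlett's decomposition: W ~ Wishart(df, S) is built as
# L A A^T L^T, where L is the Cholesky factor of S and A is lower triangular
# with chi-distributed diagonal entries and standard normal off-diagonals.
import numpy as np

rng = np.random.default_rng(8)

def wishart_bartlett(df, scale, rng):
    p = scale.shape[0]
    L = np.linalg.cholesky(scale)
    A = np.tril(rng.standard_normal((p, p)), k=-1)
    A[np.diag_indices(p)] = np.sqrt(rng.chisquare(df - np.arange(p)))
    return L @ A @ A.T @ L.T

S = np.array([[2.0, 0.5], [0.5, 1.0]])
draws = np.stack([wishart_bartlett(df=10, scale=S, rng=rng) for _ in range(5000)])
print("mean of draws (approximately 10 * S):\n", draws.mean(axis=0))
```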

11.
Objective: To impute and analyze the missing data that are common in hospital discharge survey forms, so as to ensure the quality of the statistical survey forms and to provide technical support and quality assurance for hospitals and higher-level health authorities in understanding the current situation and in forecasting and decision making. Methods: Using SAS 9.1, the Markov Chain Monte Carlo (MCMC) multiple imputation model was applied to impute the missing data several times, and the results were combined for analysis. Results: Imputing 10 times with MCMC gave the best results. Conclusion: Multiple imputation (MI) has clear advantages in dealing with missing data in hospital discharge survey forms, offers considerable flexibility, and achieves high imputation efficiency.
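A hedged sketch, in Python rather than SAS, of the normal-model MCMC data augmentation that underlies this kind of multiple imputation: an I-step draws the missing values from their conditional normal distribution and a P-step draws the mean and covariance, with several completed data sets retained after burn-in. Priors and settings are illustrative, not PROC MI's defaults.

```python
# Hedged sketch of normal-model MCMC data augmentation for multiple imputation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
X = rng.multivariate_normal([0, 0, 0], [[1, .5, .3], [.5, 1, .4], [.3, .4, 1]], 400)
X[rng.random(X.shape) < 0.1] = np.nan
mask = np.isnan(X)
X_fill = np.where(mask, np.nanmean(X, axis=0), X)
mu, Sigma = X_fill.mean(axis=0), np.cov(X_fill, rowvar=False)
n, p = X.shape
imputations = []

for it in range(100):
    # I-step: conditional normal draw for the missing block of each row.
    for i in np.where(mask.any(axis=1))[0]:
        m, o = mask[i], ~mask[i]
        B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
        cond_mu = mu[m] + B @ (X_fill[i, o] - mu[o])
        cond_S = Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(o, m)]
        X_fill[i, m] = rng.multivariate_normal(cond_mu, cond_S)
    # P-step: draw covariance (inverse-Wishart) and mean given the completed data.
    xbar = X_fill.mean(axis=0)
    S = (X_fill - xbar).T @ (X_fill - xbar)
    Sigma = stats.invwishart.rvs(df=n - 1, scale=S, random_state=rng)
    mu = rng.multivariate_normal(xbar, Sigma / n)
    if it >= 50 and (it - 50) % 5 == 0:        # keep 10 spaced completed data sets
        imputations.append(X_fill.copy())
```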

12.
To model the software failure process accurately with software reliability growth models, incorporating testing effort has been shown to be important. Testing-effort allocation is itself a difficult issue, and it directly affects the software release time when a reliability criterion has to be met. However, with an increasing number of parameters involved in these models, the uncertainty of the parameters estimated from the failure data can greatly affect the decision. Hence, it is important to study the impact of these model parameters. In this paper, the sensitivity of the software release time is investigated through various methods, including the one-factor-at-a-time approach, design of experiments, and global sensitivity analysis. It is shown that the results from the first two methods may not be accurate enough for complex nonlinear models. Global sensitivity analysis performs better because it considers the whole parameter space. The limitations of the different approaches are also discussed. Finally, to avoid excessive further adjustment of the software release time, interval estimation is recommended; it can be obtained from the results of the global sensitivity analysis.
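A hedged sketch of a variance-based (global) sensitivity check for a release time, assuming a Goel-Okumoto-type model in which release occurs when the failure intensity falls to a target; the parameter ranges, the intensity target, and the simple binning estimator of the first-order indices are illustrative.

```python
# Hedged sketch: crude first-order Sobol-type indices for the release time
# T = ln(a*b/lambda0)/b of a Goel-Okumoto model, estimated by Monte Carlo
# sampling and binning E[T | parameter].
import numpy as np

rng = np.random.default_rng(10)
N = 200_000
a = rng.uniform(80, 120, N)        # expected total number of faults
b = rng.uniform(0.05, 0.15, N)     # fault detection rate
lam0 = 0.1                         # required failure intensity at release

T = np.log(a * b / lam0) / b       # release time implied by the criterion

def first_order_index(x, y, bins=50):
    """Var(E[y | x]) / Var(y), estimated by binning x into equal-probability bins."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    cond_mean = np.array([y[idx == k].mean() for k in range(bins)])
    return cond_mean.var() / y.var()

print(f"S_a = {first_order_index(a, T):.3f}")
print(f"S_b = {first_order_index(b, T):.3f}")
```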

13.
舒鑫鑫  张莉  周勇 《数学学报》2017,60(5):865-882
Quantile estimation is widely used in biomedicine, socio-economic surveys, and other fields, but in practical studies data collection is often incomplete owing to various human or uncontrollable factors. Under the missing at random (MAR) assumption, this paper uses nonparametric kernel imputation and local multiple imputation to estimate sample quantiles when the response variable is missing, and establishes the large-sample properties of the resulting quantile estimators using empirical process theory. A resampling method is used to estimate the asymptotic variances of the estimators, and simulation results confirm the effectiveness of both methods. The two proposed methods have the following advantages: first, the proposed missingness adjustment requires no assumptions on the model for the missingness probability; second, the methods also apply to other estimation objective functions that are not differentiable in the parameters; finally, they extend easily to general M-estimation and allow several quantiles to be estimated simultaneously.
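A hedged sketch of the nonparametric kernel fill-in for a marginal quantile with MAR responses: complete cases contribute their observed indicator and incomplete cases contribute a Nadaraya-Watson estimate of the conditional distribution, and the mixture CDF is inverted. The bandwidth, grid, and toy data are illustrative; the local multiple imputation and resampling variance steps are not reproduced.

```python
# Hedged sketch: kernel-imputed estimate of the marginal median when the
# response is missing at random given a covariate.
import numpy as np

rng = np.random.default_rng(11)
n = 2000
x = rng.standard_normal(n)
y = 1 + x + 0.5 * rng.standard_normal(n)
delta = rng.random(n) < 1 / (1 + np.exp(-(0.3 + x)))       # response MAR given x
h = 0.3
grid = np.linspace(y[delta].min() - 1, y[delta].max() + 1, 400)

# Nadaraya-Watson weights of every subject on the complete cases.
x_obs, y_obs = x[delta], y[delta]
K = np.exp(-0.5 * ((x[:, None] - x_obs[None, :]) / h) ** 2)
W = K / K.sum(axis=1, keepdims=True)                       # (n, n_obs)

# Conditional CDF estimate F_hat(t | x_i) on the grid of t values.
ind_obs = (y_obs[None, :] <= grid[:, None]).astype(float)  # (grid, n_obs)
F_cond = ind_obs @ W.T                                     # (grid, n)

# Marginal CDF: observed indicator for complete cases, kernel fill-in otherwise,
# so only observed responses ever enter the estimator.
indicator = (y[None, :] <= grid[:, None]).astype(float)
F_hat = np.where(delta[None, :], indicator, F_cond).mean(axis=1)

q50 = grid[np.searchsorted(F_hat, 0.5)]
print(f"kernel-imputed median estimate: {q50:.3f} (population median = 1)")
```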

14.
The capability of implementing a complete Bayesian analysis of experimental data has emerged over recent years due to computational advances developed within the statistical community. The objective of this paper is to provide a practical exposition of these methods in the illustrative context of a financial event study. The customary assumption of Gaussian errors underlying development of the model is later supplemented by considering Student-t errors, thus permitting a Bayesian sensitivity analysis. The supplied data analysis illustrates the advantages of the sampling-based Bayesian approach in allowing investigation of quantities beyond the scope of classical methods.
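A hedged sketch of a Gibbs sampler for a market-model event study with Student-t errors, written as the usual normal scale mixture; the synthetic data, the event-dummy specification, the flat and 1/sigma^2 priors, and the fixed degrees of freedom are illustrative assumptions rather than the paper's exact model.

```python
# Hedged sketch: sampling-based Bayesian event study with Student-t errors.
import numpy as np

rng = np.random.default_rng(14)
T, nu = 250, 5
r_m = 0.01 * rng.standard_normal(T)                       # market return
event = np.zeros(T)
event[200:205] = 1.0                                      # 5-day event window
r = 0.001 + 1.2 * r_m - 0.01 * event + 0.02 * rng.standard_t(nu, T)

X = np.column_stack([np.ones(T), r_m, event])
beta, sigma2, lam = np.zeros(3), r.var(), np.ones(T)
draws = []

for it in range(4000):
    # beta | rest : weighted least-squares posterior under a flat prior
    XtL = X.T * lam
    V = np.linalg.inv(XtL @ X)
    beta = rng.multivariate_normal(V @ XtL @ r, sigma2 * V)
    resid = r - X @ beta
    # sigma^2 | rest : inverse gamma
    sigma2 = 1.0 / rng.gamma(T / 2, 2.0 / (lam * resid**2).sum())
    # lambda_i | rest : gamma mixing weights that downweight outlying returns
    lam = rng.gamma((nu + 1) / 2, 2.0 / (nu + resid**2 / sigma2))
    if it >= 1000:
        draws.append(beta)

gamma_draws = np.array(draws)[:, 2]
print(f"event effect: posterior mean {gamma_draws.mean():.4f}, "
      f"95% CI ({np.quantile(gamma_draws, 0.025):.4f}, {np.quantile(gamma_draws, 0.975):.4f})")
```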

15.
The logistic regression framework has long been the most widely used statistical method for assessing customer credit risk. Recently, a more pragmatic approach has been adopted, where the first issue is credit risk prediction rather than explanation. In this context, several classification techniques have been shown to perform well on credit scoring, such as support vector machines, among others. While the investigation of better classifiers is an important research topic, the specific methodology chosen in real-world applications has to deal with the challenges arising from the real-world data collected in the industry. Such data are often highly unbalanced, part of the information can be missing, and some common hypotheses, such as the i.i.d. assumption, can be violated. In this paper we present a case study based on a sample of IBM Italian customers which presents all the challenges mentioned above. The main objective is to build and validate robust models, able to handle missing information, class unbalancedness, and non-i.i.d. data points. We define a missing data imputation method and propose the use of an ensemble classification technique, subagging, which is particularly suitable for highly unbalanced data such as credit scoring data. Both the imputation and subagging steps are embedded in a customized cross-validation loop, which handles dependencies between different credit requests. The methodology has been applied using several classifiers (kernel support vector machines, nearest neighbors, decision trees, Adaboost) and their subagged versions. The use of subagging improves the performance of the base classifier, and we show that subagged decision trees achieve better performance while keeping the model simple and reasonably interpretable.
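A hedged sketch of subagging a decision tree on an unbalanced classification problem, using scikit-learn's BaggingClassifier with subsampling instead of bootstrapping; the synthetic data stand in for the IBM customer sample, and the paper's imputation step and customized cross-validation loop are not reproduced. (The `estimator=` argument is named `base_estimator=` in older scikit-learn versions.)

```python
# Hedged sketch: subagging = aggregating base learners fit on random
# *subsamples* (bootstrap=False, max_samples < 1) rather than bootstrap resamples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)   # highly unbalanced classes

subagged_tree = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5),
                                  n_estimators=100, max_samples=0.5,
                                  bootstrap=False, random_state=0)
auc = cross_val_score(subagged_tree, X, y, cv=5, scoring="roc_auc")
print(f"subagged decision tree AUC: {auc.mean():.3f} +/- {auc.std():.3f}")
```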

16.
We establish computationally flexible methods and algorithms for the analysis of multivariate skew normal models when missing values occur in the data. To facilitate the computation and simplify the theoretical derivation, two auxiliary permutation matrices are incorporated into the model to identify the observed and missing components of each observation. Under missing at random mechanisms, we formulate an analytically simple ECM algorithm for computing parameter estimates and retrieving each missing value via a single imputation. Gibbs sampling is used to perform a Bayesian inference on model parameters and to create multiple imputations for missing values. The proposed methodologies are illustrated through a real data set and comparisons are made with those obtained from fitting the normal counterparts.

17.
Discrete Markov random field models provide a natural framework for representing images or spatial datasets. They model the spatial association present while providing a convenient Markovian dependency structure and strong edge-preservation properties. However, parameter estimation for discrete Markov random field models is difficult due to the complex form of the associated normalizing constant for the likelihood function. For large lattices, the reduced dependence approximation to the normalizing constant is based on the concept of performing computationally efficient and feasible forward recursions on smaller sublattices, which are then suitably combined to estimate the constant for the entire lattice. We present an efficient computational extension of the forward recursion approach for the autologistic model to lattices that have an irregularly shaped boundary and that may contain regions with no data; these lattices are typical in applications. Consequently, we also extend the reduced dependence approximation to these scenarios, enabling us to implement a practical and efficient nonsimulation-based approach for spatial data analysis within the variational Bayesian framework. The methodology is illustrated through application to simulated data and example images. The online supplementary materials include our C++ source code for computing the approximate normalizing constant and simulation studies.
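A hedged sketch of the exact forward (transfer-matrix) recursion on which the reduced dependence approximation builds, for an autologistic model on a small regular lattice; the irregular-boundary extension and the variational Bayesian analysis are not reproduced, and the parameters are illustrative.

```python
# Hedged sketch: exact log normalizing constant of an autologistic model
# p(z) ~ exp(alpha * sum z_i + beta * sum_{i~j} z_i z_j), z_i in {0,1},
# computed by a row-by-row forward recursion over all 2**w row configurations.
import itertools
import numpy as np
from scipy.special import logsumexp

alpha, beta = -0.2, 0.4          # illustrative autologistic parameters
w, h = 6, 8                      # lattice width and height (2**w row states)

rows = np.array(list(itertools.product([0, 1], repeat=w)))          # (2**w, w)
within = alpha * rows.sum(axis=1) + beta * (rows[:, :-1] * rows[:, 1:]).sum(axis=1)
between = beta * rows @ rows.T                                      # vertical couplings

log_f = within.copy()                                               # first row
for _ in range(h - 1):                                              # remaining rows
    log_f = within + logsumexp(log_f[:, None] + between, axis=0)
log_Z = logsumexp(log_f)
print(f"log normalizing constant for the {w}x{h} lattice: {log_Z:.3f}")
```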

18.
A hierarchical model is developed for the joint mortality analysis of pension scheme datasets. The proposed model allows for a rigorous statistical treatment of missing data. While our approach works for any missing data pattern, we are particularly interested in a scenario where some covariates are observed for members of one pension scheme but not the other. Therefore, our approach allows for the joint modelling of datasets which contain different information about individual lives. The proposed model generalizes the specification of parametric models when accounting for covariates. We consider parameter uncertainty using Bayesian techniques. Model parametrization is analysed in order to obtain an efficient MCMC sampler, and address model selection. The inferential framework described here accommodates any missing-data pattern, and turns out to be useful to analyse statistical relationships among covariates. Finally, we assess the financial impact of using the covariates, and of the optimal use of the whole available sample when combining data from different mortality experiences.

19.
李英华  秦永松 《数学研究》2008,41(4):426-433
Under a MAR missingness mechanism for the response variable, we study the weak consistency, strong consistency, and asymptotic normality of the least squares estimators of the regression coefficients obtained from the observed complete data pairs, from the "complete sample" after deterministic imputation, and from the "complete sample" after fractional linear regression imputation. Through numerical simulation, we also compare the confidence intervals for β based on these estimators.
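A hedged sketch comparing the three least-squares estimators discussed: complete cases only, a deterministically imputed "complete sample", and a fractional linear-regression-imputed sample in which each missing response receives M draws with weight 1/M; the data and settings are illustrative.

```python
# Hedged sketch: least squares under complete-case analysis, deterministic
# regression imputation, and fractional linear regression imputation.
import numpy as np

rng = np.random.default_rng(12)
n, M = 1000, 5
x = rng.standard_normal(n)
y = 2 + 1.5 * x + rng.standard_normal(n)
delta = rng.random(n) < 1 / (1 + np.exp(-(0.5 + x)))      # response MAR given x

def ls(x, y, w=None):
    """(Weighted) least squares for an intercept-and-slope model."""
    X = np.column_stack([np.ones_like(x), x])
    W = np.ones_like(y) if w is None else w
    return np.linalg.solve((X * W[:, None]).T @ X, (X * W[:, None]).T @ y)

beta_cc = ls(x[delta], y[delta])                           # complete cases only
a, b = beta_cc
resid_sd = np.std(y[delta] - (a + b * x[delta]))

y_det = np.where(delta, y, a + b * x)                      # deterministic imputation
beta_det = ls(x, y_det)

# Fractional imputation: M draws per missing response, each with weight 1/M.
x_fi = np.concatenate([x[delta]] + [x[~delta]] * M)
y_fi = np.concatenate([y[delta]] + [a + b * x[~delta] +
                                    resid_sd * rng.standard_normal((~delta).sum())
                                    for _ in range(M)])
w_fi = np.concatenate([np.ones(delta.sum()), np.full(M * (~delta).sum(), 1 / M)])
beta_fi = ls(x_fi, y_fi, w_fi)

print("complete-case LS:      ", beta_cc)
print("deterministic imputed: ", beta_det)
print("fractional imputed:    ", beta_fi)
```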

20.
In practical survey sampling, nonresponse is unavoidable, and how to impute missing data is an important problem. Several imputation methods exist in the literature. In this paper, the mean-of-ratios imputation method for missing data under uniform response is applied to the estimation of a finite population mean when PPSWR sampling is used. The imputed estimator is valid under the corresponding response mechanism regardless of the model, as well as under the ratio model regardless of the response mechanism. An approximately unbiased jackknife variance estimator is also presented. All of these results are extended to the case of non-uniform response. Simulation studies show the good performance of the proposed estimators.
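A hedged sketch of mean-of-ratios imputation under PPSWR sampling with uniform response, followed by a Hansen-Hurwitz-type estimate of the population mean from the imputed sample; the population, response rate, and estimator form are illustrative, and the jackknife variance step is not reproduced.

```python
# Hedged sketch: PPSWR sample, mean-of-ratios imputation of nonrespondents,
# Hansen-Hurwitz estimation of the finite population mean.
import numpy as np

rng = np.random.default_rng(13)
N, n = 10_000, 400
x = rng.gamma(shape=2.0, scale=5.0, size=N)               # auxiliary size measure
y = 3.0 * x * (1 + 0.2 * rng.standard_normal(N))          # ratio-model population
p = x / x.sum()                                           # PPSWR draw probabilities

s = rng.choice(N, size=n, replace=True, p=p)              # PPS with replacement
respond = rng.random(n) < 0.7                             # uniform response

r_bar = np.mean(y[s][respond] / x[s][respond])            # mean of ratios
y_imp = np.where(respond, y[s], r_bar * x[s])             # imputed "complete" sample

# Hansen-Hurwitz estimator of the population mean from the imputed sample.
mean_hat = np.mean(y_imp / p[s]) / N
print(f"estimate: {mean_hat:.2f}   true mean: {y.mean():.2f}")
```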
