Similar articles
20 similar articles found (search time: 31 ms)
1.
To perform multiple regression, the least squares estimator is commonly used. However, this estimator is not robust to outliers, so robust methods such as S-estimation have been proposed. These estimators flag any observation with a large residual as an outlier and downweight it in the subsequent fit. However, a large residual may be caused by an outlier in only a single predictor variable, and downweighting the complete observation then discards information. We therefore propose the shooting S-estimator, a regression estimator designed for situations where a large number of observations suffer from contamination in a small number of predictor variables. The shooting S-estimator combines the coordinate descent algorithm with simple S-regression, which makes it robust against componentwise contamination at the cost of losing the regression equivariance property.
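The coordinate-descent ("shooting") idea can be sketched in a few lines. The following is only an illustration, not the authors' estimator: a Huber-weighted simple-regression step stands in for the simple S-regression used in the paper, and the function name, tuning constants, and data are all made up.

```python
import numpy as np

def robust_shooting(X, y, n_sweeps=50, c=1.345):
    """Coordinate descent with a robust simple-regression step per coordinate.
    A Huber-weighted step stands in for the paper's S-regression step."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            resid = y - X @ beta                       # full current residuals
            scale = np.median(np.abs(resid)) / 0.6745 + 1e-12   # MAD scale
            u = resid / (c * scale)
            w = np.where(np.abs(u) <= 1, 1.0, 1.0 / np.abs(u))  # Huber weights
            partial = resid + X[:, j] * beta[j]        # partial residuals for j
            beta[j] = np.sum(w * X[:, j] * partial) / np.sum(w * X[:, j] ** 2)
    return beta

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
beta_hat = robust_shooting(X, y)
print(np.round(beta_hat, 2))
```

Because the weights are recomputed per coordinate from the current residuals, a cell-wise outlier inflates only the residuals of the rows it sits in, which are then downweighted in every coordinate's simple regression.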

2.
On statistical models for regression diagnostics
In regression diagnostics, the case deletion model (CDM) and the mean shift outlier model (MSOM) are commonly used in practice. In this paper we show that the estimates of the CDM and the MSOM are equal in a wide class of statistical models, including the LSE, MLE, Bayesian estimate and M-estimate in linear and nonlinear regression models; the MLE in generalized linear models and exponential family nonlinear models; and MLEs of transformation parameters of explanatory variables in Box-Cox regression models. Furthermore, we study some models in which the CDM and MSOM estimates are not exactly equal but are approximately equal.
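In the least squares case the CDM/MSOM equivalence is easy to verify numerically: deleting case i gives the same coefficients as keeping it and adding an indicator covariate for it, since the indicator absorbs that case's residual entirely. A small self-contained check (the data and the index i are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
i = 7                                    # case under scrutiny (arbitrary)

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Case deletion model: drop observation i and refit.
keep = np.arange(n) != i
beta_cdm = ols(X[keep], y[keep])

# Mean shift outlier model: keep every case but add an indicator for case i.
d = (np.arange(n) == i).astype(float)
beta_msom = ols(np.column_stack([X, d]), y)[:p]

print(np.allclose(beta_cdm, beta_msom))  # prints True
```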

3.
The massive flood of numbers in ongoing large-scale periodic economic and social surveys commonly leaves little time for anything but a cursory examination of the quality of the data, and few techniques exist for giving an overview of data activity. At the U.S. Bureau of Labor Statistics, a graphical and query-based solution to these problems has recently been adopted for data review in the Current Employment Statistics survey. Chief among the motivations for creating the new system were: (1) to reduce or eliminate the arduous paper review of thousands of sample reports by review analysts; (2) to give the review analysts a more global view of sample activity while making outlier detection less of a strain; and (3) to present global views of estimates over time and among groups of subestimates. The specific graphics approaches used in the new system were designed to quickly portray both time series and cross-sectional aspects of the data, as these are both critical elements in the review process. The system allows the data analysts to track down suspicious sample members by first graphically pinpointing questionable estimates, and then pinpointing the questionable sample data used to produce those estimates. Query methods are used for cross-checking relationships among different sample data elements. Although designed for outlier detection and estimation, the data-representation methods employed in the system have opened up new possibilities for further statistical and economic uses of the data. The authors were torn between the desire for a completely automatic system of data review and the practical demands of an actual survey operating under imperfect conditions, and thus viewed the new system as an evolutionary advance, not as an ideal final solution. Possibilities opened up by the new system prompted some further thinking on finding an ideal state.

4.
Outlier mining is an important aspect of data mining, and outlier mining based on the Cook distance is among the most commonly used approaches. However, when the data suffer from multicollinearity, the traditional Cook method is no longer effective. Given the good properties of the principal component estimator, we substitute it for the least squares estimator and derive a Cook distance measure based on principal component estimation, which can be used in outlier mining. We also investigate related theoretical and applied problems.
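As a sketch of the idea, the following replaces least squares with a rank-k principal-component fit inside a delete-one Cook-type distance. This is an illustration only, not the authors' exact statistic: the scaling convention, the helper names, and the near-collinear toy data are all made up.

```python
import numpy as np

def pc_cook_distance(X, y, k):
    """Delete-one Cook-type distance using a rank-k principal-component fit in
    place of least squares (an illustrative sketch, not the paper's statistic)."""
    n = X.shape[0]

    def pc_fit(Xs, ys):
        Xc = Xs - Xs.mean(axis=0)
        yc = ys - ys.mean()
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        gamma = (U[:, :k].T @ yc) / s[:k]   # LS fit on the top-k PC scores
        return Vt[:k].T @ gamma             # back to the original coordinates

    beta = pc_fit(X, y)
    resid = y - y.mean() - (X - X.mean(axis=0)) @ beta
    sigma2 = np.mean(resid ** 2)
    D = np.empty(n)
    for i in range(n):
        m = np.arange(n) != i
        diff = beta - pc_fit(X[m], y[m])    # shift in the PC estimate
        D[i] = diff @ (X.T @ X) @ diff / (k * sigma2)
    return D

rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])  # near-collinear
y = X @ np.array([1.0, 1.0]) + 0.3 * rng.normal(size=n)
i_out = int(np.argmax(np.abs(x1)))     # plant the outlier at a leverage point
y[i_out] += 10.0
D = pc_cook_distance(X, y, k=1)
print(int(np.argmax(D)))
```

With two nearly identical predictors, the ordinary least squares fit (and hence the classical Cook distance) is numerically unstable, while the rank-1 PC fit remains well conditioned and the planted outlier dominates the distance.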

5.
In binary regression, symmetric links such as the logit and probit are usually considered standard. In the presence of an imbalance between ones and zeros, however, these links can be inappropriate and too inflexible to fit the skewness of the response curve, and are likely to lead to misspecification. This is the case with some types of insurance coverage, where the probability of a given binary response approaches zero at a different rate than it approaches one. Furthermore, with the usual links there is no skewness parameter associated with the chosen distribution that is easily interpreted regardless of the linear predictor. To overcome these problems, this paper develops a set of new skew links and discusses some of their properties. In this context, power links and their reversal versions are presented. A Bayesian inference approach using MCMC is developed for the proposed models. The methodology is illustrated on a sample of motor insurance policyholders selected randomly by gender. Results suggest that the proposed link functions are more appropriate than alternative link functions commonly used in the literature. Copyright © 2016 John Wiley & Sons, Ltd.
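The power-link construction can be illustrated directly: raising a baseline cdf (here the logistic) to a power λ skews the response curve, and the reversal version skews it the other way. A minimal sketch with an illustrative λ, not the paper's full family:

```python
import numpy as np

def logistic(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def power_logit(eta, lam):
    """Power link: baseline cdf raised to the power lambda (lambda=1 is logit)."""
    return logistic(eta) ** lam

def reversal_power_logit(eta, lam):
    """Reversal version: skews the response curve in the opposite direction."""
    return 1.0 - logistic(-eta) ** lam

# With lambda != 1 the curve is no longer symmetric about p = 0.5 at eta = 0:
for lam in (0.5, 1.0, 2.0):
    print(lam, round(float(power_logit(0.0, lam)), 3))
```

For λ = 2 the success probability at η = 0 is 0.5² = 0.25, so the curve approaches 0 more slowly than it approaches 1; the reversal link gives the mirror-image behavior.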

6.
In insurance (or finance) practice, in a regression setting, there are cases where the error distribution is not normal and others where the data are contaminated by outlier events. In such cases the classical credibility regression models lead to unsatisfactory behavior of the credibility estimators, and it is more appropriate to use quantile regression instead of ordinary least squares estimation. However, these quantile credibility models cannot perform effectively when the data have a nested (hierarchical) structure. This paper develops credibility models for regression quantiles with nested classification as an alternative to Norberg's (1986) random coefficient regression model with multi-stage nested classification, and illustrates two applications, one with insurance data and one with Fama/French financial data.

7.
The aim of this paper is to provide an alternative approach to estimating efficiency when a set of decision-making units uses non-discretionary inputs in the production process. To test the influence of these variables, our proposal uses a multi-stage approach based on Tobit regressions. To avoid potential bias, a bootstrap procedure is used to estimate these regressions. This methodology enhances previously proposed models for introducing non-controllable inputs in data envelopment analysis (DEA), thus overcoming some of their main shortcomings. We illustrate the framework with an empirical application to Spanish high schools, where non-controllable factors play a major role in explaining educational achievement.

8.
Generalized linear mixed models with semiparametric random effects are useful in a wide variety of Bayesian applications. When the random effects arise from a mixture of Dirichlet process (MDP) model with normal base measure, Gibbs sampling algorithms based on the Pólya urn scheme are often used to simulate posterior draws in conjugate models (essentially, linear regression models and models for binary outcomes). In the nonconjugate case, common problems associated with existing simulation algorithms include convergence and mixing difficulties.

This article proposes an algorithm for MDP models with exponential family likelihoods and normal base measures. The algorithm proceeds by making a Laplace approximation to the likelihood function, thereby matching the proposal with that of the Gibbs sampler. The proposal is accepted or rejected via a Metropolis-Hastings step. For conjugate MDP models, the algorithm is identical to the Gibbs sampler. The performance of the technique is investigated using a Poisson regression model with semiparametric random effects. The algorithm performs efficiently and reliably, even in problems where large-sample results do not guarantee the success of the Laplace approximation; this is demonstrated by a simulation study in which most of the count data consist of small numbers. The technique offers substantial benefits relative to existing methods, both in convergence properties and in computational cost.
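The Laplace-proposal Metropolis-Hastings idea can be shown on a one-parameter toy problem rather than the full MDP random-effects model: a Poisson likelihood with small counts and a normal prior. Newton iterations locate the posterior mode, the curvature there gives the proposal variance, and an independence MH step corrects the approximation. All settings below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.poisson(lam=2.0, size=25)       # small counts, as in the paper's study

def log_post(theta):
    # Poisson(y | exp(theta)) likelihood with a N(0, 10) prior on theta
    return np.sum(y * theta - np.exp(theta)) - theta ** 2 / 20.0

# Laplace approximation: Newton iterations for the posterior mode;
# the curvature at the mode gives the proposal variance.
theta = 0.0
for _ in range(50):
    g = np.sum(y) - len(y) * np.exp(theta) - theta / 10.0   # gradient
    h = -len(y) * np.exp(theta) - 1.0 / 10.0                # Hessian (< 0)
    theta -= g / h
mode, prop_sd = theta, np.sqrt(-1.0 / h)

# Independence Metropolis-Hastings with the Laplace proposal.
def log_q(t):
    return -0.5 * ((t - mode) / prop_sd) ** 2

draws, cur = [], mode
for _ in range(5000):
    cand = rng.normal(mode, prop_sd)
    log_a = (log_post(cand) - log_post(cur)) + (log_q(cur) - log_q(cand))
    if np.log(rng.uniform()) < log_a:
        cur = cand
    draws.append(cur)
print(round(float(np.mean(draws[500:])), 2))
```

Because the proposal matches the posterior closely at the mode, acceptance rates are high and the MH correction only has to fix the tails, which is the efficiency argument sketched in the abstract.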

9.
Least squares estimation based on the Euclidean and Lebesgue distances between fuzzy data is used to study parameter estimation for a fuzzy linear regression model under case deletion, and the parameter estimates under the two distances are compared. The inputs of the model are real numbers and the outputs are fuzzy data. A statistical diagnostic, the estimated standard error of the regression equation, is constructed to test for highly influential points or outliers in the observed data. Finally, by identifying highly influential points and outliers in real data, the statistic constructed in this paper is shown to be effective.

10.
It is well known that the presence of outlier events can cause the chain-ladder method to overestimate or underestimate the overall reserve. This lack of robustness of loss reserving estimators motivates the present paper. Outlier events (including large claims from catastrophic events) can distort the ordinary chain-ladder technique and perturb the reserve estimate. Our proposal is to apply robust statistical procedures to loss reserving estimation that are insensitive to the occurrence of outlier events in the data. The paper applies robust log-linear and ANOVA models to the analysis of loss reserving, using different types of robust estimators: LAD-estimators, M-estimators, LMS-estimators, LTS-estimators, MM-estimators (with initial S-estimate) and adaptive estimators. Comparisons of these estimators are presented in an application to a well-known data set.
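The sensitivity of the chain ladder to a single outlier cell can be seen on a toy run-off triangle. Here a median of individual development factors stands in for the paper's robust estimators (LAD, M, LMS, LTS, MM, adaptive, applied to log-linear models); the triangle figures are invented:

```python
import numpy as np

# Illustrative cumulative run-off triangle (rows: accident years, columns:
# development years); np.nan marks cells not yet observed. Figures are made up.
tri = np.array([
    [100., 150., 175., 180., 182.],
    [110., 168., 190., 196., np.nan],
    [120., 175., 200., np.nan, np.nan],
    [130., 200., np.nan, np.nan, np.nan],
    [140., np.nan, np.nan, np.nan, np.nan],
])

def factors(tri, robust=False):
    fs = []
    for j in range(tri.shape[1] - 1):
        obs = ~np.isnan(tri[:, j + 1])
        ratios = tri[obs, j + 1] / tri[obs, j]
        # median of individual development factors vs. the volume-weighted mean
        fs.append(np.median(ratios) if robust
                  else tri[obs, j + 1].sum() / tri[obs, j].sum())
    return fs

def ultimates(tri, fs):
    out = tri.copy()
    for j, f in enumerate(fs):
        nan = np.isnan(out[:, j + 1])
        out[nan, j + 1] = out[nan, j] * f   # project to ultimate, column by column
    return out[:, -1].sum()

clean = tri
dirty = tri.copy()
dirty[0, 1] = 600.0                     # one contaminated cell (outlier event)

for robust in (False, True):
    shift = abs(ultimates(dirty, factors(dirty, robust))
                - ultimates(clean, factors(clean, robust)))
    print("robust" if robust else "classical", round(float(shift), 1))
```

One wild cell drags both adjacent classical development factors (it inflates one ratio and deflates the next), shifting total projected ultimates substantially, while the median-based factors barely move.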

11.
We introduce a binary regression accounting-based model for bankruptcy prediction of small and medium enterprises (SMEs). The main advantage of the model lies in its predictive performance in identifying defaulted SMEs. Another advantage, which is especially relevant for banks, is that the relationship between the accounting characteristics of SMEs and the response is not assumed a priori (e.g., linear, quadratic or cubic) and can be determined from the data. The proposed approach uses the quantile function of the generalized extreme value distribution as the link function, as well as smooth functions of accounting characteristics to flexibly model covariate effects. Therefore, the usual scoring-model assumptions of a symmetric link function and linear or pre-specified covariate-response relationships are relaxed. Out-of-sample and out-of-time validation on Italian data shows that our proposal outperforms the commonly used (logistic) scoring model for different default horizons.

12.
In this paper we develop a simulation model to study bed occupancy levels in an Intensive Care Unit (ICU). The main contributions of this study are: (1) a proposal for generalized regression models to fully capture the high variability of patients' length of stay; (2) proof that a simulation model that does not incorporate the management decisions of clinical staff cannot be considered valid; (3) the development of a mathematical model to represent these management decisions; and (4) a proposal for a method combining optimization with simulation to estimate the model parameters. This yields a valid simulation model that includes the physician management of an ICU. Validation is accomplished by comparing distribution patterns in daily bed occupancy records against simulated bed occupancy data. The methodology is tested using data provided by the Hospital of Navarre in Spain.

13.
We propose numerical and graphical methods for outlier detection in hierarchical Bayes modeling and analyses of repeated measures regression data from multiple subjects; data from a single subject are generically called a "curve". The first stage of our model has curve-specific regression coefficients with possibly autoregressive errors of a prespecified order. The first-stage regression vectors for different curves are linked in a second-stage modeling step, possibly involving additional regression variables. Detection of the stage at which the curve appears to be an outlier, and of the magnitude and specific component of the violation at that stage, is accomplished by embedding the null model into a larger parametric model that can accommodate such unusual observations. We give two examples to illustrate the diagnostics, develop a BUGS program to compute them using MCMC techniques, and examine the sensitivity of the conclusions to the prior modeling assumptions.

14.
Bayes estimation for nonlinear models with measurement errors
Many functional models in measurement are nonlinear regression models. When the regressors contain measurement errors, we obtain a nonlinear measurement-error model. This paper discusses Bayes estimation of the unknown parameters of such a model when they have normal prior distributions, and carries out an influence analysis of these estimates. It is proved that the Bayes estimates of the parameters are identical under the case deletion model and the mean shift outlier model, and an influence measure for the Bayes estimates under the case deletion model is given via Cook's statistic.

15.
Linking end-customer preferences with variables controlled at a manufacturing plant is a main idea behind popular Design for Six Sigma management techniques. Multiple criteria decision making (MCDM) approaches can be used for such purposes, but in these techniques the decision-maker's (DM) utility function, if modelled explicitly, is considered known with certainty once assessed. Here, a new algorithm based on a Bayesian methodology is proposed to solve an MCDM problem with applications to Design for Six Sigma. In the first stage, it is assumed that there are process responses that are functions of certain controllable factors or regressors; this relation is modelled from experimental data. In the second stage, the utility function of one or more DMs or customers is described in a statistical model as a function of the process responses, based on surveys; this step considers the uncertainty in the utility function(s) explicitly. The methodology then maximizes, with respect to the controllable factors of the first stage, the probability that the DM's or customer's utility exceeds a given lower bound. Both stages are modelled with Bayesian regression techniques, and the advantages of the Bayesian approach over traditional methods are highlighted.

16.
High-quality decisions increasingly depend on high-quality data mining and analysis, and high-quality data mining in turn depends on high-quality data. In surveys of the utilization of large scientific instruments, subjective and objective factors inevitably produce anomalous values in some of the data, degrading data quality, so suitable methods are needed to detect and handle these anomalies; different types of data often require different outlier-detection methods. This paper analyzes the overall characteristics of instrument-utilization survey data and the general methods available, and, taking as its main thread the operating-hours and shared-hours data from the "Survey of Large Instrument Resources in China" (2009) organized by the platform center of the Ministry of Science and Technology, compares the suitability of regression methods, depth-based methods, and box-plot methods for detecting outliers in different types of data. Examining the data from different angles and applying the appropriate methods identifies the relevant suspicious outliers, which supports the subsequent analysis and handling of anomalous instrument-utilization data, improves data quality, lays a foundation for the comprehensive evaluation of instrument utilization, and offers a useful reference for outlier detection in the preprocessing of science-and-technology resource survey data.

17.
Model averaging is a good alternative to model selection, which can deal with the uncertainty from model selection process and make full use of the information from various candidate models. However, most of the existing model averaging criteria do not consider the influence of outliers on the estimation procedures. The purpose of this paper is to develop a robust model averaging approach based on the local outlier factor (LOF) algorithm which can downweight the outliers in the covariates. Asymptotic optimality of the proposed robust model averaging estimator is derived under some regularity conditions. Further, we prove the consistency of the LOF-based weight estimator tending to the theoretically optimal weight vector. Numerical studies including Monte Carlo simulations and a real data example are provided to illustrate our proposed methodology.
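The LOF score underlying the proposed downweighting can be computed in plain numpy. The weighting rule below, w = min(1, 1/LOF), is a stand-in for illustration, not the paper's scheme, and the didactic O(n²) implementation is not necessarily the one used there:

```python
import numpy as np

def lof_scores(X, k=5):
    """Local outlier factor (Breunig et al., 2000) in plain numpy -- a didactic
    O(n^2) version; LOF is close to 1 for inliers and large for outliers."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # a point is not its own neighbour
    knn = np.argsort(D, axis=1)[:, :k]          # k nearest neighbours
    kdist = D[np.arange(n), knn[:, -1]]         # k-distance of each point
    # reachability distances, then local reachability density
    reach = np.maximum(D[np.arange(n)[:, None], knn], kdist[knn])
    lrd = k / reach.sum(axis=1)
    return (lrd[knn].sum(axis=1) / k) / lrd

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(30, 2)), [[10.0, 10.0]]])  # cluster + outlier
lof = lof_scores(X)
w = np.minimum(1.0, 1.0 / lof)   # illustrative covariate weights from LOF
print(int(np.argmax(lof)), round(float(w[-1]), 2))
```

The isolated point gets a LOF far above 1 and hence a weight far below 1, while points inside the cluster keep weights near 1.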

18.
System identification is a method used to obtain the modal characteristics of existing structural systems through dynamic observations. Modal characteristics of the system can be used for a variety of purposes, including model updates, damage assessment, active control and original design re-evaluation. In this paper, the transfer functions relating the input quantities (traffic load, wind speed and temperature variations) and output quantities (lateral and longitudinal movement) of the towers of the Bosphorus Suspension Bridge were defined with the help of two models, namely, the parametric Multiple Input–Single Output (MISO) Auto-Regressive with eXogenous input (ARX) model and the multiple regression model. The latter model was primarily used to check for the existence of outlier measurements and to identify the input quantities that contribute significantly to the structural movements, since outlier measurements in the observations and insignificant input quantities increase the difficulty of defining the parameters of the inherently complex MISO ARX model. Least Squares (LS) and bi-square weighted robust predictors were used to determine the parameters of the multiple regression model used in this study. Regression analysis showed that there were no outlier measurements in the tower observations and that the effect of wind speed on the longitudinal movements was statistically insignificant. Furthermore, the sensitivity of the LS and bi-square robust predictors to outlier measurements was also checked in the regression analysis by adding rough errors to the observations. Finally, it was also observed that the MISO ARX512, ARX511, ARX411 and ARX415 models defined by taking into account the results of regression analysis estimate structural movements more accurately than the multiple regression model ARX010.

19.
An important model for handling multivariate data is the partially linear single-index regression model combined with a very flexible distribution, the beta distribution, which is commonly used to model data restricted to an open interval of the line. In this paper, the score test is extended to the partially linear single-index beta regression model. Penalized likelihood estimation based on P-splines is proposed, and on the basis of this estimation a score test statistic for a varying dispersion parameter is constructed. Its asymptotic properties are investigated, and simulated examples are used to illustrate the proposed methods.

20.
This paper explains some drawbacks of previous approaches for detecting influential observations in deterministic nonparametric data envelopment analysis models, as developed by Yang et al. (Annals of Operations Research 173:89–103, 2010). For example, the efficiency scores and relative entropies obtained in that model are uninformative for outlier detection, and the empirical distribution of the estimated relative entropies is not a Monte Carlo approximation. We develop a new method to detect whether a specific DMU is truly influential, and a statistical test is applied to determine the significance level. An application to measuring the efficiency of hospitals shows the advantages of the method, which leads to significant improvements in outlier detection.
