Similar literature
20 similar articles found.
1.
This paper suggests an outlier detection procedure which applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample. Although efficiency is estimated in a deterministic frontier approach, each potential outlier initially receives the benefit of the doubt of not being an outlier. We survey several outlier detection procedures and select five complementary methodologies which, taken together, are able to detect all influential observations. To exploit the distinct strengths of leverage, peer count, super-efficiency, the order-m method, and the peer index, we propose to flag as outliers those observations that at least two of the procedures simultaneously reveal as atypical. A simulated example demonstrates the usefulness of this approach. The model is applied to the Portuguese drinking water sector, for which we have an unusually rich data set.
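
The "at least two of the procedures" rule described in this abstract can be sketched as a simple vote count. This is only an illustration under stated assumptions: the function name `consensus_outliers`, the method labels, and the unit names are hypothetical, and the five detection procedures themselves are not implemented here; their outputs are taken as given.

```python
from collections import Counter


def consensus_outliers(flags_by_method, min_votes=2):
    """Return the units flagged as atypical by at least `min_votes` methods.

    `flags_by_method` maps a method label to the units that method flagged.
    The labels and units are hypothetical inputs, not results from the paper.
    """
    votes = Counter()
    for flagged in flags_by_method.values():
        # Use a set so that a method counts at most once per unit.
        votes.update(set(flagged))
    return sorted(unit for unit, count in votes.items() if count >= min_votes)


# Hypothetical flags from five procedures for units "A" to "C".
example_flags = {
    "leverage": ["A", "C"],
    "peer_count": ["C"],
    "super_efficiency": ["B", "C"],
    "order_m": ["B"],
    "peer_index": [],
}
example_result = consensus_outliers(example_flags)
```

Requiring agreement between methods trades some sensitivity for fewer false alarms, since a unit flagged by a single procedure (here "A") is not reported.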

2.
Summary: The detection of multidimensional outliers is a fundamental and important problem in applied statistics. The unreliability of multivariate outlier detection techniques such as Mahalanobis distance and hat matrix leverage has led to the development of techniques which have been known in the statistical community for well over a decade. The literature on this subject is vast and growing. In this paper, we propose to use the artificial intelligence technique of self-organizing map (SOM) for detecting multiple outliers in multidimensional datasets. SOM, which produces a topology-preserving mapping of the multidimensional data cloud onto a lower-dimensional visualizable plane, provides an easy way to detect multidimensional outliers in the data, at respective levels of leverage. The proposed SOM-based method not only identifies the multidimensional outliers, it also provides information about the entire outlier neighbourhood. Being an artificial intelligence technique, SOM-based outlier detection is non-parametric and can be used to detect outliers in very large multidimensional datasets. The method is applied to detect outliers in varied types of simulated multivariate datasets, a benchmark dataset, and a real-life cheque processing dataset. The results show that SOM can be used effectively for multidimensional outlier detection.

3.
The least squares method, based on the Euclidean distance and on the Lebesgue distance between fuzzy data, is used to study parameter estimation under case deletion in a fuzzy linear regression model whose input is real-valued data and whose output is fuzzy data. The parameter estimates under the two distances are compared. A diagnostic statistic, the estimated standard error of the regression equation, is constructed to test for highly influential points or outliers in the observed data. Finally, identifying highly influential points or outliers in real data shows that the proposed statistic is effective.

4.
In this paper, we present a unified diagnostic method for linear measurement error models based upon the corrected likelihood of Nakamura (1990, Biometrika, 77, 127–137). Both global influence and local influence are discussed. The case-deletion model and mean-shift outlier model are considered, and they are shown to be approximately equivalent. Several diagnostic measures are derived and discussed. It is found that they can be written in terms of the residual and leverage measure. Some existing results are improved. A numerical example illustrates that our method is useful for diagnosing influential observations.

5.
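Statistical diagnostics for spatially varying-coefficient models   Total citations: 1 (self-citations: 0, citations by others: 1)
Spatially varying-coefficient models have been widely applied as an effective method for spatial data analysis. This paper studies statistical diagnostics and influence analysis for this model. First, we define Cook's statistic based on the case-deletion model; second, we discuss testing for outliers based on the mean-shift model.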
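
To make the case-deletion idea behind Cook's statistic concrete, here is a minimal sketch for ordinary simple linear regression. That setting is an assumption made for brevity; it is not the spatially varying-coefficient model of the abstract, and the function name `cooks_distance` and the sample data are hypothetical. For each point, the statistic combines the squared residual with the leverage, D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2), which equals the scaled change in the fitted values when point i is deleted.

```python
def cooks_distance(x, y):
    """Cook's distance of every point in the fit y = a + b * x (least squares)."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Unbiased estimate of the error variance.
    s2 = sum(e * e for e in residuals) / (n - p)
    # Leverage (diagonal of the hat matrix) in closed form for one predictor.
    leverages = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [
        e * e * h / (p * s2 * (1.0 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical data: the last response is far above the trend of the others.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]
distances = cooks_distance(xs, ys)
most_influential = distances.index(max(distances))
```

In this sample the last point has both a large residual and high leverage, so it receives the largest distance.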

6.
A study is carried out to investigate the sampling properties of the outlier test statistics of a procedure developed for detecting temporary change (TC) in BL(1,1,1,1) processes. The properties are examined with respect to the sample size, the type of outlier, and the size of the coefficients of the BL(1,1,1,1) process. The results show that, in general, the outlier detection procedure is capable of detecting TC, although the performance is affected if ω is large. (© 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)

7.
The chain-ladder method is a widely used technique to forecast the reserves that have to be kept regarding claims that are known to exist, but for which the actual size is unknown at the time the reserves have to be set. In practice it can be easily seen that even one outlier can lead to a huge over- or underestimation of the overall reserve when using the chain-ladder method. This indicates that individual claims can be very influential when determining the chain-ladder estimates. In this paper the effect of contamination is mathematically analyzed by calculating influence functions in the generalized linear model framework corresponding to the chain-ladder method. It is proven that the influence functions are unbounded, confirming the sensitivity of the chain-ladder method to outliers. A robust alternative is introduced to estimate the generalized linear model parameters in a more outlier-resistant way. Finally, based on the influence functions and the robust estimators, a diagnostic tool is presented highlighting the influence of every individual claim on the classical chain-ladder estimates. With this tool it is possible to detect immediately which claims have an abnormally positive or negative influence on the reserve estimates. Further examination of these influential points is then advisable. A study of artificial and real run-off triangles shows the good performance of the robust chain-ladder method and the diagnostic tool.
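
The sensitivity described in this abstract can be illustrated with a minimal sketch of the classical (non-robust) chain-ladder calculation on a cumulative run-off triangle. The function names and the tiny triangles are hypothetical, chosen only for illustration; the paper's influence functions and robust estimator are not reproduced here.

```python
def development_factors(triangle):
    """Volume-weighted development factors of a cumulative run-off triangle.

    `triangle` is a list of rows; row i holds the cumulative claims of
    accident year i and is one entry shorter than the row above it.
    """
    n = len(triangle)
    factors = []
    for j in range(n - 1):
        # Only rows observed in both development periods j and j + 1 count.
        rows = [row for row in triangle if len(row) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def total_reserve(triangle):
    """Sum over accident years of projected ultimate minus latest observed."""
    factors = development_factors(triangle)
    reserve = 0.0
    for row in triangle:
        ultimate = row[-1]
        # Roll the latest observation forward through the remaining periods.
        for j in range(len(row) - 1, len(factors)):
            ultimate *= factors[j]
        reserve += ultimate - row[-1]
    return reserve


# Hypothetical triangles that differ in a single cell (165 versus 465).
clean = [[100, 150, 165], [110, 165], [120]]
contaminated = [[100, 150, 165], [110, 465], [120]]
reserve_clean = total_reserve(clean)
reserve_contaminated = total_reserve(contaminated)
```

Because every development factor is a ratio of column sums, one contaminated cell shifts the factor applied to all later accident years, so the reserve for the contaminated triangle is several times the clean one.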

8.
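Outlier diagnosis is a classical problem in statistics. Detecting outliers and reducing their influence on the analysis of tax assessment data is a meaningful line of research. However, outlier diagnosis usually relies on global identification methods suited to unimodal distributions. Drawing on local correlation integral theory, this paper proposes an identification method based on nonparametric density estimation. The method is applicable to multimodal distributions, can identify outliers of a local nature, and retains strong identification ability for samples with a high proportion of outliers. Based on a sample of 10,920 enterprises from one city, an empirical analysis compares the tax assessment method currently used by the tax bureau with the proposed one; the results show that the tax bureau's method carries a substantial tax assessment risk (risk of misjudgment).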

9.
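Nie Bin, Wang Xi, Hu Xue. 《运筹与管理》 (Operations Research and Management Science), 2019, 28(1): 101-107
In quality control, identifying outliers among nonlinear profiles is one of the key research problems. This paper combines data analysis techniques including wavelet analysis, data depth, and cluster analysis to propose a new outlier identification method for non-normal variation. Using simulation, the new method is compared with the χ² control chart method; the results confirm that the new method identifies outliers with higher accuracy and stability, showing better outlier identification performance. Finally, the new method is validated on a vertical density profile example for wood boards, and the analysis shows that it can effectively identify abnormal profile data.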

10.
In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results, particularly for two-stage analyses. In the DEA-related literature, prior work on this issue has focused on the efficient frontier as a basis for detecting outliers. An iterative approach for dealing with the potential for one outlier to mask the presence of another has been proposed but not demonstrated. This paper proposes using both the efficient frontier and the inefficient frontier to identify outliers and thereby improve the accuracy of second-stage results in two-stage nonparametric analysis. The iterative outlier detection approach is implemented in a leave-one-out method using both the efficient frontier and the inefficient frontier and demonstrated in a two-stage semi-parametric bootstrapping analysis of a classic data set. The results show that the conclusions drawn can be different when outlier identification includes consideration of the inefficient frontier.

11.
In this paper we introduce COV, a novel information retrieval (IR) algorithm for massive databases based on vector space modeling and spectral analysis of the covariance matrix of the document vectors, which reduces the scale of the problem. Since the dimension of the covariance matrix depends on the attribute space and is independent of the number of documents, COV can be applied to databases that are too massive for methods based on the singular value decomposition of the document-attribute matrix, such as latent semantic indexing (LSI). In addition to improved scalability, theoretical considerations indicate that results from our algorithm tend to be more accurate than those from LSI, particularly in detecting subtle differences in document vectors. We demonstrate the power and accuracy of COV through an important topic in data mining, known as outlier cluster detection. We propose two new algorithms for detecting major and outlier clusters in databases: the first is based on LSI, and the second on COV. Our implementation studies indicate that our cluster detection algorithms outperform the basic LSI and COV algorithms in detecting outlier clusters.

12.
We consider the problem of deleting bad influential observations (outliers) in linear regression models. The problem is formulated as a Quadratic Mixed Integer Programming (QMIP) problem, where penalty costs for discarding outliers are included in the objective function. The optimum solution defines a robust regression estimator called penalized trimmed squares (PTS). Due to the high computational complexity of the resulting QMIP problem, the proposed robust procedure is computationally suitable for small-sample data. The computational performance and the effectiveness of the new procedure are improved significantly by using the idea of the ε-insensitive loss function from support vector machine regression. Small errors are ignored, and the mathematical formulation gains the sparseness property. The good performance of the ε-Insensitive PTS (IPTS) estimator allows identification of multiple outliers while avoiding masking or swamping effects. The computational effectiveness and successful outlier detection of the proposed method are demonstrated via simulated experiments. This research has been partially funded by the Greek Ministry of Education under the program Pythagoras II.

13.
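Traditional outlier identification methods for linear models are prone to misjudgment: normal points are classified as outliers, or outliers as normal points. To address this problem, the idea of identifying outliers with the reversible jump Markov chain Monte Carlo method is proposed and tested on real data; the identification results are clearly better than those of traditional methods.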

14.
Cluster-based outlier detection   Total citations: 1 (self-citations: 0, citations by others: 1)
Outlier detection has important applications in the field of data mining, such as fraud detection, customer behavior analysis, and intrusion detection. Outlier detection is the process of detecting the data objects which are grossly different from or inconsistent with the remaining set of data. Outliers are traditionally considered as single points; however, a key observation is that many abnormal events have both temporal and spatial locality, and so may form small clusters that also need to be deemed outliers. In other words, not only a single point but also a small cluster can be an outlier. In this paper, we present a new definition for outliers, the cluster-based outlier, which is meaningful and gives weight to local data behavior, and show how to detect outliers with the clustering algorithm LDBSCAN (Duan et al. in Inf. Syst. 32(7):978–986, 2007), which is capable of finding clusters and assigning LOF (Breunig et al. in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 93–104, 2000) to single points.

15.
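A study on identifying abnormal trading behavior   Total citations: 1 (self-citations: 1, citations by others: 0)
Within an unsupervised learning framework, this paper uses a quantile regression model combined with change-point testing to identify abnormal trading behavior in the Chinese securities market. By analyzing anomalies in the co-evolution relationship between changes in shareholding ratios and stock returns, it provides a new method for setting criteria to identify abnormal trading behavior and for defining thresholds objectively. Based on this method, regulators can build a real-time supervision system organized by period, level, and category, improving regulatory efficiency.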

16.
This paper proposes a statistical approach to handle the problem of detecting influential observations in deterministic nonparametric Data Envelopment Analysis (DEA) models. We use the bootstrap method to estimate the underlying distribution for efficiency scores in order to avoid making unrealistic assumptions about the true distribution. To measure whether a specific DMU is truly influential, we employ relative entropy to detect the change in the distribution after the DMU in question is removed. A statistical test has been applied to determine the significance level. Two examples from the literature are discussed and comparisons to previous methods are provided.

17.
In this paper we obtain new bounds for the minimum output entropies of random quantum channels. These bounds rely on random matrix techniques arising from free probability theory. We then revisit the counterexamples developed by Hayden and Winter to get violations of the additivity equalities for minimum output Rényi entropies. We show that random channels obtained by randomly coupling the input to a qubit violate the additivity of the $p$-Rényi entropy, for all $p>1$. For some sequences of random quantum channels, we compute almost surely the limit of their Schatten $S_1 \to S_p$ norms.

18.
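High-quality decisions increasingly depend on high-quality data mining and analysis, and high-quality data mining depends on high-quality data. In surveys of large-scale instrument utilization, subjective and objective factors always cause some data to be anomalous, which degrades data quality. Suitable methods are therefore needed to detect and handle anomalous data. Different types of data often require different outlier detection methods. This paper analyzes the overall characteristics of, and general methods for, survey data on large-scale instrument utilization. Taking the instrument usage hours and shared hours data from the "Survey of the Current State of Large-Scale Instrument Resources in China" (2009), led by the platform center of the national Ministry of Science and Technology, as the main thread, it compares the suitability of regression methods, depth-based methods, and the boxplot method for outlier detection on different types of data. Choosing different perspectives, and testing and adopting suitable methods for each, to find suspicious outliers helps the subsequent analysis and handling of anomalous data on large-scale instrument utilization, improves data quality, lays a foundation for the comprehensive evaluation of large-scale instrument utilization, and offers a useful reference for outlier detection in the preprocessing of science and technology resource survey data.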
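
Of the methods compared in this abstract, the boxplot method is the simplest to sketch. The following is a minimal illustration using Tukey's fences with the conventional multiplier 1.5. The function name `boxplot_outliers`, the choice of the "inclusive" quartile method, and the usage-hours figures are all assumptions made for illustration, not data or settings from the survey.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical annual usage hours reported for eight instruments.
usage_hours = [120, 135, 150, 160, 170, 180, 190, 2000]
suspicious = boxplot_outliers(usage_hours)
```

The rule needs no distributional model, which suits a first screening pass, but on strongly skewed data it can flag many legitimate values, so flagged entries are better treated as suspicious than as confirmed errors.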

19.
A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data.

20.
This paper examines the relative efficiency of alternative methods of producing care for the developmentally disabled. A linear programming framework is used to construct a production frontier which allows measurement of relative efficiency among institutions in the sample. Tests are performed to detect influential observations in the data which might result from measurement error which could distort the efficiency measures. Different types of institutions are compared in terms of average efficiency. Policy implications of the analysis are discussed in the concluding section.
