1.
Influential observations in frontier models, a robust non-oriented approach to the water sector (total citations: 1, self-citations: 0, citations by others: 1)
This paper suggests an outlier detection procedure which applies a nonparametric model accounting for undesired outputs and
exogenous influences in the sample. Although efficiency is estimated in a deterministic frontier approach, each potential
outlier is initially given the benefit of the doubt of not being an outlier. We survey several outlier detection procedures and select
five complementary methodologies which, taken together, are able to detect all influential observations. To exploit the particular strengths
of each of the five (the leverage, the peer count, the super-efficiency, the order-m method and the peer index), it is proposed to classify as outliers those observations which are simultaneously revealed as atypical
by at least two of the procedures. A simulated example demonstrates the usefulness of this approach. The model is applied
to the Portuguese drinking water sector, for which we have an unusually rich data set.
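The "at least two of the procedures" rule in this abstract can be sketched as a simple vote count. This is only an illustration: the function name `consensus_outliers`, the procedure labels and the observation IDs are hypothetical, and the five procedures themselves are not implemented here.

```python
from collections import Counter


def consensus_outliers(flags_by_procedure, min_votes=2):
    """Return the observations flagged as atypical by at least `min_votes` procedures.

    `flags_by_procedure` maps a procedure label to the collection of
    observation IDs that procedure flagged (hypothetical input format).
    """
    votes = Counter()
    for flagged in flags_by_procedure.values():
        # set() ensures each procedure contributes at most one vote per observation
        votes.update(set(flagged))
    return sorted(obs for obs, n in votes.items() if n >= min_votes)


# Hypothetical flags from five detection procedures
flags = {
    "leverage": {3, 7},
    "peer_count": {7},
    "super_efficiency": {3, 9},
    "order_m": set(),
    "peer_index": {12},
}
print(consensus_outliers(flags))
```

Observations flagged by a single procedure (9 and 12 in this made-up input) keep the benefit of the doubt and are not reported.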
2.
Summary: The detection of multidimensional outliers is a fundamental and important problem in applied statistics. The unreliability
of multivariate outlier detection techniques such as the Mahalanobis distance and hat matrix leverage has led to the development of
techniques which have been known in the statistical community for well over a decade. The literature on this subject is vast
and growing. In this paper, we propose to use the artificial intelligence technique of the self-organizing map (SOM) for detecting multiple outliers in multidimensional datasets. SOM, which produces a topology-preserving mapping of
the multidimensional data cloud onto a lower-dimensional, visualizable plane, provides an easy way of detecting multidimensional
outliers in the data, at their respective levels of leverage. The proposed SOM-based method for outlier detection not only identifies
the multidimensional outliers, it also provides information about the entire outlier neighbourhood. Being an artificial
intelligence technique, the SOM-based outlier detection technique is non-parametric and can be used to detect outliers in very
large multidimensional datasets. The method is applied to detect outliers in varied types of simulated multivariate datasets,
a benchmark dataset and a real-life cheque processing dataset. The results show that SOM can be used effectively as
a technique for multidimensional outlier detection.
3.
Zhang Aiwu 《应用概率统计》2012,28(6):625-636
The least squares method, based respectively on the Euclidean
distance and the Lebesgue distance between fuzzy data, is used to study
parameter estimation under case deletion in a fuzzy linear regression
model, and the parameter estimates obtained under the two
distances are compared. The input of the model is real-valued data and
the output is fuzzy data. A diagnostic statistic, the estimated
standard error of the regression equation, is constructed to test for highly
influential points or outliers in the observed data. Finally,
identifying highly influential points or outliers in actual data
shows that the statistic constructed in this paper is effective.
4.
Xu-Ping Zhong Bo-Cheng Wei Wing-Kam Fung 《Annals of the Institute of Statistical Mathematics》2000,52(2):367-379
In this paper, we present a unified diagnostic method for linear measurement error models based upon the corrected likelihood of Nakamura (1990, Biometrika, 77, 127–137). Both global influence and local influence are discussed. The case-deletion model and the mean-shift outlier model are considered, and they are shown to be approximately equivalent. Several diagnostic measures are derived and discussed. It is found that they can be written in terms of the residual and leverage measures. Some existing results are improved. A numerical example illustrates that our method is useful for diagnosing influential observations.
5.
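Statistical diagnostics for spatially varying coefficient models (total citations: 1, self-citations: 0, citations by others: 1)
Spatially varying coefficient models have been widely applied as an effective class of methods for spatial data analysis. This paper studies statistical diagnostics and influence analysis for this model. First, we define the Cook statistic based on the case-deletion model; second, we discuss the problem of testing for outliers based on the mean-shift model.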
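As a rough illustration of what a case-deletion Cook statistic measures, the sketch below computes Cook's distance for ordinary simple linear regression. This is an assumption-laden stand-in, not the statistic derived for the spatially varying coefficient model; the function name `cooks_distances` and the data are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance of each case for the least squares fit y = a + b*x.

    Uses the closed form D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2,
    which equals the effect of deleting case i on the fitted values.
    """
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]  # hat matrix diagonal
    return [e ** 2 / (p * s2) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverages)]


# Hypothetical data: the last case has high leverage and lies off the trend
x = [1, 2, 3, 4, 5, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
distances = cooks_distances(x, y)
print(max(range(len(x)), key=lambda i: distances[i]))
```

A large distance marks a case whose deletion would change the fit substantially; the cut-off used to call a case influential is a separate choice not shown here.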
6.
Azami Zaharim Amiruddin Ismail Shahrum Abdullah Ibrahim Mohamed Ibrahim Ahmad 《PAMM》2007,7(1):2030041-2030042
A study is carried out to investigate the sampling properties of the outlier test statistics of a procedure developed for detecting temporary change in BL(1,1,1,1) processes. It is done with respect to the sample size, the type of outlier and the size of the coefficients of the BL(1,1,1,1) process. The results show that, in general, the outlier detection procedure is capable of detecting TC, although the performance is affected if ω is large. (© 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)
7.
The chain-ladder method is a widely used technique to forecast the reserves that have to be kept regarding claims that are known to exist, but for which the actual size is unknown at the time the reserves have to be set. In practice it can be easily seen that even one outlier can lead to a huge over- or underestimation of the overall reserve when using the chain-ladder method. This indicates that individual claims can be very influential when determining the chain-ladder estimates. In this paper the effect of contamination is mathematically analyzed by calculating influence functions in the generalized linear model framework corresponding to the chain-ladder method. It is proven that the influence functions are unbounded, confirming the sensitivity of the chain-ladder method to outliers. A robust alternative is introduced to estimate the generalized linear model parameters in a more outlier resistant way. Finally, based on the influence functions and the robust estimators, a diagnostic tool is presented highlighting the influence of every individual claim on the classical chain-ladder estimates. With this tool it is possible to detect immediately which claims have an abnormally positive or negative influence on the reserve estimates. Further examination of these influential points is then advisable. A study of artificial and real run-off triangles shows the good performance of the robust chain-ladder method and the diagnostic tool.
8.
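Outlier diagnosis is a classical problem in statistics. Identifying outliers and reducing their influence on the analysis of tax assessment data is a worthwhile line of research. However, conventional outlier diagnosis generally relies on global identification methods suited to unimodal distributions. Drawing on the theory of the local correlation integral, this paper proposes an identification method based on nonparametric density estimation. The method is suited to multimodal distributions, can identify outliers of a local nature, and retains strong identification ability for samples with a high proportion of outliers. Based on a sample of 10,920 enterprises from one city, an empirical analysis compares the tax assessment method currently used by the tax bureau with the proposed method. The results show that the method used by the tax bureau carries a considerable tax assessment risk (risk of misjudgment).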
9.
10.
In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results, particularly for two-stage analyses. In the DEA-related literature, prior work on this issue has focused on the efficient frontier as a basis for detecting outliers. An iterative approach for dealing with the potential for one outlier to mask the presence of another has been proposed but not demonstrated. This paper proposes using both the efficient frontier and the inefficient frontier to identify outliers and thereby improve the accuracy of second stage results in two-stage nonparametric analysis. The iterative outlier detection approach is implemented in a leave-one-out method using both the efficient frontier and the inefficient frontier and demonstrated in a two-stage semi-parametric bootstrapping analysis of a classic data set. The results show that the conclusions drawn can be different when outlier identification includes consideration of the inefficient frontier.
11.
《Journal of Computational and Applied Mathematics》2002,149(1):119-129
In this paper we introduce COV, a novel information retrieval (IR) algorithm for massive databases based on vector space modeling and spectral analysis of the covariance matrix of the document vectors, which reduces the scale of the problem. Since the dimension of the covariance matrix depends on the attribute space and is independent of the number of documents, COV can be applied to databases that are too massive for methods based on the singular value decomposition of the document-attribute matrix, such as latent semantic indexing (LSI). In addition to improved scalability, theoretical considerations indicate that results from our algorithm tend to be more accurate than those from LSI, particularly in detecting subtle differences in document vectors. We demonstrate the power and accuracy of COV through an important topic in data mining, known as outlier cluster detection. We propose two new algorithms for detecting major and outlier clusters in databases: the first is based on LSI, and the second on COV. Our implementation studies indicate that our cluster detection algorithms outperform the basic LSI and COV algorithms in detecting outlier clusters.
12.
We consider the problem of deleting bad influential observations (outliers) in linear regression models. The problem is formulated
as a Quadratic Mixed Integer Programming (QMIP) problem, where penalty costs for discarding outliers are included in the objective
function. The optimum solution defines a robust regression estimator called penalized trimmed squares (PTS). Due to the high
computational complexity of the resulting QMIP problem, the proposed robust procedure is computationally suitable only for small
samples. The computational performance and the effectiveness of the new procedure are improved significantly by using
the idea of the ε-insensitive loss function from support vector machine regression. Small errors are ignored, and the mathematical formulation
gains the sparseness property. The good performance of the ε-Insensitive PTS (IPTS) estimator allows identification of multiple outliers while avoiding masking or swamping effects. The computational
effectiveness and successful outlier detection of the proposed method are demonstrated via simulated experiments.
This research has been partially funded by the Greek Ministry of Education under the program Pythagoras II.
13.
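Traditional outlier identification methods for linear models are prone to misclassification: normal points are classified as outliers, or outliers are classified as normal points. To address this problem, the paper proposes identifying outliers with the reversible jump Markov chain Monte Carlo method and tests the idea on real data, where its identification performance is clearly better than that of traditional methods.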
14.
Cluster-based outlier detection (total citations: 1, self-citations: 0, citations by others: 1)
Outlier detection has important applications in the field of data mining, such as fraud detection, customer behavior analysis,
and intrusion detection. Outlier detection is the process of detecting the data objects which are grossly different from or
inconsistent with the remaining set of data. Outliers are traditionally considered as single points; however, there is a key
observation that many abnormal events have both temporal and spatial locality, which might form small clusters that also need
to be deemed as outliers. In other words, not only a single point but also a small cluster can probably be an outlier. In
this paper, we present a new definition for outliers, the cluster-based outlier, which is meaningful and gives importance to
the local data behavior, and we show how to detect outliers with the clustering algorithm LDBSCAN (Duan et al. in Inf. Syst. 32(7):978–986,
2007), which is capable of finding clusters and assigning LOF (Breunig et al. in Proceedings of the 2000 ACM SIGMOD International
Conference on Management of Data, ACM Press, pp. 93–104, 2000) to single points.
15.
16.
This paper proposes a statistical approach to handle the problem of detecting influential observations in deterministic nonparametric
Data Envelopment Analysis (DEA) models. We use the bootstrap method to estimate the underlying distribution for efficiency
scores in order to avoid making unrealistic assumptions about the true distribution. To measure whether a specific DMU is
truly influential, we employ relative entropy to detect the change in the distribution after the DMU in question is removed.
A statistical test has been applied to determine the significance level. Two examples from the literature are discussed and
comparisons to previous methods are provided.
17.
In this paper we obtain new bounds for the minimum output entropies of random quantum channels. These bounds rely on random matrix techniques arising from free probability theory. We then revisit the counterexamples developed by Hayden and Winter to get violations of the additivity equalities for minimum output Rényi entropies. We show that random channels obtained by randomly coupling the input to a qubit violate the additivity of the p-Rényi entropy, for all p>1. For some sequences of random quantum channels, we compute almost surely the limit of their Schatten S1→Sp norms.
18.
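High-quality decisions increasingly depend on high-quality data mining and analysis, and high-quality data mining depends on high-quality data. In surveys of the utilization of large-scale instruments, subjective and objective factors inevitably cause some data to be anomalous, which affects data quality. Suitable methods are therefore needed to detect and handle the anomalous data, and different types of data often require different outlier detection methods. This paper analyzes the overall characteristics of survey data on large-scale instrument utilization and the general methods for handling them. Taking as its main thread the usage machine-hours and shared machine-hours data from the 2009 "Survey of the Current State of Large-Scale Instrument Resources in China", led by the platform center of the national Ministry of Science and Technology, it compares the suitability of regression methods, depth-based methods and the boxplot method for detecting outliers in different types of data. Examining the data from different angles, and testing and adopting the methods suited to each, identifies the suspected outliers. This supports the subsequent analysis and handling of anomalous utilization data, improves data quality, lays a foundation for the comprehensive evaluation of large-scale instrument utilization, and offers a useful reference for outlier detection in the preprocessing of science and technology resource survey data.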
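Of the methods this abstract compares, the boxplot method is the simplest to sketch. The example below uses the common Tukey rule with fences 1.5 interquartile ranges beyond the quartiles; the multiplier, the quartile convention and the machine-hour values are assumptions for illustration, not taken from the survey.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values lying outside the boxplot fences.

    Fences are Q1 - k*IQR and Q3 + k*IQR, with quartiles computed by
    linear interpolation over the data (the "inclusive" method).
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical annual usage machine-hours (in hundreds) for seven instruments
hours = [10, 12, 11, 13, 12, 11, 95]
print(boxplot_outliers(hours))
```

Values returned by this rule are only suspected outliers; whether they reflect recording errors or genuinely unusual usage still has to be checked against the source records.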
19.
A. Pedro Duarte Silva Peter Filzmoser Paula Brito 《Advances in Data Analysis and Classification》2018,12(3):785-822
A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data.
20.
This paper examines the relative efficiency of alternative methods of producing care for the developmentally disabled. A linear programming framework is used to construct a production frontier which allows measurement of relative efficiency among institutions in the sample. Tests are performed to detect influential observations in the data which might result from measurement error which could distort the efficiency measures. Different types of institutions are compared in terms of average efficiency. Policy implications of the analysis are discussed in the concluding section.