Similar Articles
20 similar articles retrieved.
1.
This paper proposes an outlier detection procedure that applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample. Although efficiency is estimated with a deterministic frontier approach, each potential outlier is initially given the benefit of the doubt of not being an outlier. We survey several outlier detection procedures and select five complementary methodologies which, taken together, are able to detect all influential observations. To exploit the distinct character of the leverage, the peer count, the super-efficiency and order-m methods, and the peer index, we propose to flag as outliers those observations that are simultaneously identified as atypical by at least two of the procedures. A simulated example demonstrates the usefulness of this approach. The model is applied to the Portuguese drinking water sector, for which we have an unusually rich data set.
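
The "flagged by at least two procedures" selection rule is easy to prototype. The sketch below is purely illustrative: the three detectors and their thresholds are generic stand-ins (z-score, MAD, and IQR rules), not the paper's leverage, peer-count, super-efficiency, order-m, or peer-index measures, and the data are simulated.

    # Illustrative sketch of a "flagged by at least two procedures" rule.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 1.0, 5)])  # 5 planted outliers

    def flag_zscore(v, c=3.0):
        z = (v - v.mean()) / v.std()
        return np.abs(z) > c

    def flag_mad(v, c=3.5):
        med = np.median(v)
        mad = np.median(np.abs(v - med)) * 1.4826
        return np.abs(v - med) / mad > c

    def flag_iqr(v, c=1.5):
        q1, q3 = np.percentile(v, [25, 75])
        return (v < q1 - c * (q3 - q1)) | (v > q3 + c * (q3 - q1))

    votes = flag_zscore(x).astype(int) + flag_mad(x).astype(int) + flag_iqr(x).astype(int)
    outliers = np.where(votes >= 2)[0]   # keep only points flagged by at least two procedures
    print("indices flagged as outliers:", outliers)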

2.
Traditional methods for identifying outliers in linear models are prone to misclassification: normal points are labelled as outliers, or outliers are labelled as normal points. To address this problem, we propose identifying outliers with a reversible-jump Markov chain Monte Carlo method and validate it on real data; its identification performance is clearly better than that of the traditional methods.

3.
Cluster-based outlier detection
Outlier detection has important applications in the field of data mining, such as fraud detection, customer behavior analysis, and intrusion detection. Outlier detection is the process of detecting the data objects which are grossly different from or inconsistent with the remaining set of data. Outliers are traditionally considered as single points; however, a key observation is that many abnormal events have both temporal and spatial locality, and so may form small clusters that also need to be deemed outliers. In other words, not only a single point but also a small cluster can be an outlier. In this paper, we present a new definition for outliers, the cluster-based outlier, which is meaningful and gives importance to local data behavior, and we show how to detect such outliers with the clustering algorithm LDBSCAN (Duan et al., Inf. Syst. 32(7):978–986, 2007), which is capable of finding clusters and assigning LOF values (Breunig et al., Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 93–104, 2000) to single points.
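
A rough sketch of the idea that both isolated points and small, tight clusters can be outliers is given below. sklearn's DBSCAN and LocalOutlierFactor are used as convenient stand-ins for LDBSCAN and the paper's LOF assignment; the cluster-size and score thresholds are arbitrary illustrative choices.

    # Sketch: flag isolated points and suspiciously small clusters.
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(1)
    bulk = rng.normal(0, 1, size=(200, 2))
    tiny_cluster = rng.normal(6, 0.1, size=(6, 2))   # a small, tight, remote group
    X = np.vstack([bulk, tiny_cluster])

    labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
    lof = -LocalOutlierFactor(n_neighbors=20).fit(X).negative_outlier_factor_

    point_outliers = np.where((labels == -1) | (lof > 1.5))[0]      # isolated points
    small_clusters = [c for c in set(labels) - {-1}
                      if np.sum(labels == c) <= 10]                 # clusters too small to trust
    cluster_outliers = np.where(np.isin(labels, small_clusters))[0]

    print("single-point outliers:", point_outliers)
    print("cluster-based outliers:", cluster_outliers)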

4.
The presence of groups containing high leverage outliers makes linear regression a difficult problem due to the masking effect. The available high-breakdown estimators based on Least Trimmed Squares often do not succeed in detecting masked high leverage outliers in finite samples. An alternative to the LTS estimator, called the Penalised Trimmed Squares (PTS) estimator, was introduced by the authors in Zioutas and Avramidis (2005) Acta Math Appl Sin 21:323–334 and Zioutas et al. (2007) REVSTAT 5:115–136, and it appears to be less sensitive to the masking problem. This estimator is defined by a Quadratic Mixed Integer Programming (QMIP) problem whose objective function includes a penalty cost for each observation, serving as an upper bound on the residual error for any feasible regression line. Since the PTS does not require presetting the number of outliers to delete from the data set, it has better efficiency than other estimators. However, due to the high computational complexity of the resulting QMIP problem, computing exact solutions for moderately large regression problems is infeasible. In this paper we further establish the theoretical properties of the PTS estimator, such as high breakdown and efficiency, and propose an approximate algorithm called Fast-PTS to compute the PTS estimator efficiently for large data sets. Extensive computational experiments on sets of benchmark instances with varying degrees of outlier contamination indicate that the proposed algorithm performs well in identifying groups of high leverage outliers in reasonable computational time.
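
To make the penalty mechanism concrete, one plausible schematic form of a PTS-type objective is the following (the notation is introduced here for illustration and is not taken from the cited papers: δ_i = 1 means observation i is trimmed, and p_i is its penalty cost):

    \min_{\beta,\; \delta \in \{0,1\}^n} \;\; \sum_{i=1}^{n} (1 - \delta_i)\,\bigl(y_i - x_i^{\top}\beta\bigr)^{2} \;+\; \sum_{i=1}^{n} \delta_i\, p_i

Under such a formulation an observation is trimmed only when paying its penalty is cheaper than keeping its squared residual, which is why the penalty acts as an upper bound on the residual error of any retained point and why the number of trimmed observations need not be fixed in advance.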

5.
We propose new tools for visualizing large amounts of functional data in the form of smooth curves. The proposed tools include functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey’s data depth and highest density regions.

By-products of our graphical displays are outlier detection methods for functional data. We compare these new outlier detection methods with existing methods for detecting outliers in functional data, and show that our methods are better able to identify outliers.

An R package containing computer code and datasets is available in the online supplements.
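
The score-space intuition behind such displays can be sketched as follows. Ordinary PCA scores and a Tukey-fence rule are used here as crude stand-ins for the robust principal component scores, data depth, and highest density regions of the paper; none of this comes from the authors' R package, and the simulated curves are illustrative only.

    # Sketch: reduce each curve to two PC scores, flag curves with unusual scores.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    t = np.linspace(0, 1, 50)
    curves = np.array([np.sin(2 * np.pi * t) + rng.normal(0, 0.1, t.size) for _ in range(60)])
    curves[0] += 2.0                                   # one planted outlying curve (level shift)

    scores = PCA(n_components=2).fit_transform(curves)  # first two PC scores per curve

    def boxplot_flags(s, c=1.5):                        # Tukey-fence rule on one score
        q1, q3 = np.percentile(s, [25, 75])
        return (s < q1 - c * (q3 - q1)) | (s > q3 + c * (q3 - q1))

    flags = boxplot_flags(scores[:, 0]) | boxplot_flags(scores[:, 1])
    print("outlying curves:", np.where(flags)[0])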

6.
This article proposes a new technique for detecting outliers in autoregressive models and identifying the type as either innovation or additive. This technique can be used without knowledge of the true model order, outlier location, or outlier type. Specifically, we perturb an observation to obtain the perturbation size that minimizes the resulting residual sum of squares (SSE). The reduction in the SSE yields outlier detection and identification measures. In addition, the perturbation size can be used to gauge the magnitude of the outlier. Monte Carlo studies and empirical examples are presented to illustrate the performance of the proposed method as well as the impact of outliers on model selection and parameter estimation. We also obtain robust estimators and model selection criteria, which are shown in simulation studies to perform well when large outliers occur.
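
The perturbation idea can be prototyped roughly as below: for each time point, search for the perturbation of that single observation that minimizes the refitted residual sum of squares, and use the achieved SSE reduction as an outlier score. A least-squares AR(1) fit and a numerical search over the perturbation size stand in for the paper's general treatment; the planted outlier and the scoring are purely illustrative.

    # Sketch of perturbation-based outlier scoring in an AR model.
    import numpy as np
    from scipy.optimize import minimize_scalar

    def ar_sse(y, p=1):
        # OLS fit of an AR(p) model; returns the residual sum of squares
        X = np.column_stack([y[p - j - 1 : len(y) - j - 1] for j in range(p)])
        X = np.column_stack([np.ones(len(y) - p), X])
        resp = y[p:]
        beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
        return np.sum((resp - X @ beta) ** 2)

    rng = np.random.default_rng(3)
    y = np.zeros(120)
    for t in range(1, 120):
        y[t] = 0.7 * y[t - 1] + rng.normal()
    y[60] += 8.0                                       # planted additive outlier

    base = ar_sse(y)
    scores = np.zeros_like(y)
    for t in range(len(y)):
        def sse_after_perturbation(delta, t=t):
            z = y.copy()
            z[t] += delta
            return ar_sse(z)
        scores[t] = base - minimize_scalar(sse_after_perturbation).fun  # SSE reduction
    print("most suspicious time point:", int(np.argmax(scores)))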

7.
Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations, high leverage points, but also data points which lie among the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivate our work, because they may provide information about anomalies like fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.
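
The thinning step can be sketched as follows. The kernel density estimate, the bandwidth, and the inverse-density retention rule below are illustrative choices, not the authors' exact specification, and the simulated point pattern is a placeholder.

    # Sketch: retain each point with probability inversely related to its estimated density,
    # so dense regions are down-sampled while sparse points (potential outliers) are kept.
    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(4)
    dense_blob = rng.normal(0, 0.2, size=(2000, 2))    # high-density region
    spread = rng.normal(3, 1.0, size=(200, 2))
    stray = np.array([[8.0, 8.0], [9.0, -4.0]])        # isolated outliers
    X = np.vstack([dense_blob, spread, stray])

    kde = KernelDensity(bandwidth=0.3).fit(X)
    density = np.exp(kde.score_samples(X))
    retain_prob = np.minimum(1.0, np.quantile(density, 0.05) / density)  # inverse-density retention
    keep = rng.random(len(X)) < retain_prob
    thinned = X[keep]
    print(f"kept {keep.sum()} of {len(X)} points; strays retained:", keep[-2:].all())
    # robust clustering / outlier detection would then be run on `thinned`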

8.

Euclidean embedding from noisy observations containing outlier errors is an important and challenging problem in statistics and machine learning. Many existing methods struggle with outliers because they lack the ability to detect them. In this paper, we propose a matrix-optimization-based embedding model that can produce reliable embeddings and identify the outliers jointly. We show that the estimators obtained by the proposed method satisfy a non-asymptotic risk bound, implying that the model provides a high-accuracy estimator with high probability when the sample size is roughly of the order of the degrees of freedom, up to a logarithmic factor. Moreover, we show that under some mild conditions the proposed model can also identify the outliers, without any prior information, with high probability. Finally, numerical experiments demonstrate that the matrix-optimization-based model can produce configurations of high quality and successfully identify outliers even for large networks.


9.
Unmasking tests for multiple outliers in exponential samples
Testing the discordancy of multiple outliers in an exponential sample is very difficult and complicated because of masking and swamping effects. The key to solving this problem is determining the value of k, which traditional methods are unable to do. Based on the idea of the AIC criterion for variable selection, this paper proposes a new method for outlier testing: it does not require k to be specified in advance, is simple to compute, and determines k while eliminating masking and swamping in the test simply by maximizing the MAIC. Statistics and formulas for which the significance level of the test is easy to compute are also given. Finally, a worked example demonstrates the effectiveness of the proposed method.

10.
High leverage points have a tremendous effect on linear regression analysis. When a group of high leverage points is present in a dataset, existing detection methods fail to detect them correctly because of masking and swamping effects. We propose the Diagnostic Robust Generalized Potentials Based on Index Set Equality (DRGP(ISE)) to address this. The DRGP(ISE) builds on the Diagnostic Robust Generalized Potential Based on Minimum Volume Ellipsoid (DRGP(MVE)); however, the running time of ISE is much faster than that of MVE. A Monte Carlo simulation study and numerical data indicate that DRGP(ISE) works excellently in detecting the actual high leverage points and in reducing masking and swamping effects in a linear model.

11.
Concern over outliers is old, dating back to Bernoulli (see [12]); it has been reviewed historically by [11] and updated by [10] in their encyclopedia textbook. James et al. [46] used simulation techniques to compare some recently published outlier detection procedures. The history of outlier accommodation and diagnosis is traced from early to present-day commentary. Theil-type (rank), Brown-Mood, L_p, M, adaptive M, GM, and trimmed-Winsorized estimators are the most popular estimators that we review in this paper as applications of outlier accommodation. We also review and compare the main numerical and graphical displays based on residuals for flagging outliers.

12.
In this article we implement a forward search algorithm for identifying atypical subjects/observations in factor analysis models for binary data. Forward plots of goodness-of-fit statistics, residuals, and parameter estimates help us identify aberrant observations and detect deviations from the hypothesized model. Methods to initialize, progress, and monitor the search are explored. Simulation envelopes are constructed to investigate whether changes in the statistics being monitored are solely due to random variation. One real and two simulated datasets are used to illustrate the performance of the suggested algorithm. The two simulated datasets explore the effectiveness of the method in the presence of a single outlier and a cluster of outliers. Matlab computer code for implementing the proposed methods is available online.

13.
The outlier detection problem and the robust covariance estimation problem are often interchangeable. Without outliers, the classical method of maximum likelihood estimation (MLE) can be used to estimate parameters of a known distribution from observational data. When outliers are present, they dominate the log likelihood function, causing the MLE estimators to be pulled toward them. Many robust statistical methods have been developed to detect outliers and to produce estimators that are robust against deviation from model assumptions. However, the existing methods suffer either from computational complexity when problem size increases or from giving up desirable properties, such as affine equivariance. An alternative approach is to design a special mathematical programming model to find the optimal weights for all the observations, such that at the optimal solution outliers are given smaller weights and can be detected. This method produces a covariance estimator that has the following properties: First, it is affine equivariant. Second, it is computationally efficient even for large problem sizes. Third, it is easy to incorporate prior beliefs into the estimator by using semi-definite programming. The accuracy of this method is tested for different contamination models, including recently proposed ones. The method is not only faster than the Fast-MCD method for high dimensional data but also has reasonable accuracy for the tested cases.

14.
We propose a number of diagnostic methods that can be used whenever multiple outliers are identified by robust estimates for multivariate location and scatter. Their main purpose is visualization of the multivariate data to help determine whether the detected outliers (a) form separate clusters or (b) are isolated or randomly scattered (such as heavy tails compared with Gaussian). We make use of Mahalanobis distances and linear projections, to check for separation and to reveal additional aspects of the data structure. Several real data examples are analyzed, and artificial examples are used to illustrate the diagnostic power of the proposed plots.

Code to perform the diagnostics, datasets used as examples in the article, and documentation are available in the online supplements.
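
A distance-distance style check in the spirit of the diagnostics described above can be sketched as below. MinCovDet is used as a generic robust location/scatter estimator and the cutoff is the usual chi-square quantile; neither is claimed to match the article's choices, and the data are simulated for illustration.

    # Sketch: compare robust and classical Mahalanobis distances so that masked
    # outliers (small under the classical fit, large under the robust fit) show up.
    import numpy as np
    from sklearn.covariance import MinCovDet, EmpiricalCovariance
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    X = rng.multivariate_normal([0, 0, 0], np.eye(3), size=150)
    X[:10] += np.array([6.0, 6.0, 0.0])                # a clump of outliers that can mask itself

    robust_d = np.sqrt(MinCovDet(random_state=0).fit(X).mahalanobis(X))
    classic_d = np.sqrt(EmpiricalCovariance().fit(X).mahalanobis(X))
    cut = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))

    flagged = np.where(robust_d > cut)[0]
    masked = np.where((robust_d > cut) & (classic_d <= cut))[0]  # visible only to the robust fit
    print("flagged:", flagged)
    print("masked under the classical distances:", masked)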

15.
This article extends the analysis of multivariate transformations to linear and quadratic discriminant analysis. It shows that the standard application of deletion diagnostic techniques for validating a particular transformation suffers from masking and so may fail if several outliers are present. We therefore suggest a simple and powerful method based on a forward search algorithm. This robust diagnostic procedure orders the observations from those most in agreement with the suggested model to those least in agreement with it. It provides a unified approach to the detection of influential observations and outliers in discriminant analysis. Simulated and real data are used to show the necessity of considering multivariate transformations in discriminant analysis. The examples demonstrate the power of the suggested approach in revealing the correct structure of the data when this is obscured by outliers.
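
To illustrate how a forward search orders observations, a generic linear-regression version can be sketched as follows; the initial-subset rule and the monitored statistic below are crude stand-ins, and the discriminant-analysis and transformation aspects of the article are not reproduced.

    # Generic forward-search sketch: start from a small, seemingly clean subset, refit at
    # each step, and grow the subset by one observation at a time, always keeping the
    # observations that agree best with the current fit.  Sharp jumps in the monitored
    # statistic signal where outliers start to enter the subset.
    import numpy as np

    rng = np.random.default_rng(6)
    n = 80
    x = rng.uniform(0, 10, n)
    y = 2.0 + 0.5 * x + rng.normal(0, 0.5, n)
    y[:6] += 10.0                                       # a group of outliers

    X = np.column_stack([np.ones(n), x])
    order = np.argsort(np.abs(y - np.median(y)))        # crude robust starting subset
    subset = list(order[:10])
    monitor = []

    for m in range(len(subset), n):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        resid = np.abs(y - X @ beta)
        monitor.append(resid[subset].max())             # statistic monitored along the search
        subset = list(np.argsort(resid)[: m + 1])       # next subset: the m+1 best-fitting points

    print("monitored max |residual| over the last 10 steps:", np.round(monitor[-10:], 2))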

16.
Many different methods for statistical data editing can be found in the literature, but only a few of them are based on robust estimates (for example the BACON-EEM, epidemic algorithm (EA), and transformed rank correlation (TRC) methods of Béguin and Hulliger). However, we can show that outlier detection is only reasonable if robust methods are applied, because the classical estimates are themselves influenced by the outliers. Nevertheless, data editing is essential to check the multivariate data for possible data problems, and it is not deterministic like traditional micro editing, where all records are extensively edited manually using certain rules/constraints. The presence of missing values is more the rule than the exception in business surveys and poses additional severe challenges for outlier detection. First we review the available multivariate outlier detection methods which can cope with incomplete data. In a simulation study, where a subset of the Austrian Structural Business Statistics is simulated, we compare several approaches. Robust methods based on the Minimum Covariance Determinant (MCD) estimator, S-estimators and the OGK estimator, as well as BACON-EEM, provide the best results in finding the outliers and in providing a low false discovery rate. Many of the discussed methods are implemented in the R package rrcovNA, which is available from the Comprehensive R Archive Network (CRAN) at http://www.CRAN.R-project.org under the GNU General Public License.

17.
In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results, particularly for two-stage analyses. In the DEA-related literature, prior work on this issue has focused on the efficient frontier as a basis for detecting outliers. An iterative approach for dealing with the potential for one outlier to mask the presence of another has been proposed but not demonstrated. This paper proposes using both the efficient frontier and the inefficient frontier to identify outliers and thereby improve the accuracy of second stage results in two-stage nonparametric analysis. The iterative outlier detection approach is implemented in a leave-one-out method using both the efficient frontier and the inefficient frontier and demonstrated in a two-stage semi-parametric bootstrapping analysis of a classic data set. The results show that the conclusions drawn can be different when outlier identification includes consideration of the inefficient frontier.

18.
In this paper we tackle the problem of outlier detection in data envelopment analysis (DEA). We propose a procedure that merges super-efficiency DEA and the forward search. Since DEA provides efficiency scores, which are not parameters that fit the model to the data, we introduce a distance to be monitored along the search. This distance is obtained by integrating a regression model with super-efficiency DEA. We simulate a Cobb-Douglas production function and compare the super-efficiency DEA and forward search analyses in both uncontaminated and contaminated settings. For inference about outliers, we exploit envelopes obtained through Monte Carlo simulations.

19.
We consider the problem of deleting bad influential observations (outliers) in linear regression models. The problem is formulated as a Quadratic Mixed Integer Programming (QMIP) problem, where penalty costs for discarding outliers are included in the objective function. The optimum solution defines a robust regression estimator called penalized trimmed squares (PTS). Due to the high computational complexity of the resulting QMIP problem, the proposed robust procedure is computationally suitable for small sample data. The computational performance and the effectiveness of the new procedure are improved significantly by using the idea of the ε-insensitive loss function from support vector machine regression. Small errors are ignored, and the mathematical formulation gains the sparseness property. The good performance of the ε-insensitive PTS (IPTS) estimator allows identification of multiple outliers while avoiding masking or swamping effects. The computational effectiveness and successful outlier detection of the proposed method are demonstrated via simulated experiments. This research has been partially funded by the Greek Ministry of Education under the program Pythagoras II.

20.
In this paper, we propose a robust L1-norm non-parallel proximal support vector machine (L1-NPSVM), which aims to give robust performance for binary classification in contrast to GEPSVM, especially for problems with outliers. The proposed L1-NPSVM has three main properties. First, unlike the traditional GEPSVM, which solves two generalized eigenvalue problems, our L1-NPSVM solves a pair of L1-norm optimization problems using a simple, justifiable iterative technique. Second, by introducing the L1-norm, our L1-NPSVM is considerably more robust to outliers than GEPSVM. Third, compared with GEPSVM, no parameters need to be regularized in our L1-NPSVM. The effectiveness of the proposed method is demonstrated by tests on a simple artificial example as well as on some UCI datasets, which show its improvements over GEPSVM.
