首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Advances in Data Analysis and Classification - We propose two new outlier detection methods, for identifying and classifying different types of outliers in (big) functional data sets. The proposed...  相似文献   

2.
Many applications in data analysis rely on the decomposition of a data matrix into a low-rank and a sparse component. Existing methods that tackle this task use the nuclear norm and \(\ell _1\) -cost functions as convex relaxations of the rank constraint and the sparsity measure, respectively, or employ thresholding techniques. We propose a method that allows for reconstructing and tracking a subspace of upper-bounded dimension from incomplete and corrupted observations. It does not require any a priori information about the number of outliers. The core of our algorithm is an intrinsic Conjugate Gradient method on the set of orthogonal projection matrices, the so-called Grassmannian. Non-convex sparsity measures are used for outlier detection, which leads to improved performance in terms of robustly recovering and tracking the low-rank matrix. In particular, our approach can cope with more outliers and with an underlying matrix of higher rank than other state-of-the-art methods.  相似文献   

3.

Euclidean embedding from noisy observations containing outlier errors is an important and challenging problem in statistics and machine learning. Many existing methods would struggle with outliers due to a lack of detection ability. In this paper, we propose a matrix optimization based embedding model that can produce reliable embeddings and identify the outliers jointly. We show that the estimators obtained by the proposed method satisfy a non-asymptotic risk bound, implying that the model provides a high accuracy estimator with high probability when the order of the sample size is roughly the degree of freedom up to a logarithmic factor. Moreover, we show that under some mild conditions, the proposed model also can identify the outliers without any prior information with high probability. Finally, numerical experiments demonstrate that the matrix optimization-based model can produce configurations of high quality and successfully identify outliers even for large networks.

  相似文献   

4.
Cluster-based outlier detection   总被引:1,自引:0,他引:1  
Outlier detection has important applications in the field of data mining, such as fraud detection, customer behavior analysis, and intrusion detection. Outlier detection is the process of detecting the data objects which are grossly different from or inconsistent with the remaining set of data. Outliers are traditionally considered as single points; however, there is a key observation that many abnormal events have both temporal and spatial locality, which might form small clusters that also need to be deemed as outliers. In other words, not only a single point but also a small cluster can probably be an outlier. In this paper, we present a new definition for outliers: cluster-based outlier, which is meaningful and provides importance to the local data behavior, and how to detect outliers by the clustering algorithm LDBSCAN (Duan et al. in Inf. Syst. 32(7):978–986, 2007) which is capable of finding clusters and assigning LOF (Breunig et al. in Proceedings of the 2000 ACM SIG MOD International Conference on Manegement of Data, ACM Press, pp. 93–104, 2000) to single points.  相似文献   

5.
传统线性模型异常点识别方法容易发生误判:正常点被归为异常点或者异常点被归为正常点.为解决此类问题,提出了应用逆跳马尔科夫蒙特卡洛方法识别异常点的思想,同时将其应用于实际数据加以检验,识别效果明显好于传统方法.  相似文献   

6.
Many different methods for statistical data editing can be found in the literature but only few of them are based on robust estimates (for example such as BACON-EEM, epidemic algorithms (EA) and transformed rank correlation (TRC) methods of Béguin and Hulliger). However, we can show that outlier detection is only reasonable if robust methods are applied, because the classical estimates are themselves influenced by the outliers. Nevertheless, data editing is essential to check the multivariate data for possible data problems and it is not deterministic like the traditional micro editing where all records are extensively edited manually using certain rules/constraints. The presence of missing values is more a rule than an exception in business surveys and poses additional severe challenges to the outlier detection. First we review the available multivariate outlier detection methods which can cope with incomplete data. In a simulation study, where a subset of the Austrian Structural Business Statistics is simulated, we compare several approaches. Robust methods based on the Minimum Covariance Determinant (MCD) estimator, S-estimators and OGK-estimator as well as BACON-BEM provide the best results in finding the outliers and in providing a low false discovery rate. Many of the discussed methods are implemented in the R package ${\tt{rrcovNA}}$ which is available from the Comprehensive R Archive Network (CRAN) at http://www.CRAN.R-project.org under the GNU General Public License.  相似文献   

7.
Summary  The problem of detection of multidimensional outliers is a fundamental and important problem in applied statistics. The unreliability of multivariate outlier detection techniques such as Mahalanobis distance and hat matrix leverage has led to development of techniques which have been known in the statistical community for well over a decade. The literature on this subject is vast and growing. In this paper, we propose to use the artificial intelligence technique ofself-organizing map (SOM) for detecting multiple outliers in multidimensional datasets. SOM, which produces a topology-preserving mapping of the multidimensional data cloud onto lower dimensional visualizable plane, provides an easy way of detection of multidimensional outliers in the data, at respective levels of leverage. The proposed SOM based method for outlier detection not only identifies the multidimensional outliers, it actually provides information about the entire outlier neighbourhood. Being an artificial intelligence technique, SOM based outlier detection technique is non-parametric and can be used to detect outliers from very large multidimensional datasets. The method is applied to detect outliers from varied types of simulated multivariate datasets, a benchmark dataset and also to real life cheque processing dataset. The results show that SOM can effectively be used as a useful technique for multidimensional outlier detection.  相似文献   

8.
In this paper we introduce COV, a novel information retrieval (IR) algorithm for massive databases based on vector space modeling and spectral analysis of the covariance matrix, for the document vectors, to reduce the scale of the problem. Since the dimension of the covariance matrix depends on the attribute space and is independent of the number of documents, COV can be applied to databases that are too massive for methods based on the singular value decomposition of the document-attribute matrix, such as latent semantic indexing (LSI). In addition to improved scalability, theoretical considerations indicate that results from our algorithm tend to be more accurate than those from LSI, particularly in detecting subtle differences in document vectors. We demonstrate the power and accuracy of COV through an important topic in data mining, known as outlier cluster detection. We propose two new algorithms for detecting major and outlier clusters in databases—the first is based on LSI, and the second on COV. Our implementation studies indicate that our cluster detection algorithms outperform the basic LSI and COV algorithm in detecting outlier clusters.  相似文献   

9.
We consider the problem of deleting bad influential observations (outliers) in linear regression models. The problem is formulated as a Quadratic Mixed Integer Programming (QMIP) problem, where penalty costs for discarding outliers are used into the objective function. The optimum solution defines a robust regression estimator called penalized trimmed squares (PTS). Due to the high computational complexity of the resulting QMIP problem, the proposed robust procedure is computationally suitable for small sample data. The computational performance and the effectiveness of the new procedure are improved significantly by using the idea of ε-Insensitive loss function from support vectors machine regression. Small errors are ignored, and the mathematical formula gains the sparseness property. The good performance of the ε-Insensitive PTS (IPTS) estimator allows identification of multiple outliers avoiding masking or swamping effects. The computational effectiveness and successful outlier detection of the proposed method is demonstrated via simulated experiments. This research has been partially funded by the Greek Ministry of Education under the program Pythagoras II.  相似文献   

10.
Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations, high leverage points, but also data points which lie among the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivate our work, because they may provide information about anomalies like fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.  相似文献   

11.

Measuring dependence is a very important tool to analyze pairs of functional data. The coefficients currently available to quantify association between two sets of curves show a non robust behavior under the presence of outliers. We propose a new robust numerical measure of association for bivariate functional data. We extend in this paper Kendall coefficient for finite dimensional observations to the functional setting. We also study its statistical properties. An extensive simulation study shows the good behavior of this new measure for different types of functional data. Moreover, we apply it to establish association for real data, including microarrays time series in genetics.

  相似文献   

12.
This article proposes a new technique for detecting outliers in autoregressive models and identifying the type as either innovation or additive. This technique can be used without knowledge of the true model order, outlier location, or outlier type. Specifically, we perturb an observation to obtain the perturbation size that minimizes the resulting residual sum of squares (SSE). The reduction in the SSE yields outlier detection and identification measures. In addition, the perturbation size can be used to gauge the magnitude of the outlier. Monte Carlo studies and empirical examples are presented to illustrate the performance of the proposed method as well as the impact of outliers on model selection and parameter estimation. We also obtain robust estimators and model selection criteria, which are shown in simulation studies to perform well when large outliers occur.  相似文献   

13.
This paper suggests an outlier detection procedure which applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample. Although efficiency is estimated in a deterministic frontier approach, each potential outlier initially benefits of the doubt of not being an outlier. We survey several outlier detection procedures and select five complementary methodologies which, taken together, are able to detect all influential observations. To exploit the singularity of the leverage and the peer count, the super-efficiency and the order-m method and the peer index, it is proposed to select these observations as outliers which are simultaneously revealed as atypical by at least two of the procedures. A simulated example demonstrates the usefulness of this approach. The model is applied to the Portuguese drinking water sector, for which we have an unusually rich data set.  相似文献   

14.
聂斌  王曦  胡雪 《运筹与管理》2019,28(1):101-107
在质量控制领域,非线性轮廓异常点识别问题是重点研究问题之一。本文综合运用了小波分析、数据深度、聚类分析等数据分析处理技术,提出了一种新的非正态变异的异常点识别方法。文章通过仿真分析技术,将新方法χ2与控制图方法进行性能对比,结果证实新方法能够以更高的准确率和稳定性识别异常点,表现出更好的异常点识别性能。最后将新方法应用于木板垂直密度轮廓实例对新方法进行验证,分析结果表明本方法能够有效识别出异常轮廓数据。  相似文献   

15.
Abstract

The importance of histograms and boxplots as exploratory data analysis (EDA) tools has been well established. Yet, adopting them for lifetime data is not straightforward. The first problem, particularly in histograms, is how to deal with censored observations. The second problem is that the underlying distribution of lifetime data is often highly positively skewed. Therefore, in the classical boxplot, large observations can be wrongly diagnosed as outliers. This article extends the definition of the histogram to accommodate for right-censored observations. We apply the “redistribution to the right” method so that the weight of each uncensored observation is actually the jump of the Kaplan—Meier estimation at this point. We also propose modified boxplots to resolve both problems of right censoring and skewness. In our procedure, the fences differ from those of the classical boxplot. Simulation results are presented for an evaluation of our procedure and an example is presented for illustration.  相似文献   

16.
17.
In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results, particularly for two-stage analyses. In the DEA-related literature, prior work on this issue has focused on the efficient frontier as a basis for detecting outliers. An iterative approach for dealing with the potential for one outlier to mask the presence of another has been proposed but not demonstrated. This paper proposes using both the efficient frontier and the inefficient frontier to identify outliers and thereby improve the accuracy of second stage results in two-stage nonparametric analysis. The iterative outlier detection approach is implemented in a leave-one-out method using both the efficient frontier and the inefficient frontier and demonstrated in a two-stage semi-parametric bootstrapping analysis of a classic data set. The results show that the conclusions drawn can be different when outlier identification includes consideration of the inefficient frontier.  相似文献   

18.
高质量的决策越来越依赖于高质量的数据挖掘及其分析,高质量的数据挖掘离不开高质量的数据.在大型仪器利用情况调查中,由于主客观因素,总是致使有些数据出现异常,影响数据的质量.这就需要通过适用的方法对异常数据进行检测处理.不同类型数据往往需要不同的异常值检测方法.分析了大型仪器利用情况调查数据的总体特点、一般方法,并以国家科技部平台中心主持的"我国大型仪器资源现状调查"(2009)中大型仪器使用机时和共享机时数据为主线,比较研究了回归方法、基于深度的方法和箱线图方法等对不同类型数据异常值检测的适用性.选取不同角度,检验并采用不同的适用方法,找出相关的可疑异常值,有助于下一步有效开展大型仪器利用情况异常数据的分析处理,提高数据质量,为大型仪器利用情况综合评价奠定基础,也为科技资源调查数据预处理中异常值检测方法提供有益借鉴.  相似文献   

19.
Abstract

The massive flood of numbers in ongoing large-scale periodic economic and social surveys commonly leaves little time for anything but a cursory examination of the quality of the data, and few techniques exist for giving an overview of data activity. At the U.S. Bureau of Labor Statistics, a graphical and query-based solution to these problems has recently been adopted for data review in the Current Employment Statistics survey. Chief among the motivations for creating the new system were: (1) Reduce or eliminate the arduous paper review of thousands of sample reports by review analysts; (2) allow the review analysts a more global view of sample activity and at the same time make outlier detection less of a strain; and (3) present global views of estimates over time and among groups of subestimates. The specific graphics approaches used in the new system were designed to quickly portray both time series and cross-sectional aspects of the data, as these are both critical elements in the review process. The described system allows the data analysts to track down suspicious sample members by first graphically pinpointing questionable estimates, and then pinpointing questionable sample data used to produce those estimates. Query methods are used for cross-checking relationships among different sample data elements. Although designed for outlier detection and estimation, the data-representation methods employed in the system have opened up new possibilities for further statistical and economic uses of the data. The authors were torn between the desire for a completely automatic system of data review and the practical demands of an actual survey operating under imperfect conditions, and thus viewed the new system as an evolutionary advance, not as an ideal final solution. Possibilities opened up by the new system prompted some further thinking on finding an ideal state.  相似文献   

20.
This article considers the problem of detecting outliers in time series data and proposes a general detection method based on wavelets. Unlike other detection procedures found in the literature, our method does not require that a model be specified for the data. Also, use of our method is not restricted to data generated from ARIMA processes. The effectiveness of the proposed method is compared with existing outlier detection procedures. Comparisons based on various models, sample sizes, and parameter values illustrate the effectiveness of the proposed method.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号