Retrieved 20 similar documents (search time: 169 ms)
1.
Chinese Journal of Applied Probability and Statistics (《应用概率统计》), 2021, (2)
The demands of model accuracy and robustness make outlier detection and robust estimation increasingly important in model building. This paper first screens the data for outliers using a high-dimensional influence measure (HIM) built on marginal correlation together with a high-dimensional outlier discrimination method (HDC) built on distance correlation, splitting the data into an initial normal set and an initial outlier set. Starting from the initial normal set, a correction procedure for points misjudged into it is constructed using robust parameter estimation and hyperellipsoidal contours of the residual space, and the outlier probability of every point in the initial outlier set is recomputed so that normal points misjudged as outliers can be recovered, further improving detection accuracy. Simulations with three types of outlying data under two data structures demonstrate the effectiveness of the proposed method, which is also verified and analyzed on a real example.
2.
For the problem of outliers in composite evaluation, three questions are discussed: whether the raw data contain outliers; if they do, how to identify them; and how to normalize evaluation data that contain outliers. For identification, the median is taken as the reference point, and the distances by which the sorted data at the two tails deviate from the median are compared. For normalizing data that contain outliers, the common extreme-value (min-max) method is extended into a piecewise scheme that assigns separate normalization intervals to outliers and non-outliers. Comparison with the identification and normalization results of the existing literature verifies the effectiveness of the proposed method, which screens outliers to an appropriate degree and yields a more balanced distribution of the normalized data.
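The median-referenced screening idea above can be sketched in a few lines. This is a hedged illustration rather than the paper's exact rule: it flags points whose distance from the median exceeds k times the median absolute deviation (MAD); the cutoff k = 3 is an assumed default.

```python
# Hypothetical median/MAD outlier screen (an illustration of median-referenced
# screening, not the paper's exact procedure; k = 3 is an assumed cutoff).
def mad_outliers(xs, k=3.0):
    def median(vals):
        vals = sorted(vals)
        n = len(vals)
        mid = n // 2
        return vals[mid] if n % 2 else 0.5 * (vals[mid - 1] + vals[mid])

    center = median(xs)
    mad = median([abs(x - center) for x in xs])
    if mad == 0:
        return []  # data essentially constant; nothing to flag
    return [x for x in xs if abs(x - center) > k * mad]
```

For example, `mad_outliers([1, 2, 3, 2, 1, 2, 3, 100])` returns `[100]`: the median is 2, the MAD is 1, and only 100 lies more than 3 MADs from the median.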
3.
4.
From the new perspective that outliers simultaneously affect the estimates of the heterogeneity parameter and the regression coefficients, this paper uses a variance-weight outlier model (VWOM) to study the identification and correction of multiple outliers in random-effects meta-regression. First, Score (SC) test statistics for the meta-regression VWOM are derived under both ML and REML estimation; three variance perturbation schemes are considered (global variance, individual variance, and random error), and the SC statistics under the three schemes are proved equivalent. Second, again because outliers affect the heterogeneity parameter and the regression coefficients jointly, a variance-weight outlier modified model (VWOMM) for random-effects meta-regression is proposed, together with iterative ML and REML estimation algorithms that are solved numerically. Simulations then verify the size and power of the SC test statistics. Finally, case studies on outlier identification and treatment for two different types of effect size show that the SC test of the meta-regression VWOM identifies outliers effectively and that the VWOMM clearly improves model fit, offering a new way to identify and handle outliers in complex data.
5.
6.
Application of extension data mining to teaching-quality evaluation in universities (Cited: 3; self-citations: 1; by others: 2)
Universities accumulate large amounts of data in teaching and administration, but these data are rarely put to effective use. This paper introduces extension data mining into the teaching domain: useful information hidden in teaching-evaluation data is extracted to support the decisions of teaching administrators. Extension analysis is first used to locate evaluation data of sufficient quality for effective mining; these data are then mined in two directions: the key factors affecting teaching quality, and association rules between teaching quality and teacher characteristics.
7.
To remedy shortcomings of the existing statistic for faults per thousand vehicles, this paper starts by improving the statistical method itself, redefining faults per thousand vehicles, and then uses cluster analysis from data mining to pool production batches with the same characteristics and build a general operations-research model. To handle missing data and near-term prediction, the general model is adjusted by "learning" weights among similar data and fitting a curve to the weighted data to obtain predictions. For long-term prediction, where data are severely lacking, predictions are computed from a purely statistical standpoint. The resulting model is general and widely applicable; SAS and MATLAB are used to solve it, and the predictions are accurate and consistent with reality.
8.
Current satellite on-orbit state monitoring and anomaly detection based on telemetry mostly extracts features through signal-processing methods such as spectral analysis, which cope poorly with the discrete values, large volume, and complex anomalies and noise of telemetry data; the extracted features are indistinct and fail to meet the requirements of telemetry anomaly detection. This paper proposes extracting fluctuation features from satellite telemetry, using the change frequency or cumulative number of changes of the telemetry as the feature; the approach is simple to implement, fast and efficient, and insensitive to anomalous data. On top of the extracted fluctuation features, an on-orbit anomaly detection method based on the sequential probability ratio test (SPRT) is proposed. Case studies show that the extracted features identify satellite anomalies well, with high computational efficiency and good detection performance.
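As a rough illustration of the SPRT step (the fluctuation-feature extraction and all numeric settings below are assumptions for the sketch, not values from the paper), a Gaussian two-hypothesis SPRT over a stream of feature values can be written as:

```python
import math

# Hedged sketch of a Wald sequential probability ratio test (SPRT) on a
# scalar feature stream (e.g. per-window fluctuation counts). mu0/mu1/sigma
# and the error rates alpha/beta are illustrative assumptions.
def sprt(samples, mu0=0.0, mu1=3.0, sigma=1.0, alpha=0.01, beta=0.01):
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1 (anomaly)
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0 (normal)
    llr = 0.0
    for i, x in enumerate(samples):
        # log-likelihood-ratio increment for Gaussian H1 vs H0
        llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return "anomaly", i
        if llr <= lower:
            return "normal", i
    return "undecided", len(samples) - 1
```

The test accumulates the log-likelihood ratio sample by sample and stops as soon as either Wald boundary is crossed, which is what makes the SPRT attractive for online telemetry monitoring.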
9.
Because model identification and parameter estimation in ARMA modelling are easily distorted by outlying observations, this paper builds an ARMA model that accommodates both additive outliers and innovational outliers. A Bayesian Markov chain Monte Carlo method based on Gibbs sampling is used to estimate the robust ARMA parameters while simultaneously locating the outliers among the observations and classifying their type. A simulation study on China's natural population growth data shows that the Bayesian method identifies outliers in ARMA series effectively.
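The Gibbs-sampling machinery is beyond a short sketch, but the underlying intuition, that outlying observations reveal themselves through large one-step residuals, can be illustrated with a deliberately simplified, non-Bayesian AR(1) stand-in. Everything here is an assumption for illustration; it does not reproduce the paper's method and does not distinguish additive from innovational outliers.

```python
import numpy as np

# Grossly simplified stand-in: fit an AR(1) by least squares and flag
# observations whose standardized one-step residual is large. This is only
# an illustration of residual-based screening, not the paper's Bayesian
# AO/IO classification.
def ar1_outliers(y, threshold=3.0):
    y = np.asarray(y, dtype=float)
    phi = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])   # LS estimate of the AR(1) coefficient
    resid = y[1:] - phi * y[:-1]                 # one-step-ahead residuals
    z = (resid - resid.mean()) / resid.std()     # standardized residuals
    # resid[i] corresponds to observation y[i + 1]
    return [i + 1 for i in np.where(np.abs(z) > threshold)[0]]
```

Note that a single additive outlier contaminates two consecutive residuals (as target and as predictor), one reason the literature treats AO and IO types separately.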
10.
For anomaly detection in multivariate data containing several normal classes, a semi-supervised method based on a multi-class Mahalanobis-Taguchi system is proposed. A Mahalanobis space is built for each normal class in the training set, yielding a multi-class measurement scale based on Mahalanobis distance; the method classifies the normal data in the test set while simultaneously detecting the anomalous data. Simulations on Gaussian-mixture data with injected outliers verify the effectiveness of the method.
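A minimal sketch of the per-class Mahalanobis-space idea follows. It is not the paper's Mahalanobis-Taguchi construction; the class structure and the distance cutoff of 3 are assumptions for illustration.

```python
import numpy as np

# Illustrative per-class Mahalanobis detector: fit one (mean, inverse
# covariance) pair per normal class, then assign a point to the nearest
# class or flag it as an outlier if even the nearest class is too far.
class MahalanobisDetector:
    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.spaces = {}  # label -> (mean, inverse covariance)

    def fit(self, classes):
        for label, X in classes.items():
            X = np.asarray(X, dtype=float)
            mean = X.mean(axis=0)
            cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
            self.spaces[label] = (mean, cov_inv)
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        dists = {label: float(np.sqrt((x - m) @ ci @ (x - m)))
                 for label, (m, ci) in self.spaces.items()}
        label = min(dists, key=dists.get)
        if dists[label] > self.threshold:
            return "outlier", dists[label]
        return label, dists[label]
```

Classification and anomaly detection thus share one mechanism: a point far from every Mahalanobis space belongs to no normal class.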
11.
12.
Richard Esposito, James K. Fox, Dongyow Lin, Kevin Tidemann, Journal of Computational and Graphical Statistics, 2013, 22(2): 113-125
The massive flood of numbers in ongoing large-scale periodic economic and social surveys commonly leaves little time for anything but a cursory examination of the quality of the data, and few techniques exist for giving an overview of data activity. At the U.S. Bureau of Labor Statistics, a graphical and query-based solution to these problems has recently been adopted for data review in the Current Employment Statistics survey. Chief among the motivations for creating the new system were: (1) Reduce or eliminate the arduous paper review of thousands of sample reports by review analysts; (2) allow the review analysts a more global view of sample activity and at the same time make outlier detection less of a strain; and (3) present global views of estimates over time and among groups of subestimates. The specific graphics approaches used in the new system were designed to quickly portray both time series and cross-sectional aspects of the data, as these are both critical elements in the review process. The described system allows the data analysts to track down suspicious sample members by first graphically pinpointing questionable estimates, and then pinpointing questionable sample data used to produce those estimates. Query methods are used for cross-checking relationships among different sample data elements. Although designed for outlier detection and estimation, the data-representation methods employed in the system have opened up new possibilities for further statistical and economic uses of the data. The authors were torn between the desire for a completely automatic system of data review and the practical demands of an actual survey operating under imperfect conditions, and thus viewed the new system as an evolutionary advance, not as an ideal final solution. Possibilities opened up by the new system prompted some further thinking on finding an ideal state.
13.
The detection of multidimensional outliers is a fundamental and important problem in applied statistics. The unreliability of multivariate outlier detection techniques such as Mahalanobis distance and hat-matrix leverage has led to the development of alternative techniques that have been known in the statistical community for well over a decade, and the literature on this subject is vast and growing. In this paper, we propose to use the artificial intelligence technique of the self-organizing map (SOM) for detecting multiple outliers in multidimensional datasets. SOM, which produces a topology-preserving mapping of the multidimensional data cloud onto a lower-dimensional visualizable plane, provides an easy way of detecting multidimensional outliers in the data at their respective levels of leverage. The proposed SOM-based method not only identifies the multidimensional outliers, it actually provides information about the entire outlier neighbourhood. Being an artificial intelligence technique, SOM-based outlier detection is non-parametric and can be used to detect outliers from very large multidimensional datasets. The method is applied to detect outliers from varied types of simulated multivariate datasets, a benchmark dataset, and a real-life cheque-processing dataset. The results show that SOM can effectively be used as a technique for multidimensional outlier detection.
14.
Hu Yang, Ting Yang, Acta Mathematicae Applicatae Sinica, English Series (《应用数学学报(英文版)》), 2005, 21(2): 303-310
Outlier mining is an important aspect of data mining, and outlier mining based on Cook's distance is among the most commonly used approaches. However, when the data exhibit multicollinearity, the traditional Cook's-distance method is no longer effective. Exploiting the strengths of principal component estimation, we substitute it for least squares estimation and derive a Cook's distance measure based on the principal component estimator, which can then be used for outlier mining. We also investigate the related theory and application problems.
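For reference, classical Cook's distance under ordinary least squares can be sketched as follows. The paper's actual contribution, substituting a principal-component estimator for least squares under multicollinearity, is not reproduced here.

```python
import numpy as np

# Classical Cook's distance for OLS regression with an intercept:
# D_i = r_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2), where h_ii is leverage.
# (A sketch of the standard formula; the paper replaces the OLS estimator.)
def cooks_distance(X, y):
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
    h = np.diag(H)                                 # leverages h_ii
    resid = y - H @ y                              # OLS residuals
    p = X.shape[1]                                 # number of parameters
    s2 = resid @ resid / (len(y) - p)              # residual variance
    return resid ** 2 * h / (s2 * p * (1 - h) ** 2)
```

On a toy line y = 2x with one corrupted observation, the corrupted point receives by far the largest distance, which is the screening behavior the substitution is meant to preserve when X.T @ X is near-singular.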
15.
Oluwasegun Taiwo Ojo, Antonio Fernández Anta, Rosa E. Lillo, Carlo Sguera, Advances in Data Analysis and Classification, 2022, 16(3): 725-760
We propose two new outlier detection methods, for identifying and classifying different types of outliers in (big) functional data sets. The proposed...
16.
Robert Kosara, Journal of Computational and Graphical Statistics, 2013, 22(1): 29-32
We propose new tools for visualizing large amounts of functional data in the form of smooth curves. The proposed tools include functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey’s data depth and highest density regions. By-products of our graphical displays are outlier detection methods for functional data. We compare these new outlier detection methods with existing methods for detecting outliers in functional data, and show that our methods are better able to identify outliers. An R-package containing computer code and datasets is available in the online supplements.
17.
Cluster-based outlier detection (Cited: 1; self-citations: 0; by others: 1)
Outlier detection has important applications in the field of data mining, such as fraud detection, customer behavior analysis, and intrusion detection. Outlier detection is the process of detecting data objects that are grossly different from, or inconsistent with, the remaining set of data. Outliers are traditionally considered as single points; however, a key observation is that many abnormal events have both temporal and spatial locality and may form small clusters that also need to be deemed outliers. In other words, not only a single point but also a small cluster can be an outlier. In this paper, we present a new definition of outlier, the cluster-based outlier, which is meaningful and gives importance to the local data behavior, and we show how to detect such outliers with the clustering algorithm LDBSCAN (Duan et al. in Inf. Syst. 32(7):978-986, 2007), which is capable of finding clusters and assigning LOF (Breunig et al. in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 93-104, 2000) to single points.
18.
Multiple-Project Scheduling with Controllable Project Duration and Hard Resource Constraint: Some Solvable Cases (Cited: 1; self-citations: 0; by others: 1)
In many large-scale project scheduling problems, multiple projects either take place at the same time or are scheduled into a tight sequence in order to share a common resource efficiently. One example is computing-resource allocation at an Application Service Provider (ASP), which provides data-processing services for multiple paying customers. Typical services provided by ASPs are data mining, payroll processing, internet-based storage backup, and Customer Relationship Management (CRM) services. The processing mode of an ASP can be either batch or concurrent, depending on the type of service rendered. For example, for CPU-intensive services or those requiring long processing times, it is more economical to process one customer request at a time in order to minimize context-switching overhead. While the data-transaction processes within a service request are subject to certain precedence relationships, the requests from different customers to an ASP are independent of each other, and the total time required to process a service request depends on the computing resource allocated to that request. The related issue of achieving an optimal use of resources at ASPs leads to the problem of project scheduling with controllable project duration. In this paper, we present efficient algorithms for solving several special cases of such multi-project scheduling problems with controllable project duration and hard resource constraints. Two types of problems are considered. In type I, the duration of each project includes a constant plus a term that is inversely proportional to the amount of resource allocated. In type II, the duration of each individual project is a continuous decreasing function of the amount of resource allocated.
19.
20.
Influential observations in frontier models, a robust non-oriented approach to the water sector (Cited: 1; self-citations: 0; by others: 1)
This paper suggests an outlier detection procedure that applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample. Although efficiency is estimated in a deterministic frontier approach, each potential outlier is initially given the benefit of the doubt of not being an outlier. We survey several outlier detection procedures and select five complementary methodologies which, taken together, are able to detect all influential observations: the leverage, the peer count, the super-efficiency, the order-m method, and the peer index. To exploit their complementarity, we propose to select as outliers those observations that are simultaneously revealed as atypical by at least two of the procedures. A simulated example demonstrates the usefulness of this approach. The model is applied to the Portuguese drinking-water sector, for which we have an unusually rich data set.