Similar Articles
A total of 20 similar articles were retrieved.
1.
Demands for model accuracy and robustness have made outlier detection and robust estimation increasingly important in model building. This paper first uses a high-dimensional influence measure (HIM) built from marginal correlation coefficients and a high-dimensional outlier discrimination method (HDC) built from distance correlation coefficients to screen the data for outliers, splitting the data set into an initial set of normal points and an initial set of outliers. On the basis of the initial normal set, a correction procedure for misclassified points is then constructed using robust parameter estimation and hyperellipsoidal contours in the residual space, and the outlier probability of each point in the initial outlier set is recomputed so that normal points misclassified as outliers can be moved back, further raising the accuracy of outlier detection. Simulations with three types of anomalous data under two data structures demonstrate the effectiveness of the proposed method, which is also verified and analyzed on a real example.
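
For illustration, here is a minimal numpy sketch of the empirical distance correlation on which HDC-type screening is based; it is not the authors' full HIM/HDC procedure, and the function name is ours.

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation between two univariate samples."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a = np.abs(x - x.T)                                   # pairwise distance matrices
    b = np.abs(y - y.T)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()     # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / dvar) if dvar > 0 else 0.0
```

Leave-one-out changes in such a statistic give one simple screening score for splitting the data into initial normal and outlier sets.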

2.
Identification of outliers and dimensionless processing of data in comprehensive evaluation
Addressing the phenomenon of outliers in comprehensive evaluation, this paper discusses three questions: whether outliers exist in the raw data, how to identify them if they do, and how to carry out dimensionless (normalization) processing of evaluation data that contain outliers. For outlier judgment and identification, it proposes taking the median as the reference point and comparing how far the sorted data at the two ends deviate from the median. For dimensionless processing of data containing outliers, building on the commonly used extreme-value (min-max) method, it proposes a piecewise scheme that assigns separate normalization intervals to outliers and non-outliers. Finally, comparison with the outlier identification and normalization results reported in the existing literature confirms the effectiveness of the proposed method, which achieves a moderate screening of outliers and yields a more balanced distribution of the normalized data.
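
As a rough illustration of the two ingredients described above (a median-referenced outlier rule and a piecewise min-max scaling that confines outliers to their own intervals), here is a hedged Python sketch; the MAD-based rule, the threshold k, and the interval endpoints are our own simplifications, not the paper's exact procedure.

```python
import numpy as np

def flag_outliers_by_median(x, k=2.5):
    """Flag points whose distance from the median exceeds k * MAD.
    (A simple median-referenced rule standing in for the paper's
    comparison of the sorted tail values' distances to the median.)"""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1e-12
    return np.abs(x - med) > k * mad

def piecewise_minmax(x, low=(0.0, 0.1), mid=(0.1, 0.9), high=(0.9, 1.0), k=2.5):
    """Min-max scale normal points into `mid`, low outliers into `low`
    and high outliers into `high`, so outliers keep their ordering but
    cannot compress the scaled range of the normal points."""
    x = np.asarray(x, dtype=float)
    out = flag_outliers_by_median(x, k)
    med = np.median(x)
    z = np.empty_like(x)
    for mask, (lo, hi) in ((out & (x < med), low), (~out, mid), (out & (x >= med), high)):
        if mask.any():
            xi = x[mask]
            span = (xi.max() - xi.min()) or 1.0
            z[mask] = lo + (hi - lo) * (xi - xi.min()) / span
    return z
```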

3.
Missing-data handling is an important step of data preprocessing in data mining. Because of the special geometric properties of compositional data, traditional imputation methods cannot be applied directly to this type of data, so imputing missing values in compositional data is of great importance. To solve this problem, this paper exploits the relationship between compositional data and Euclidean data and proposes an iterative random-forest-based imputation method for missing values in compositional data; the method is implemented and evaluated on both simulated and real data sets. The experimental results show that the new imputation method can be applied to many types of data sets and achieves high accuracy.
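
A rough Python sketch of the general idea (iterative random-forest imputation applied to log-transformed parts, then re-closing each row to a composition) is given below. It uses scikit-learn's IterativeImputer rather than the paper's algorithm, and a plain log transform rather than the log-ratio coordinates the paper presumably works in.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

def impute_composition(X):
    """X: (n, D) array of strictly positive parts, with np.nan for missing cells."""
    logX = np.log(X)                                # nan entries stay nan
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=200, random_state=0),
        max_iter=10, random_state=0)
    filled = np.exp(imputer.fit_transform(logX))    # back to positive parts
    return filled / filled.sum(axis=1, keepdims=True)   # re-close each row to sum to 1
```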

4.
From the new perspective that outliers simultaneously affect the estimates of the heterogeneity parameter and the regression coefficients, this paper uses a variance-weight outlier model (VWOM) to study the identification and correction of multiple outliers in the random-effects meta-regression model. First, Score (SC) test statistics for the meta-regression VWOM are derived under both ML and REML estimation; three perturbation schemes for the VWOM are considered (global variance perturbation, individual variance perturbation, and random-error perturbation), and the SC test statistics under the three variance perturbations are shown to be equivalent. Second, again taking into account the simultaneous impact of outliers on the heterogeneity parameter and the regression coefficients, a variance-weight outlier modified model (VWOMM) for random-effects meta-regression is proposed, and iterative ML and REML estimation algorithms for its parameters are given and solved numerically. In addition, the size and power of the SC test statistics are verified by Monte Carlo simulation. Finally, two real-data examples involving different types of effect sizes show that the SC test statistic of the meta-regression VWOM identifies outliers effectively and that the VWOMM clearly improves model fit, offering a new way to identify and handle outliers in complex data.

5.
With rising living standards, clinical and psychological studies related to quality of life have become increasingly common. Quality of life is a comprehensive assessment of many aspects of a person's life. Compared with research on adults, surveys and analyses of the quality of life of adolescents and young children are scarce, let alone of disabled orphans in welfare institutions. Taking the Changsha First Social Welfare Institution as an example, this paper surveys the quality of life of orphans using a quality-of-life scale suited to disabled children, with typically developing children surveyed with a general-purpose scale as a control group. The survey data were first scored and processed in EXCEL and then subjected to principal component analysis in SPSS; the analysis identifies problems in the orphans' lives and offers suggestions on what should be improved.

6.
Application of extension data mining to teaching quality evaluation in universities
Universities accumulate large amounts of data in their teaching and administrative work, but these data are not used effectively. This paper introduces extension data mining into the teaching domain to extract useful information hidden in teaching evaluation data and to provide decision support for teaching administrators. Extension analysis is first applied to find evaluation data whose quality is adequate for effective mining; these data are then mined in two directions: the key factors that affect teaching quality, and the association rules between teaching quality and teacher characteristics.

7.
To address the shortcomings of the existing method for computing the number of failures per thousand vehicles, this paper starts from the statistical method itself and proposes a new one by redefining the failures-per-thousand-vehicles statistic. Cluster analysis from data mining is then used to group batches with the same characteristics, and a general operations-research model is built. To handle missing data and near-term prediction, the general model is adjusted: different weights among similar data are "learned", and predictions are obtained by fitting curves to the weighted data. Because data for long-term prediction are severely lacking, those predictions are computed from a purely statistical point of view. The prediction model is general and widely applicable. SAS and MATLAB are used to solve the models; the predictions are accurate and consistent with reality.

8.
Current on-orbit satellite condition monitoring and anomaly detection based on telemetry data mainly extracts telemetry features with signal-processing methods such as spectral analysis. These methods have difficulty coping with the discrete values, large volume, and complex anomalies and noise of satellite telemetry, and the extracted features are not distinctive enough to meet the requirements of telemetry anomaly detection. This paper proposes a fluctuation-feature-based method that takes the change frequency, or cumulative number of changes, of the telemetry data as the feature; it is simple to implement, fast, efficient, and insensitive to anomalous data. Based on the extracted fluctuation features, an on-orbit anomaly detection method using the sequential probability ratio test (SPRT) is proposed. Case analysis shows that the extracted features identify satellite anomalies well, with high computational efficiency and good detection performance.
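
A minimal sketch of the two steps described above, assuming a Poisson model for the per-window change counts; the distributional assumption, window length, and error rates are ours, not the paper's.

```python
import numpy as np

def fluctuation_counts(x, window):
    """Number of value changes of a discrete telemetry series in each window."""
    changes = (np.diff(x) != 0).astype(int)
    n_win = len(changes) // window
    return changes[:n_win * window].reshape(n_win, window).sum(axis=1)

def sprt_poisson(counts, lam0, lam1, alpha=0.01, beta=0.01):
    """Sequential probability ratio test on per-window change counts,
    H0: change rate lam0 (normal) vs H1: change rate lam1 (anomalous)."""
    upper = np.log((1 - beta) / alpha)     # cross it -> accept H1 (anomaly)
    lower = np.log(beta / (1 - alpha))     # cross it -> accept H0 (normal)
    llr = 0.0
    for t, k in enumerate(counts):
        llr += k * np.log(lam1 / lam0) - (lam1 - lam0)   # Poisson log-likelihood ratio
        if llr >= upper:
            return "anomaly", t
        if llr <= lower:
            return "normal", t
    return "undecided", len(counts) - 1
```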

9.
To address the problem that model identification and parameter estimation in ARMA modeling are easily affected by outlying observations, an ARMA model that allows for both additive outliers and innovational outliers is constructed. A Bayesian Markov chain Monte Carlo method based on Gibbs sampling is used to estimate the parameters of the robust ARMA model while simultaneously locating the outliers among the observations and identifying their types. A simulation study using China's natural population growth data shows that the Bayesian method can effectively identify outliers in an ARMA series.

10.
For anomaly detection in multivariate data that contain several normal classes, a semi-supervised method based on a multi-class Mahalanobis–Taguchi system is proposed. A Mahalanobis space is built for each normal class in the training set, yielding a multi-class measurement scale based on Mahalanobis distance; the method classifies the normal observations in the test set and at the same time detects the anomalous ones. Its effectiveness is verified on simulated Gaussian mixture data containing outliers.
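
The core idea, one Mahalanobis space per normal class with a distance threshold deciding between classification and anomaly flagging, can be sketched as follows; the single fixed threshold is a simplification of the Mahalanobis–Taguchi calibration and the names are ours.

```python
import numpy as np

class MultiClassMahalanobis:
    """Build a Mahalanobis space per normal class; assign a test point to its
    nearest class, or flag it as anomalous if it is far from every class
    (a sketch of the idea, not the paper's exact construction)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.spaces = {}                              # label -> (mean, inverse covariance)

    def fit(self, X, y):
        for label in np.unique(y):
            Xc = X[y == label]
            self.spaces[label] = (Xc.mean(axis=0), np.linalg.pinv(np.cov(Xc, rowvar=False)))
        return self

    def predict(self, X):
        labels, flags = [], []
        for x in X:
            d2 = {lab: (x - mu) @ icov @ (x - mu) for lab, (mu, icov) in self.spaces.items()}
            best = min(d2, key=d2.get)
            labels.append(best)
            flags.append(d2[best] > self.threshold)   # anomalous if far from all classes
        return np.array(labels), np.array(flags)
```

A chi-squared quantile such as scipy.stats.chi2.ppf(0.99, df=p) is one natural threshold choice for p-dimensional, roughly Gaussian classes.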

11.
聂斌  王曦  胡雪 《运筹与管理》2019,28(1):101-107
In the field of quality control, identifying outliers in nonlinear profiles is one of the key research problems. This paper combines wavelet analysis, data depth, and cluster analysis to propose a new method for identifying outliers caused by non-normal variation. Simulation experiments comparing the new method with the χ2 control chart show that the new method identifies outliers with higher accuracy and stability, i.e., better outlier-identification performance. Finally, the method is applied to the vertical density profiles of wood boards; the analysis shows that it can effectively identify anomalous profile data.

12.

The massive flood of numbers in ongoing large-scale periodic economic and social surveys commonly leaves little time for anything but a cursory examination of the quality of the data, and few techniques exist for giving an overview of data activity. At the U.S. Bureau of Labor Statistics, a graphical and query-based solution to these problems has recently been adopted for data review in the Current Employment Statistics survey. Chief among the motivations for creating the new system were: (1) Reduce or eliminate the arduous paper review of thousands of sample reports by review analysts; (2) allow the review analysts a more global view of sample activity and at the same time make outlier detection less of a strain; and (3) present global views of estimates over time and among groups of subestimates. The specific graphics approaches used in the new system were designed to quickly portray both time series and cross-sectional aspects of the data, as these are both critical elements in the review process. The described system allows the data analysts to track down suspicious sample members by first graphically pinpointing questionable estimates, and then pinpointing questionable sample data used to produce those estimates. Query methods are used for cross-checking relationships among different sample data elements. Although designed for outlier detection and estimation, the data-representation methods employed in the system have opened up new possibilities for further statistical and economic uses of the data. The authors were torn between the desire for a completely automatic system of data review and the practical demands of an actual survey operating under imperfect conditions, and thus viewed the new system as an evolutionary advance, not as an ideal final solution. Possibilities opened up by the new system prompted some further thinking on finding an ideal state.

13.
The problem of detection of multidimensional outliers is a fundamental and important problem in applied statistics. The unreliability of multivariate outlier detection techniques such as Mahalanobis distance and hat matrix leverage has led to the development of techniques which have been known in the statistical community for well over a decade. The literature on this subject is vast and growing. In this paper, we propose to use the artificial intelligence technique of the self-organizing map (SOM) for detecting multiple outliers in multidimensional datasets. SOM, which produces a topology-preserving mapping of the multidimensional data cloud onto a lower dimensional visualizable plane, provides an easy way of detecting multidimensional outliers in the data, at respective levels of leverage. The proposed SOM based method for outlier detection not only identifies the multidimensional outliers, it actually provides information about the entire outlier neighbourhood. Being an artificial intelligence technique, the SOM based outlier detection technique is non-parametric and can be used to detect outliers from very large multidimensional datasets. The method is applied to detect outliers from varied types of simulated multivariate datasets, a benchmark dataset and also to a real life cheque processing dataset. The results show that SOM can effectively be used as a useful technique for multidimensional outlier detection.
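
One common way to turn a trained SOM into an outlier screen is to flag samples that sit unusually far from their best-matching unit. The sketch below uses the third-party MiniSom package and a simple quantile cutoff; it is an illustration of that idea, not the procedure described in the paper, and the grid size, iteration count, and cutoff are arbitrary choices.

```python
import numpy as np
from minisom import MiniSom

def som_outliers(X, grid=(10, 10), n_iter=5000, quantile=0.99):
    som = MiniSom(grid[0], grid[1], X.shape[1], sigma=1.0, learning_rate=0.5,
                  random_seed=0)
    som.random_weights_init(X)
    som.train_random(X, n_iter)
    # distance of each sample to the weight vector of its best-matching unit
    q_err = np.linalg.norm(X - som.quantization(X), axis=1)
    return q_err > np.quantile(q_err, quantile), q_err
```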

14.
Outlier mining is an important aspect of data mining, and outlier mining based on Cook's distance is the most commonly used approach. However, when the data exhibit multicollinearity, the traditional Cook's distance is no longer effective. Considering the strengths of principal component estimation, we use it in place of least squares estimation and then give a Cook's distance measure based on principal component estimation, which can be used in outlier mining. Related theoretical and application problems are also investigated.
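
A hedged sketch of such a diagnostic: run the regression on the leading principal component scores and compute the usual Cook's distance from the resulting hat matrix and residuals. The exact measure in the paper may differ; the function below only illustrates the idea.

```python
import numpy as np

def pcr_cooks_distance(X, y, n_components):
    """Cook's distance computed from a principal-component regression of y on X."""
    X = X - X.mean(axis=0)
    y = np.asarray(y, dtype=float)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:n_components].T                      # n x k principal component scores
    Z1 = np.column_stack([np.ones(len(Z)), Z])       # add intercept
    H = Z1 @ np.linalg.inv(Z1.T @ Z1) @ Z1.T         # hat matrix of the PC regression
    e = y - H @ y                                    # residuals
    p = Z1.shape[1]
    s2 = e @ e / (len(y) - p)
    h = np.diag(H)
    return (e ** 2 / (p * s2)) * h / (1 - h) ** 2    # Cook's distance per observation
```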

15.
Advances in Data Analysis and Classification - We propose two new outlier detection methods for identifying and classifying different types of outliers in (big) functional data sets. The proposed...

16.
We propose new tools for visualizing large amounts of functional data in the form of smooth curves. The proposed tools include functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey’s data depth and highest density regions.

By-products of our graphical displays are outlier detection methods for functional data. We compare these new outlier detection methods with existing methods for detecting outliers in functional data, and show that our methods are better able to identify outliers.

An R-package containing computer code and datasets is available in the online supplements.

17.
Cluster-based outlier detection
Outlier detection has important applications in the field of data mining, such as fraud detection, customer behavior analysis, and intrusion detection. Outlier detection is the process of detecting the data objects which are grossly different from or inconsistent with the remaining set of data. Outliers are traditionally considered as single points; however, there is a key observation that many abnormal events have both temporal and spatial locality, which might form small clusters that also need to be deemed as outliers. In other words, not only a single point but also a small cluster can probably be an outlier. In this paper, we present a new definition for outliers: cluster-based outlier, which is meaningful and provides importance to the local data behavior, and we show how to detect outliers by the clustering algorithm LDBSCAN (Duan et al. in Inf. Syst. 32(7):978–986, 2007), which is capable of finding clusters and assigning LOF (Breunig et al. in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 93–104, 2000) to single points.
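
As a rough illustration of the cluster-based-outlier idea (not the paper's LDBSCAN construction), the sketch below combines scikit-learn's DBSCAN and LocalOutlierFactor: noise points and members of very small clusters that also carry high LOF scores are treated as outlier candidates. The size and score cutoffs are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

def cluster_based_outliers(X, eps=0.5, min_samples=5, small=0.02, n_neighbors=20):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_          # larger = more outlying
    sizes = {lab: np.sum(labels == lab) for lab in set(labels)}
    # DBSCAN noise (-1) and members of very small clusters are candidates
    is_small = np.array([labels[i] == -1 or sizes[labels[i]] < small * len(X)
                         for i in range(len(X))])
    return is_small & (scores > np.quantile(scores, 0.9)), labels, scores
```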

18.
In many large-scale project scheduling problems, multiple projects either take place at the same time or are scheduled into a tight sequence in order to efficiently share a common resource. One example is the allocation of computing resources at an Application Service Provider (ASP), which provides data processing services for multiple paying customers. Typical services provided by ASPs are data mining, payroll processing, internet-based storage backup services, and Customer Relation Management (CRM) services. The processing mode of an ASP can be either batch or concurrent, depending on the type of service rendered. For example, for CPU-intensive services or services requiring long processing times, it is more economical to process one customer request at a time in order to minimize the context-switching overhead. While the data transaction processes within a service request are subject to certain precedence relationships, the requests from different customers to an ASP are independent of each other, and the total time required to process a service request depends on the computing resource allocated to that request. The related issue of achieving an optimal use of resources at ASPs leads to the problem of project scheduling with controllable project duration. In this paper, we present efficient algorithms for solving several special cases of such multi-project scheduling problems with controllable project duration and hard resource constraints. Two types of problems are considered. In type I, the duration of each project includes a constant and a term that is inversely proportional to the amount of resource allocated. In type II, the duration of each individual project is a continuous decreasing function of the amount of resource allocated.
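
As a toy illustration of the type-I duration model only (not the paper's algorithms, which handle precedence and hard resource constraints): if a divisible resource budget R is split once among independent projects with durations d_i = a_i + b_i / r_i, minimizing the summed duration subject to sum(r_i) = R gives r_i proportional to sqrt(b_i).

```python
import numpy as np

def allocate_budget(a, b, R):
    """Split a divisible budget R among independent projects with durations
    d_i = a_i + b_i / r_i so that the summed duration is minimal;
    the Lagrangian conditions give r_i proportional to sqrt(b_i)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    r = R * np.sqrt(b) / np.sqrt(b).sum()
    return r, (a + b / r).sum()

# e.g. allocate_budget(a=[2.0, 1.0, 3.0], b=[4.0, 9.0, 1.0], R=10.0)
```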

19.
The meaning of the parameter ν in the linear ν-support vector regression machine (ν-SVR) is discussed and a rigorous theoretical proof is given. Using the ε-insensitive loss function and the interpretation of ν in ν-SVR, an outlier detection method for regression data is proposed. The use of a linear model makes the method not only fast but also able to handle large-scale data. Numerical experiments demonstrate its feasibility and effectiveness.
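
A rough sketch of how the ν-SVR tube can be used for outlier screening: fit a linear NuSVR and flag the observations with the largest absolute residuals, using the fact that roughly a ν-fraction of points lies outside the ε-insensitive tube. scikit-learn's NuSVR does not expose the fitted ε, so it is approximated here by a residual quantile; this is our simplification, not the paper's method.

```python
import numpy as np
from sklearn.svm import NuSVR

def nusvr_outliers(X, y, nu=0.05):
    model = NuSVR(nu=nu, C=1.0, kernel="linear").fit(X, y)
    resid = np.abs(y - model.predict(X))
    eps_hat = np.quantile(resid, 1.0 - nu)   # roughly a nu-fraction lies outside the tube
    return resid > eps_hat, resid
```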

20.
This paper suggests an outlier detection procedure which applies a nonparametric model accounting for undesired outputs and exogenous influences in the sample. Although efficiency is estimated in a deterministic frontier approach, each potential outlier is initially given the benefit of the doubt of not being an outlier. We survey several outlier detection procedures and select five complementary methodologies which, taken together, are able to detect all influential observations. To exploit the singularity of the leverage and the peer count, the super-efficiency and the order-m method, and the peer index, it is proposed to select as outliers those observations which are simultaneously revealed as atypical by at least two of the procedures. A simulated example demonstrates the usefulness of this approach. The model is applied to the Portuguese drinking water sector, for which we have an unusually rich data set.
