Similar literature
Found 20 similar articles (search time: 46 ms)
1.
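With the rapid development of information technology, each record carries increasingly rich information, so data inevitably contain outliers, and the higher the dimension, the more likely outliers become. Traditional principal component cluster analysis is especially sensitive to outliers. A principal component clustering method based on the MCD (minimum covariance determinant) estimator offers protection against outliers, but for high-dimensional data the bias of the MCD estimator becomes too large and its robustness drops markedly; moreover, the MCD estimator breaks down when the dimension exceeds the number of observations. This paper therefore proposes a robust principal component clustering method based on the MRCD (minimum regularized covariance determinant) estimator. Numerical simulation and empirical analysis show that principal component cluster analysis based on the MRCD estimator outperforms both traditional principal component cluster analysis and the MCD-based version, and that the MRCD estimator is particularly effective when the dimension exceeds the number of sample observations.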

2.
Robust techniques for multivariate statistical methods, such as principal component analysis, canonical correlation analysis, and factor analysis, have recently been constructed. In contrast to the classical approach, these robust techniques are able to resist the effect of outliers. However, there does not yet exist a graphical tool to identify in a comprehensive way the data points that do not obey the model assumptions. Our goal is to construct such graphics based on empirical influence functions. These graphics not only detect the influential points but also classify the observations according to their robust distances. In this way the observations are divided into four classes: regular points, nonoutlying influential points, influential outliers, and noninfluential outliers. We thus gain additional insight into the data by detecting different types of deviating observations. Some real data examples are given to show how these plots can be used in practice.

3.
Kernel principal component analysis (KPCA) extends linear PCA from a real vector space to any high-dimensional kernel feature space. The sensitivity of linear PCA to outliers is well known, and various robust alternatives have been proposed in the literature. For KPCA such robust versions have received considerably less attention. In this article we present kernel versions of three robust PCA algorithms: spherical PCA, projection pursuit and ROBPCA. These robust KPCA algorithms are analyzed in a classification context by applying discriminant analysis to the KPCA scores. The performance of the different robust KPCA algorithms is studied in a simulation study comparing misclassification percentages on both clean and contaminated data. An outlier map is constructed to visualize outliers in such classification problems. A real-life example from protein classification illustrates the usefulness of robust KPCA and its corresponding outlier map.

4.
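Principal component analysis is a classical dimension reduction technique in multivariate statistical analysis. However, classical PCA is very sensitive to outliers, whose presence often leads to results that do not match reality. Moreover, when PCA is used for comprehensive evaluation, the meaning of the principal components is often ambiguous because the absolute values of the loadings are not clearly separated, which makes the evaluation difficult to carry out. This paper applies robust sparse principal component analysis in simulation experiments and an empirical analysis. The results show that the method not only resists the influence of outliers well but also accurately identifies outlying samples. The principal components it produces are also clearer in meaning and closer to reality than those from classical PCA and robust PCA.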

5.
The focus of this paper is to propose an approach to construct histogram values for the principal components of interval-valued observations. Le-Rademacher and Billard (J Comput Graph Stat 21:413–432, 2012) show that for a principal component analysis on interval-valued observations, the resulting observations in principal component space are polytopes formed by the convex hulls of linearly transformed vertices of the observed hyper-rectangles. In this paper, we propose an algorithm to translate these polytopes into histogram-valued data to provide numerical values for the principal components to be used as input in further analysis. Other existing methods of principal component analysis for interval-valued data construct the principal components themselves as intervals, which implicitly assumes that all values within an observation are uniformly distributed along the principal component axes. However, this assumption is only true in special cases where the variables in the dataset are mutually uncorrelated. The representation of the principal components as histogram values proposed herein more accurately reflects the variation in the internal structure of the observations in a principal component space. As a consequence, subsequent analyses using histogram-valued principal components as input result in improved accuracy.

6.
Principal component analysis (PCA) is a widely used tool for data analysis and dimension reduction in applications throughout science and engineering. However, the principal components (PCs) can sometimes be difficult to interpret, because they are linear combinations of all the original variables. To facilitate interpretation, sparse PCA produces modified PCs with sparse loadings, i.e. loadings with very few non-zero elements. In this paper, we propose a new sparse PCA method, namely sparse PCA via regularized SVD (sPCA-rSVD). We use the connection of PCA with the singular value decomposition (SVD) of the data matrix and extract the PCs by solving a low-rank matrix approximation problem. Regularization penalties are introduced into the corresponding minimization problem to promote sparsity in the PC loadings. An efficient iterative algorithm is proposed for computation. Two tuning parameter selection methods are discussed. Some theoretical results are established to justify the use of sPCA-rSVD when only the data covariance matrix is available. In addition, we give a modified definition of the variance explained by the sparse PCs. The sPCA-rSVD provides a uniform treatment of both classical multivariate data and high-dimension-low-sample-size (HDLSS) data. Further understanding of sPCA-rSVD and some existing alternatives is gained through simulation studies and real data examples, which suggest that sPCA-rSVD provides competitive results.
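
The idea of obtaining sparse loadings from a penalized rank-one approximation can be illustrated with a small sketch. The abstract does not state which penalty or update rule is used, so the following assumes an L1 (lasso-type) penalty handled by alternating soft-thresholding, a dense starting vector instead of an SVD-based start, and a column-centred data matrix given as a list of rows. The names `soft_threshold`, `sparse_loading` and `lam` are hypothetical and chosen only for this illustration.

```python
import math

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; any |x| <= lam becomes zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def sparse_loading(X, lam, n_iter=50):
    """First sparse loading vector of a column-centred matrix X (list of rows).

    Assumed scheme (not taken from the abstract): alternate
      u <- X v / ||X v||   and   v <- soft_threshold(X^T u, lam).
    """
    n, p = len(X), len(X[0])
    v = [1.0 / math.sqrt(p)] * p  # dense starting direction (assumption)
    for _ in range(n_iter):
        # u-step: project the rows onto the current loading and normalise
        u = [sum(row[j] * v[j] for j in range(p)) for row in X]
        norm_u = math.sqrt(sum(a * a for a in u))
        if norm_u == 0.0:
            break  # the penalty has removed every loading
        u = [a / norm_u for a in u]
        # v-step: penalised regression of X on u, solved by soft-thresholding
        v = [soft_threshold(sum(X[i][j] * u[i] for i in range(n)), lam)
             for j in range(p)]
    norm_v = math.sqrt(sum(b * b for b in v))
    return [b / norm_v for b in v] if norm_v > 0.0 else v
```

With a larger `lam`, more loadings are set exactly to zero, which is the interpretability gain the abstract describes; `lam = 0` reduces the loop to a plain power iteration for the leading right singular vector.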

7.
This article extends the analysis of multivariate transformations to linear and quadratic discriminant analysis. It shows that the standard application of deletion diagnostic techniques for validating a particular transformation suffers from masking and so may fail if several outliers are present. We therefore suggest a simple and powerful method based on a forward search algorithm. This robust diagnostic procedure orders the observations from those most in agreement with the suggested model to those least in agreement with it. It provides a unified approach to the detection of influential observations and outliers in discriminant analysis. Simulated and real data are used to show the necessity of considering multivariate transformations in discriminant analysis. The examples demonstrate the power of the suggested approach in revealing the correct structure of the data when this is obscured by outliers.

8.
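Principal component analysis is a multivariate statistical method that reduces many indicators to a small number of uncorrelated composite indicators (the principal components). By applying the method to the raw data on the main industrial and agricultural indicators of China's Taiwan region for 1989 to 1996, this paper shows that principal component analysis is indeed a practical and commonly used statistical method.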

9.
A variable selection method using global score estimation is proposed, which is applicable as a selection criterion in any multivariate method without external variables, such as principal component analysis, factor analysis and correspondence analysis. This method selects a subset of variables that approximates the original global scores as closely as possible in the least squares sense, where the global scores, e.g. principal component scores, factor scores and individual scores, are computed from the selected variables. Global scores are usually orthogonal, so the estimated global scores should be restricted to being mutually orthogonal. Depending on how that restriction is satisfied, we propose three computational steps to estimate the scores. Example data are analyzed to demonstrate the performance and usefulness of the proposed method: the proposed algorithm is evaluated and the results obtained using four cost-saving selection procedures are compared. This example shows that combining these steps and procedures yields more accurate results quickly.

10.
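On the basis of a reasonable choice of data normalization method, a variable-weighted principal component analysis method is established for evaluating water pollution, and the resulting water pollution grades are verified by discriminant analysis. The empirical results show that the improved principal component analysis method and evaluation system perform significantly better than existing methods.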

11.
We propose new tools for visualizing large amounts of functional data in the form of smooth curves. The proposed tools include functional versions of the bagplot and boxplot, which make use of the first two robust principal component scores, Tukey’s data depth and highest density regions.

By-products of our graphical displays are outlier detection methods for functional data. We compare these new outlier detection methods with existing methods for detecting outliers in functional data, and show that our methods are better able to identify outliers.

An R package containing computer code and datasets is available in the online supplements.

12.
Dimension reduction techniques are at the core of the statistical analysis of high-dimensional and functional observations. Whether the data are vector- or function-valued, principal component techniques play a central role in this context. The success of principal components in the dimension reduction problem is explained by the fact that, for any \(K\le p\), the first K coefficients in the expansion of a p-dimensional random vector \(\mathbf{X}\) in terms of its principal components provide the best linear K-dimensional summary of \(\mathbf X\) in the mean square sense. The same property holds true for a random function and its functional principal component expansion. This optimality feature, however, no longer holds true in a time series context: when the observations are serially dependent, principal components and functional principal components lose their optimal dimension reduction property to the so-called dynamic principal components introduced by Brillinger in 1981 in the vector case and, in the functional case, to their functional extension proposed by Hörmann, Kidziński and Hallin in 2015.
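
The mean-square optimality mentioned in this abstract can be written out explicitly. As a sketch, assume \(\mathbf X\) has mean \(\boldsymbol\mu\) and covariance matrix \(\boldsymbol\Sigma\) with eigenvalues \(\lambda_1\ge\dots\ge\lambda_p\) and orthonormal eigenvectors \(\mathbf v_1,\dots,\mathbf v_p\); this notation is not in the abstract and is introduced here only for illustration. Then, for any \(K\le p\),

\[
\min_{\mathbf a_1,\dots,\mathbf a_K\ \text{orthonormal}}
E\Big\| (\mathbf X-\boldsymbol\mu) - \sum_{k=1}^{K} \mathbf a_k \mathbf a_k^{\top} (\mathbf X-\boldsymbol\mu) \Big\|^2
= \sum_{j=K+1}^{p} \lambda_j ,
\]

with the minimum attained at \(\mathbf a_k=\mathbf v_k\). The K summary coefficients are then the principal component scores \(\mathbf v_k^{\top}(\mathbf X-\boldsymbol\mu)\), \(k=1,\dots,K\). The identity follows because the expected squared error of an orthogonal projection \(\mathbf P\) equals \(\operatorname{tr}(\boldsymbol\Sigma)-\operatorname{tr}(\mathbf P\boldsymbol\Sigma)\), and \(\operatorname{tr}(\mathbf P\boldsymbol\Sigma)\) is largest when \(\mathbf P\) projects onto the span of the K leading eigenvectors.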

13.
Because of its orthogonality, interpretability and best-representation property, the functional principal component analysis approach has been used extensively to estimate the slope function in the functional linear model. However, the polynomial spline method, a very popular smoothing technique in nonparametric and semiparametric regression, has received little attention in the functional data case. In this paper, we propose the polynomial spline method to estimate a partial functional linear model. Some asymptotic results are established, including asymptotic normality for the parameter vector and the global rate of convergence for the slope function. Finally, we evaluate the performance of our estimation method through simulation studies.

14.
We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, the factorial analysis for mixed data, which balances the influence of all the variables, continuous and categorical, in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based on the similarity between individuals and on the relationships between variables. The properties of the method are illustrated via simulations and the quality of the imputation is assessed using real data sets. The method is compared to a recent method (Stekhoven and Bühlmann Bioinformatics 28:113–118, 2011) based on random forests and shows better performance, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.

15.
In this paper, we propose composite quantile regression for the functional linear model with dependent data, in which the errors are from a short-range dependent and strictly stationary linear process. Functional principal component analysis is employed to approximate the slope function and the functional predictive variable respectively, in order to construct an estimator of the slope function, and the convergence rate of the estimator is obtained under some regularity conditions. Simulation studies and a real data analysis are presented to illustrate the performance of the proposed estimator.

16.
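Outlier identification and nondimensionalization methods in comprehensive evaluation   Total citations: 1 (self-citations: 0, citations by others: 1)
Addressing outliers in comprehensive evaluation, this paper discusses three questions: whether the raw data contain outliers, how to identify them if they do, and how to nondimensionalize evaluation data that contain outliers. For judging and identifying outliers, it takes the median as the reference point and compares how far the data at the two ends of the sorted sample deviate from the median. For nondimensionalizing evaluation data that contain outliers, it builds on the commonly used extreme-value processing method and proposes a piecewise nondimensionalization method that assigns separate target intervals to outliers and non-outliers. Finally, a comparison with the outlier identification and nondimensionalization results in the existing literature confirms the effectiveness of the method: it screens outliers moderately and makes the distribution of the nondimensionalized data more balanced.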
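
The two steps outlined in this abstract, median-referenced outlier identification followed by piecewise rescaling, might look roughly like the sketch below. The abstract gives no formulas, so everything specific here is an assumption made for illustration: the cut-off rule (distance from the median greater than `k` times the median absolute deviation), the target intervals (non-outliers to [0.1, 0.9], low outliers to [0, 0.1), high outliers to (0.9, 1]), and the function names `flag_outliers`, `rescale` and `piecewise_normalise`.

```python
import statistics

def flag_outliers(values, k=3.0):
    """Flag values whose distance from the median exceeds k * MAD (assumed rule)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)  # no spread around the median: flag nothing
    return [abs(v - med) > k * mad for v in values]

def rescale(v, lo, hi, a, b):
    """Min-max map of v from [lo, hi] onto [a, b]."""
    if hi == lo:
        return (a + b) / 2
    return a + (b - a) * (v - lo) / (hi - lo)

def piecewise_normalise(values, k=3.0, inner=(0.1, 0.9)):
    """Rescale non-outliers to `inner`; outliers go to the bands outside it."""
    med = statistics.median(values)
    flags = flag_outliers(values, k)
    normal = [v for v, f in zip(values, flags) if not f]
    lo, hi = min(normal), max(normal)
    low_out = [v for v, f in zip(values, flags) if f and v < med]
    high_out = [v for v, f in zip(values, flags) if f and v > med]
    result = []
    for v, f in zip(values, flags):
        if not f:
            result.append(rescale(v, lo, hi, inner[0], inner[1]))
        elif v > med:
            result.append(rescale(v, hi, max(high_out), inner[1], 1.0))
        else:
            result.append(rescale(v, min(low_out), lo, 0.0, inner[0]))
    return result
```

Giving outliers their own narrow band keeps a single extreme value from compressing all ordinary observations into a small part of the unit interval, which is the balance property the abstract reports.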

17.
Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations or high leverage points, but also data points which lie among the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivates our work, because they may provide information about anomalies like fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.
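
The thinning step described in this abstract, keeping each point with a probability that decreases with the estimated density at that point, can be sketched in one dimension as follows. The abstract does not give the density estimator or the exact inverse function, so this sketch assumes a Gaussian kernel density estimate with a fixed bandwidth `h` and the retention probability `min_density / density`, which gives the sparsest point probability 1. All function names are hypothetical.

```python
import math
import random

def gaussian_kde(points, h):
    """Return a 1-D Gaussian kernel density estimate with bandwidth h."""
    norm = 1.0 / (len(points) * h * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points)
    return density

def retention_probabilities(points, h):
    """Retention probability inversely proportional to the estimated density."""
    f = gaussian_kde(points, h)
    dens = [f(x) for x in points]
    c = min(dens)  # scale so the lowest-density point is always retained
    return [c / d for d in dens]

def thin(points, h, seed=0):
    """Keep each point independently with its retention probability."""
    rng = random.Random(seed)
    probs = retention_probabilities(points, h)
    return [x for x, p in zip(points, probs) if rng.random() < p]
```

Points in a crowded region are mostly discarded, while isolated points are kept, so a robust clustering method applied to the thinned sample is less dominated by the high-density region.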

18.
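Water quality evaluation method based on principal component analysis   Total citations: 6 (self-citations: 0, citations by others: 6)
Principal component analysis can replace the original multidimensional variables with a small number of composite variables while keeping the loss of information in the raw data to a minimum, which greatly simplifies the data structure; it also determines the variable weights objectively, avoiding subjective arbitrariness. Principal component analysis is applied to evaluate the surface water environment of Changchun and is compared with other evaluation methods. The results show that principal component analysis is more objective and offers stronger guidance, making it an effective water quality evaluation method. Water quality evaluation by principal component analysis can provide a more objective reference for water resource planning, utilization and development and for the optimization of environmental systems.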

19.
This paper studies how to identify influential observations in the functional linear model in which the predictor is functional and the response is scalar. The effects of a single observation on estimation and prediction are measured when the model is estimated by the principal components method. To this end, three statistics are introduced for measuring the influence of each observation on estimation and prediction of the functional linear model with scalar response; they generalize the measures proposed for the standard regression model by [D.R. Cook, Detection of influential observations in linear regression, Technometrics 19 (1977) 15-18; D. Peña, A new statistic for influence in linear regression, Technometrics 47 (2005) 1-12] respectively. A smoothed bootstrap method is proposed to estimate the quantiles of the influence measures, which allows us to point out which observations have the larger influence on estimation and prediction. The behavior of the three statistics and of the bootstrap-based quantile estimation method is analyzed via a simulation study. Finally, the practical use of the proposed statistics is illustrated by the analysis of a real data example, which shows that the proposed measures are useful for detecting heterogeneity in the functional linear model with scalar response.

20.
In this paper, we propose composite quantile regression for the functional linear model with dependent data, in which the errors are from a short-range dependent and strictly stationary linear process. Functional principal component analysis is employed to approximate the slope function and the functional predictive variable respectively, in order to construct an estimator of the slope function, and the convergence rate of the estimator is obtained under some regularity conditions. Simulation studies and a real data analysis are presented to illustrate the performance of the proposed estimator.
