Similar Literature
20 similar documents found.
1.
Principal component analysis (PCA) is one of the most popular multivariate data analysis techniques for dimension reduction and data mining, and is widely used in many fields ranging from industry and biology to finance and social development. When working with big data, it is often necessary to consider the online version of PCA, in which only a small subset of samples can be stored. To handle the online PCA problem, Oja (1982) presented the stochastic power method under the assumption of zero-mean samples, and there has been much theoretical analysis of this method, along with modified versions, in recent years. However, the common circumstance where the samples have nonzero mean is seldom studied. In this paper, we derive the convergence rate of a nonzero-mean version of Oja's algorithm with diminishing step sizes. In the analysis, we handle the dependence between iterations caused by the updated mean term used for data centering. Furthermore, we verify the theoretical results by several numerical tests on both artificial and real datasets. Our work offers a way to deal with top-1 online PCA when the mean of the given data is unknown.
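The setting described above, Oja's top-1 update with an online mean estimate for centering and diminishing step sizes, can be sketched as follows. This is only an illustration of the idea, not the authors' exact algorithm; the step-size constant `eta0` is an assumption.

```python
import numpy as np

def oja_top1_nonzero_mean(stream, dim, eta0=1.0):
    """Top-1 Oja iteration with a running mean for data centering and
    diminishing step sizes eta_t = eta0 / t (constants illustrative)."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    mu = np.zeros(dim)
    for t, x in enumerate(stream, start=1):
        mu += (x - mu) / t                 # online mean update
        xc = x - mu                        # centre with the current mean estimate
        w += (eta0 / t) * xc * (xc @ w)    # Oja update with step eta0/t
        w /= np.linalg.norm(w)             # project back to the unit sphere
    return w
```

The returned direction can be compared with the top eigenvector of the batch sample covariance to check convergence.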

2.
Deleting Outliers in Robust Regression with Mixed Integer Programming
In robust regression we often have to decide how many unusual observations should be removed from the sample in order to obtain a better fit for the remaining observations. Generally, we use the basic principle of LTS, which is to fit the majority of the data, identifying as outliers those points that cause the biggest damage to the robust fit. However, in the LTS regression method the choice of default values for a high breakdown point seriously affects the efficiency of the estimator. In the proposed approach we introduce a penalty cost for discarding an outlier; consequently, the best fit for the majority of the data is obtained by discarding only catastrophic observations. This penalty cost is based on robust design weights and a high-breakdown-point residual scale taken from the LTS estimator. The robust estimate is obtained by solving a convex quadratic mixed integer programming problem in which the objective function minimizes the sum of the squared residuals and the penalties for discarded observations. The proposed mathematical programming formulation is suitable for small-sample data. Moreover, we conduct a simulation study to compare other robust estimators with our approach in terms of efficiency and robustness.
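The objective sketched above, squared residuals plus a per-observation cost for discarding, amounts to minimizing sum_i min(r_i^2, c_i) over the coefficients. The following alternating heuristic illustrates that objective with a constant penalty; it is not the paper's exact mixed integer quadratic program (which solves the problem globally and derives the penalties from LTS weights and scale).

```python
import numpy as np

def penalized_trimmed_ls(X, y, penalty, n_iter=50):
    """Alternating heuristic for min_beta sum_i min(r_i^2, penalty):
    an observation is discarded exactly when its squared residual
    exceeds its penalty cost. Illustrative local heuristic only."""
    X1 = np.column_stack([np.ones(len(y)), X])
    keep = np.ones(len(y), dtype=bool)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        r2 = (y - X1 @ beta) ** 2
        new_keep = r2 <= penalty        # keep iff residual cost beats penalty
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return beta, keep
```

Each pass refits on the kept set and re-decides the keep/discard indicators, so only observations whose residuals remain catastrophic are dropped.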

3.
Principal component analysis (PCA) is an effective statistical method for data analysis, feature extraction and data compression. It reduces many correlated variables to a linear combination of a few uncorrelated ones, replacing the original data with as few comprehensive indices as possible while still reflecting the information the original data provide. This paper studies a signal feature extraction algorithm based on PCA and extracts features from sequences generated by the Logistic map. We then measure the complexity of the reconstructed chaotic sequences with the permutation entropy algorithm. The test results show that the complexity of the reconstructed sequences is significantly higher than that of the original sequences.
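Two of the ingredients above, the Logistic map and permutation entropy (Bandt-Pompe ordinal patterns of order m), can be sketched minimally as follows; the PCA reconstruction step is omitted, and the parameter choices (r = 4, m = 3) are standard defaults, not necessarily the paper's.

```python
import numpy as np
from math import factorial

def logistic_map(n, r=4.0, x0=0.1):
    """Logistic map x_{k+1} = r x_k (1 - x_k), chaotic at r = 4."""
    x = np.empty(n)
    x[0] = x0
    for k in range(1, n):
        x[k] = r * x[k - 1] * (1 - x[k - 1])
    return x

def permutation_entropy(x, m=3, tau=1):
    """Normalised permutation entropy: the Shannon entropy of the
    distribution of length-m ordinal patterns, divided by log(m!)
    so the result lies in [0, 1]."""
    span = (m - 1) * tau
    counts = {}
    for i in range(len(x) - span):
        pattern = tuple(np.argsort(x[i:i + span + 1:tau]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum() / np.log(factorial(m)))
```

A monotone sequence has a single ordinal pattern and entropy 0, while a chaotic sequence spreads mass over many patterns.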

4.
Portfolio selection is an important issue in finance, involving the balance between risk and return. This paper investigates portfolio selection under the Mean-CVaR model in a nonparametric framework with α-mixing data, since financial data tend to be dependent. Many works have provided insight into the performance of portfolio selection through data studies and simulation, whereas in this paper we concentrate on the theoretical asymptotic behavior of the optimal solutions and of the risk estimation.

5.
Receiver operating characteristic (ROC) curves are often used to study the two-sample problem in medical studies. However, most data in medical studies are censored. A natural estimator is usually based on the Kaplan-Meier estimator. In this paper we propose a smoothed estimator of the ROC curve for censored data based on kernel techniques. The large-sample properties of the smoothed estimator are established. Moreover, deficiency is considered in order to compare the proposed smoothed estimator with the empirical one based on the Kaplan-Meier estimator. It is shown that the smoothed estimator outperforms the direct empirical estimator under the criterion of deficiency. A simulation study is conducted and a real data set is analyzed.
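The "natural estimator" referred to above plugs Kaplan-Meier survival estimates for the two samples into the ROC curve. A minimal Kaplan-Meier sketch is given below; the kernel smoothing and deficiency comparison of the paper are not reproduced.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) estimate of the survival function.
    events: 1 = event observed, 0 = right-censored.
    Returns the list of (event time, S(t)) steps."""
    t = np.asarray(times, dtype=float)
    d = np.asarray(events, dtype=int)
    s = 1.0
    steps = []
    for ti in np.unique(t[d == 1]):
        at_risk = int(np.sum(t >= ti))             # still under observation
        deaths = int(np.sum((t == ti) & (d == 1)))
        s *= 1.0 - deaths / at_risk                # product-limit update
        steps.append((float(ti), s))
    return steps
```

Censored observations never trigger a step but do shrink the risk set, which is what distinguishes the estimator from the plain empirical survival function.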

6.
Non-random missing data pose serious problems in longitudinal studies. The binomial distribution parameter becomes unidentifiable under non-ignorable missing data without auxiliary information or assumptions. Existing methods are mostly based on the log-linear regression model. In this article, a model is proposed for longitudinal data with non-ignorable non-response, using the pre-test baseline data to improve the identifiability of the post-test parameter. Furthermore, we derive the identified estimate (IE), the maximum likelihood estimate (MLE) and its associated variance for the post-test parameter. A simulation study based on the proposed model shows that the approach gives promising results.

7.
Line transect sampling is a very useful method for surveying wildlife populations. We propose confidence interval estimation for the density D of a biological population based on a sequential design, where the survey area is occupied by a population of unknown size. A stopping rule is defined through a kernel-based estimator of the density function of the perpendicular distance data. With this stopping rule, we construct several confidence intervals for D by different procedures, and apply bias reduction techniques to modify them. These intervals attain the desired coverage probability as the bandwidth in the stopping rule approaches zero. A simulation study illustrates the performance of the proposed sequential kernel procedure.

8.
Hierarchical linear regression models for conditional quantiles
Quantile regression has several useful features and is therefore gradually developing into a comprehensive approach to the statistical analysis of linear and nonlinear response models, but it cannot deal effectively with data that have a hierarchical structure. In practice, such data hierarchies are neither accidental nor ignorable; they are a common phenomenon. Ignoring the hierarchical structure risks overlooking the importance of group effects and may render many traditional statistical techniques for studying data relationships invalid. Hierarchical models, on the other hand, take the hierarchical data structure into account and have many applications in statistics, ranging from overdispersion to the construction of min-max estimators. However, hierarchical models are essentially mean regression and therefore cannot characterize the entire conditional distribution of a dependent variable given high-dimensional covariates; furthermore, the estimated coefficient vector (marginal effects) is sensitive to outlying observations of the dependent variable. In this article, a new approach is developed that is based on Gauss-Seidel iteration and takes full advantage of both quantile regression and hierarchical models. On the theoretical front, we study the asymptotic properties of the new method, obtaining simple conditions for n^(1/2)-convergence and asymptotic normality. We also illustrate the technique with real hierarchical educational data and explain the results.
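The quantile-regression building block minimizes the check (pinball) loss rho_tau(u) = u (tau - 1[u < 0]). A single-level sketch via subgradient descent is shown below; the article's Gauss-Seidel hierarchical estimator is not reproduced, and the step-size schedule is an assumption.

```python
import numpy as np

def quantile_regression(X, y, tau=0.5, eta0=0.5, n_iter=20000):
    """Linear quantile regression by subgradient descent on the check
    loss, with diminishing steps eta0/sqrt(t) and averaging of the
    last half of the iterates for stability. Illustrative only."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    avg = np.zeros_like(beta)
    half = n_iter // 2
    for t in range(1, n_iter + 1):
        u = y - X1 @ beta
        # subgradient of mean check loss: -x_i * (tau - 1[u_i < 0]) averaged
        g = -X1.T @ (tau - (u < 0).astype(float)) / len(y)
        beta = beta - (eta0 / np.sqrt(t)) * g
        if t > half:
            avg += beta
    return avg / (n_iter - half)
```

With tau = 0.5 this is median regression; other tau values trace out the conditional quantiles rather than the conditional mean.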

9.
Xu Kai, Cao Mingxiang. Science China Mathematics, 2021, 64(10): 2327-2356
We use distance covariance to introduce novel consistent tests of heteroscedasticity for nonlinear regression models in multidimensional spaces. The proposed tests require no user-defined regularization, are simple to implement from only the pairwise distances between sample points, and are applicable even with non-normal errors and many covariates in the regression model. We establish the asymptotic distributions of the proposed test statistics under the null and alternative hypotheses and under a sequence of local alternatives converging to the null at the fastest possible parametric rate. In particular, we focus on whether and how estimation of the finite-dimensional unknown parameter vector in the regression functions affects the distribution theory. It turns out that the asymptotic null distributions of the suggested test statistics depend on the data-generating process, so a bootstrap scheme and its validity are considered. Simulation studies demonstrate the versatility of our tests in comparison with the score test, the Cramér-von Mises test, the Kolmogorov-Smirnov test and the Zheng-type test. We also use the ultrasonic reference block data set from the US National Institute of Standards and Technology to illustrate the practicability of our proposals.
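The pairwise-distance ingredient of such tests is the sample distance covariance of Székely, Rizzo and Bakirov, which is zero in the population exactly under independence. A minimal univariate sketch follows; the test statistics and bootstrap calibration of the paper are not reproduced.

```python
import numpy as np

def dcov(x, y):
    """Sample distance covariance for univariate x, y: double-centre the
    pairwise distance matrices and average their elementwise product
    (the average is always nonnegative, so the square root is safe)."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return float(np.sqrt(np.mean(A * B)))
```

Unlike Pearson correlation, distance covariance detects nonlinear dependence: for y = x^2 with symmetric x the correlation is zero but dcov is not.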

10.
We present our recent work on both linear and nonlinear data reduction methods and algorithms: for the linear case we discuss results on the structure analysis of the SVD of column-partitioned matrices and on sparse low-rank approximation; for the nonlinear case we investigate methods for nonlinear dimensionality reduction and manifold learning. The problems we address have attracted a great deal of interest in data mining and machine learning.
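For the linear case, the basic tool is the truncated SVD, which by the Eckart-Young theorem gives the best rank-k approximation in both the Frobenius and spectral norms. The sketch below shows only this generic building block, not the authors' partitioned-matrix or sparse algorithms.

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A (Eckart-Young): keep the k largest
    singular triplets of the SVD A = U diag(s) V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

The Frobenius approximation error equals the root-sum-of-squares of the discarded singular values, which makes the quality of the reduction easy to monitor.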

11.
Least squares estimation based on the Euclidean and Lebesgue distances between fuzzy data is used to study parameter estimation in a fuzzy linear regression model under case deletion, and the parameter estimates under the two distances are compared. The model takes real-valued inputs and fuzzy outputs. A diagnostic statistic, the estimated standard error of the regression equation, is constructed to test for highly influential points and outliers in the observed data. Finally, by identifying highly influential points and outliers in real data, we show that the proposed statistic is effective.

12.
The influence problem for weighted linear least squares estimation is discussed using the Pena distance. An expression for the Pena distance of the weighted least squares estimator is obtained and its properties are discussed, yielding a method for detecting high-leverage outliers. The performance of the Pena distance is compared with that of the Cook distance, and it is shown that under certain conditions the Pena distance outperforms the Cook distance. Numerical experiments verify the effectiveness of the method.

13.
胡江. 《工科数学》, 2012, (5): 80-85
Influence analysis for nonlinear regression models based on the Pena distance statistic is discussed. A Pena distance formula for nonlinear regression models is obtained, and conclusions are drawn about its analytic properties and its ability to detect high-leverage outliers. It is shown that under certain conditions the Pena distance outperforms the Cook distance in detecting outliers, and for high-leverage outliers in particular its advantage is more pronounced. Results on real data verify the effectiveness of the method.

14.
INFLUENCE ANALYSIS ON EXPONENTIAL NONLINEAR MODELS WITH RANDOM EFFECTS
This paper presents a unified diagnostic method for exponential nonlinear models with random effects based upon the joint likelihood given by Robinson in 1991. The authors show that the case deletion model is equivalent to the mean shift outlier model. From this point of view, several diagnostic measures, such as the Cook distance and score statistics, are derived.

15.
Influence Analysis of Nonlinear Models with Random Effects
In this paper, a unified diagnostic method for nonlinear models with random effects based upon the joint likelihood given by Robinson in 1991 is presented. It is shown that the case deletion model is equivalent to the mean shift outlier model. From this point of view, several diagnostic measures, such as the Cook distance and score statistics, are derived. Cook's local influence measure is also presented. A numerical example illustrates the method.

16.
With the rapid development of information technology, each data record carries ever richer information, so data inevitably contain outliers, and the chance of outliers grows with the dimension. Traditional principal component clustering is particularly sensitive to outliers. Principal component clustering based on the MCD estimator guards against outliers, but in high dimensions the bias of the MCD estimator becomes too large and its robustness drops significantly; moreover, the MCD estimator fails when the dimension exceeds the number of observations. This paper therefore proposes a robust principal component clustering method based on the MRCD estimator. Numerical simulations and an empirical analysis show that principal component clustering based on the MRCD estimator outperforms both traditional principal component clustering and the MCD-based version, and that the MRCD estimator is especially effective when the dimension exceeds the sample size.
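The concentration idea behind the (Fast)MCD estimator can be sketched as follows. This crude version is only illustrative: the real MCD algorithm uses many random restarts and a reweighting step, and MRCD additionally shrinks the subset covariance toward a regularised target so that the estimate stays well-defined when the dimension exceeds the sample size.

```python
import numpy as np

def trimmed_location_scatter(X, alpha=0.75, n_iter=20):
    """Concentration steps in the spirit of FastMCD: repeatedly keep the
    h = alpha*n points with the smallest Mahalanobis distances and
    re-estimate the mean and covariance from that subset."""
    n, p = X.shape
    h = int(alpha * n)
    idx = np.arange(n)
    for _ in range(n_iter):
        mu = X[idx].mean(axis=0)
        S = np.cov(X[idx], rowvar=False) + 1e-8 * np.eye(p)
        Xi = X - mu
        d = np.einsum('ij,ij->i', Xi @ np.linalg.inv(S), Xi)  # Mahalanobis^2
        idx = np.argsort(d)[:h]
    return mu, S
```

Robust principal components are then the eigenvectors of the returned scatter matrix, so gross outliers no longer hijack the leading component.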

17.
Statistical Diagnostics for Spatially Varying-Coefficient Models
Spatially varying-coefficient models have been widely applied as an effective method for analyzing spatial data. This paper studies statistical diagnostics and influence analysis for these models. First, we define a Cook statistic based on the case deletion model; second, we discuss outlier testing based on the mean shift model.

18.
Several diagnostic methods are proposed for functional regression models with Gaussian-process errors. First, estimates of the regression coefficient functions are derived on a spline basis. Next, the equivalence of the case deletion model and the mean shift model is proved. Three diagnostic methods, residual analysis, the Cook distance and the likelihood distance, are then studied to detect outlying and influential data. Finally, a simulated example and a real example illustrate the effectiveness of the methods.

19.
To better fit skewed data and fully extract the information they contain, a mode regression model is built for skew-normal data, and statistical diagnostics for the model are studied using the Pena distance statistic; an expression for the Pena distance of the mode regression model and a method for diagnosing high-leverage outliers are obtained. Maximum likelihood estimates of the model parameters are derived via the EM algorithm and gradient descent, and the likelihood distance, Cook distance and Pena distance statistics are computed under the case deletion model to draw diagnostic plots. Monte Carlo simulations and a real-data comparison show that the proposed method is effective and that, under certain conditions, the Pena distance outperforms the likelihood distance and the Cook distance in diagnosing outliers and influential points.

20.
A geometric structure in Euclidean space is constructed for nonlinear dispersion models. On this basis, curvature measures for the mean shift model are studied, and second-order approximation formulas for the corresponding diagnostic statistics, such as the Cook distance and the likelihood distance, are derived.
