Similar Articles

20 similar articles found.
1.
Due to the limitation of computational resources, traditional statistical methods are no longer applicable to large data sets. Subsampling is a popular method which can significantly reduce the computational burden. This paper considers a subsampling strategy based on the least absolute relative error in the multiplicative model for massive data. In addition, we employ the random weighting and least squares methods to handle the problem that the asymptotic covariance of the estimator is difficult to estimate directly. Moreover, comparisons among the least absolute relative error, least absolute deviation and least squares estimators under the optimal subsampling strategy are given in simulation studies and real examples.
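A minimal Python sketch of the idea, with simplifications: uniform rather than optimal subsampling, a one-parameter multiplicative model, and a grid search instead of a proper optimizer. All names are illustrative, not the authors' code.

```python
import math
import random

def lare_loss(beta, xs, ys):
    # Least-absolute-relative-error criterion for the multiplicative model
    # y = exp(beta * x) * eps: sum over points of |y - yhat|/y + |y - yhat|/yhat.
    total = 0.0
    for x, y in zip(xs, ys):
        yhat = math.exp(beta * x)
        r = abs(y - yhat)
        total += r / y + r / yhat
    return total

def subsample_lare(xs, ys, m, grid, seed=0):
    # Uniform subsampling of m points, then a 1-D grid search for beta.
    rng = random.Random(seed)
    idx = rng.sample(range(len(xs)), m)
    sx = [xs[i] for i in idx]
    sy = [ys[i] for i in idx]
    return min(grid, key=lambda b: lare_loss(b, sx, sy))

# Synthetic multiplicative data with true beta = 0.5.
rng = random.Random(7)
xs = [rng.uniform(0.0, 2.0) for _ in range(1000)]
ys = [math.exp(0.5 * x) * math.exp(rng.gauss(0.0, 0.1)) for x in xs]
beta_hat = subsample_lare(xs, ys, 200, [i * 0.05 for i in range(21)], seed=3)
```

On 200 of the 1000 points, the grid minimizer lands near the true slope; an optimal (non-uniform) subsampling rule would weight points by their influence instead.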

2.
This article considers Monte Carlo integration under rejection sampling or Metropolis-Hastings sampling. Each algorithm involves accepting or rejecting observations from proposal distributions other than a target distribution. While taking a likelihood approach, we basically treat the sampling scheme as a random design, and define a stratified estimator of the baseline measure. We establish that the likelihood estimator has no greater asymptotic variance than the crude Monte Carlo estimator under rejection sampling or independence Metropolis-Hastings sampling. We employ a subsampling technique to reduce the computational cost, and illustrate with three examples the computational effectiveness of the likelihood method under general Metropolis-Hastings sampling.
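The accept/reject mechanism and the crude Monte Carlo estimator that the likelihood estimator is compared against can be sketched as follows, for a toy Beta(2,2) target with a uniform proposal (illustrative only, not the paper's code):

```python
import random

def rejection_sample(target_pdf, proposal_sampler, proposal_pdf, M, n, rng):
    # Accept a proposal x with probability target_pdf(x) / (M * proposal_pdf(x)),
    # where M bounds the density ratio so the acceptance probability is <= 1.
    out = []
    while len(out) < n:
        x = proposal_sampler(rng)
        if rng.random() < target_pdf(x) / (M * proposal_pdf(x)):
            out.append(x)
    return out

# Target: Beta(2,2), density 6x(1-x) on (0,1); proposal: Uniform(0,1); M = 1.5.
rng = random.Random(1)
draws = rejection_sample(lambda x: 6 * x * (1 - x),
                         lambda r: r.random(),
                         lambda x: 1.0,
                         1.5, 5000, rng)
crude_mc = sum(draws) / len(draws)  # crude Monte Carlo estimate of E[X] = 0.5
```

The likelihood approach of the paper reuses the rejected proposals as well; the crude estimator above discards them, which is the source of its larger asymptotic variance.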

3.
Abstract

All known robust location and scale estimators with high breakdown point for multivariate samples are very expensive to compute. In practice, this computation has to be carried out using an approximate subsampling procedure. In this article we describe an alternative subsampling scheme, applicable to both the Stahel-Donoho estimator and the minimum volume ellipsoid estimator, with the property that the number of subsamples required can be substantially reduced with respect to the standard subsampling procedures used in both cases. We also discuss some bias and variability properties of the estimator obtained from the proposed subsampling process.

4.
Data sets in high-dimensional spaces are often concentrated near low-dimensional sets. Geometric Multi-Resolution Analysis (GMRA; Allard, Chen, Maggioni, 2012) was introduced as a method for approximating, in a robust, multiscale fashion, a low-dimensional set around which data may be concentrated, while also providing a dictionary for sparse representation of the data; moreover, the procedure is computationally efficient. We introduce an estimator, constructed from the GMRA approximations, for the low-dimensional sets supporting the data. We exhibit (near-optimal) finite-sample bounds on its performance and demonstrate the robustness of this estimator with respect to noise and model error. In particular, our results imply that if the data are supported on a low-dimensional manifold, the proposed sparse representations result in an error that depends only on the intrinsic dimension of the manifold. (© 2014 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim)

5.
The phenomenon of nonresponse in a sample survey reduces the precision of parameter estimates and introduces bias. Several methods have been developed to compensate for these effects. An important technique is the double sampling scheme introduced by Hansen and Hurwitz (J. Am. Stat. Assoc. 41, 517–529, 1946), which relies on subsampling the nonrespondents and repeating efforts to collect data from the subsampled units. Several generalizations of this procedure have been proposed, including the application of arbitrary sampling designs considered by Särndal et al. (Model Assisted Survey Sampling, 1992). Under the assumption of complete response in the second phase, the population mean estimator constructed using data from both phases is unbiased. In this paper the properties of the mean value estimator under two-phase sampling are investigated for the case where the above assumption is not met. Expressions for the bias and variance are obtained for a general two-phase sampling procedure involving arbitrary sampling designs in both phases. Stochastic nonresponse governed by separate response distributions in the two phases is assumed. Some special cases are discussed.
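A toy version of the Hansen-Hurwitz double sampling estimator, under the idealized assumption of complete response in the follow-up phase (names and data are illustrative):

```python
import random

def hansen_hurwitz_mean(respondents, nonrespondents, frac, rng):
    # Double sampling for nonresponse: follow up a random subsample of the
    # nonrespondents and weight each group's mean by its share of the sample.
    n1, n2 = len(respondents), len(nonrespondents)
    m = max(1, round(frac * n2))
    followup = rng.sample(nonrespondents, m)
    ybar1 = sum(respondents) / n1
    ybar2 = sum(followup) / m
    return (n1 * ybar1 + n2 * ybar2) / (n1 + n2)

# 60 respondents with value 10 and 40 nonrespondents with value 20:
# the estimator reproduces the full-sample mean (60*10 + 40*20)/100 = 14.
est = hansen_hurwitz_mean([10.0] * 60, [20.0] * 40, 0.5, random.Random(0))
```

The paper's contribution is precisely what happens to the bias and variance of this weighted mean when the follow-up phase itself suffers nonresponse.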

6.
In sampling theory, the traditional ratio estimator is the most common estimator of the population mean when the correlation between the study and auxiliary variables is strongly positive. We introduce a new ratio-type estimator based on the order statistics of a simple random sample. We show that this new estimator is considerably more efficient than the traditional ratio estimator under non-normality, and remarkably robust to data anomalies such as the presence of outliers.
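The traditional ratio estimator referred to above is simple to state in code (illustrative sketch; the paper's order-statistics variant is not reproduced here):

```python
def ratio_estimator(y, x, x_pop_mean):
    # Traditional ratio estimator of the population mean of y:
    # (ybar / xbar) * known population mean of x; the sums cancel the 1/n.
    return (sum(y) / sum(x)) * x_pop_mean

# Sample ratio is 12/6 = 2; with a known population mean of x of 2.0,
# the estimated population mean of y is 4.0.
est = ratio_estimator([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], 2.0)
```

Its efficiency hinges on the sample ratio being stable, which is exactly what outliers in y or x destroy, motivating a version built on order statistics.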

7.

We study the problem of drift estimation for two-scale continuous time series. We set ourselves in the framework of overdamped Langevin equations, for which a single-scale surrogate homogenized equation exists. In this setting, estimating the drift coefficient of the homogenized equation requires pre-processing of the data, often in the form of subsampling; this is because the two-scale equation and the homogenized single-scale equation are incompatible at small scales, generating mutually singular measures on the path space. We avoid subsampling and work instead with filtered data, found by application of an appropriate kernel function, and compute maximum likelihood estimators based on the filtered process. We show that the estimators we propose are asymptotically unbiased and demonstrate numerically the advantages of our method with respect to subsampling. Finally, we show how our filtered data methodology can be combined with Bayesian techniques and provide a full uncertainty quantification of the inference procedure.
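For the single-scale case, the drift MLE that the filtered-data estimators build on can be sketched for an Ornstein-Uhlenbeck-type equation dX = -aX dt + dW, simulated by Euler-Maruyama (a sketch of the classical estimator only, not the authors' two-scale or filtered construction):

```python
import math
import random

def drift_mle(path, dt):
    # Discretized maximum likelihood estimator of a in dX = -a*X dt + dW:
    # a_hat = -sum X_i (X_{i+1} - X_i) / (dt * sum X_i^2).
    num = sum(x * (y - x) for x, y in zip(path, path[1:]))
    den = dt * sum(x * x for x in path[:-1])
    return -num / den

# Euler-Maruyama simulation with true a = 1.
rng = random.Random(42)
dt, x = 0.01, 0.0
path = [x]
for _ in range(20000):
    x += -1.0 * x * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
    path.append(x)
a_hat = drift_mle(path, dt)
```

Applied naively to the fast component of a two-scale path, this estimator is asymptotically biased; the paper's filtered-data approach replaces the raw path by a kernel-smoothed one before forming the same likelihood quantities.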


8.
We consider an infinite-dimensional isotonic regression problem which is an extension of the suitably revised classical isotonic regression problem. Given p-summable data, for p finite and at least one, there exists an optimal estimator to our problem. For p greater than one, this estimator is unique and is the limit in the p-norm of the sequence of unique estimators in canonical finite-dimensional truncations of our problem. However, for p equal to one, our problem, as well as the finite-dimensional truncations, admit multiple optimal estimators in general. In this case, the sequence of optimal estimator sets to the truncations converges to the optimal estimator set of the infinite problem in the sense of Kuratowski. Moreover, the selection of natural best optimal estimators to the truncations converges in the 1-norm to an optimal estimator of the infinite problem.
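For the p = 2 finite-dimensional truncation, the unique isotonic estimator is computed by the classical pool-adjacent-violators algorithm, sketched here (the infinite-dimensional construction of the paper is not reproduced):

```python
def pava(y, w=None):
    # Pool Adjacent Violators: the unique L2 isotonic regression in finite
    # dimensions. Merges adjacent blocks whose weighted means violate
    # monotonicity until the fitted sequence is nondecreasing.
    n = len(y)
    w = w or [1.0] * n
    vals, wts, counts = [], [], []
    for yi, wi in zip(y, w):
        vals.append(yi)
        wts.append(wi)
        counts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, w2, c2 = vals.pop(), wts.pop(), counts.pop()
            vals[-1] = (wts[-1] * vals[-1] + w2 * v2) / (wts[-1] + w2)
            wts[-1] += w2
            counts[-1] += c2
    out = []
    for v, c in zip(vals, counts):
        out.extend([v] * c)
    return out

fitted = pava([3, 1, 2])  # the violating pair (3, 1) pools to mean 2
```

The abstract's limit statement says that running this on ever-longer truncations of p-summable data converges, in the p-norm, to the optimal estimator of the infinite problem.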

9.

In this paper, we investigate the quantile varying coefficient model for longitudinal data, where the unknown nonparametric functions are approximated by polynomial splines and the estimators are obtained by minimizing the quadratic inference function. The theoretical properties of the resulting estimators are established, and they achieve the optimal convergence rate for the nonparametric functions. Since the objective function is non-smooth, an estimation procedure is proposed that uses induced smoothing and we prove that the smoothed estimator is asymptotically equivalent to the original estimator. Moreover, we propose a variable selection procedure based on the regularization method, which can simultaneously estimate and select important nonparametric components and has the asymptotic oracle property. Extensive simulations and a real data analysis show the usefulness of the proposed method.


10.
This paper is a continuation of the work in [11] and [2] on the problem of estimating, by a linear estimator, N unobservable input vectors, undergoing the same linear transformation, from noise-corrupted observable output vectors. Whereas in the aforementioned papers only the matrix representing the linear transformation was assumed uncertain, here we are concerned with the case in which the second-order statistics of the noise vectors (i.e., their covariance matrices) are also subject to uncertainty. We seek a robust mean-squared error estimator immune to both sources of uncertainty. We show that the optimal robust mean-squared error estimator has a special form represented by an elementary block circulant matrix, and moreover, when the uncertainty sets are ellipsoidal-like, the problem of finding the optimal estimator matrix can be reduced to solving an explicit semidefinite programming problem whose size is independent of N. The research was partially supported by BSF grant #2002038.

11.
Abstract

The existence of outliers in a data set, and how to deal with them, is an important problem in statistics. The minimum volume ellipsoid (MVE) estimator is a robust estimator of location and covariate structure; however, its use has been limited because there are few computationally attractive methods. Determining the MVE consists of two parts: finding the subset of points to be used in the estimate, and finding the ellipsoid that covers this set. This article addresses the first problem. Our method also allows us to compute the minimum covariance determinant (MCD) estimator. The proposed method of subset selection is called the effective independence distribution (EID) method, which chooses the subset by minimizing determinants of matrices containing the data. This method is deterministic, yielding reproducible estimates of location and scatter for a given data set. The EID method of finding the MVE is applied to several regression data sets where the true estimate is known. Results show that the EID method, when applied to these data sets, produces the subset of data more quickly than conventional procedures, with less than 6% relative error in the estimates. We also give timing results illustrating the feasibility of our method for larger data sets. For the case of 10,000 points in 10 dimensions, the compute time is under 25 minutes.

12.
The problem of estimating regression coefficients from observations at a finite number of properly designed sampling points is considered when the error process has correlated values and no quadratic mean derivative. Sacks and Ylvisaker (1966, Ann. Math. Statist., 39, 66–89) found an asymptotically optimal design for the best linear unbiased estimator (BLUE). Here, the goal is to find an asymptotically optimal design for a simpler estimator. This is achieved by properly adjusting the median sampling design and the simpler estimator introduced by Schoenfelder (1978, Institute of Statistics Mimeo Series No. 1201, University of North Carolina, Chapel Hill). Examples with stationary (Gauss-Markov) and nonstationary (Wiener) error processes and with linear and nonlinear regression functions are considered both analytically and numerically. Research supported by the Air Force Office of Scientific Research Contract No. 91-0030.

13.
The asymptotic properties of a family of minimum quantile distance estimators for randomly censored data sets are considered. These procedures produce an estimator of the parameter vector that minimizes a weighted L2 distance measure between the Kaplan-Meier quantile function and an assumed parametric family of quantile functions. Regularity conditions are provided which ensure that these estimators are consistent and asymptotically normal. An optimal weight function is derived for single-parameter families, which, for location/scale families, results in censored-sample analogs of estimators such as those suggested by Parzen.
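A bare-bones Kaplan-Meier estimator and the induced quantile function, which is the empirical object the minimum-distance procedure matches against the parametric family (ties and other censoring subtleties are ignored; illustrative only):

```python
def kaplan_meier(times, events):
    # Kaplan-Meier survival estimate: at each event time multiply the running
    # survival probability by (1 - 1/n_at_risk); censored points only shrink
    # the risk set. Returns (time, S(time)) pairs in time order.
    order = sorted(range(len(times)), key=lambda i: times[i])
    s, surv = 1.0, []
    n_at_risk = len(times)
    for i in order:
        if events[i]:
            s *= 1.0 - 1.0 / n_at_risk
        surv.append((times[i], s))
        n_at_risk -= 1
    return surv

def km_quantile(surv, p):
    # Smallest observed time t with F(t) = 1 - S(t) >= p.
    for t, s in surv:
        if 1.0 - s >= p:
            return t
    return surv[-1][0]

surv = kaplan_meier([1.0, 2.0, 3.0, 4.0], [True, True, True, True])
q50 = km_quantile(surv, 0.5)
```

With no censoring the curve steps down by 1/n at each event, so the median here is the second observation.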

15.
We introduce a method to minimize the mean square error (MSE) of an estimator which is derived from a classification. The method chooses an optimal discrimination threshold in the outcome of a classification algorithm and deals with the problem of unequal and unknown misclassification costs and class imbalance. The approach is applied to data from the MAGIC experiment in astronomy for choosing an optimal threshold for signal-background separation. In this application one is interested in estimating the number of signal events in a data set with a very unfavorable signal-to-background ratio. Minimizing the MSE of the estimator is a rather general approach which can be adapted to various other applications in which one wants to derive an estimator from a classification. If the classification depends on other or additional parameters than the discrimination threshold, MSE minimization can be used to optimize these parameters as well. We illustrate this by optimizing the parameters of logistic regression, leading to relevant improvements over the current approach used in the MAGIC experiment.
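A simplified version of MSE-based threshold selection, assuming labeled calibration scores, known class sizes, a naive passing-count estimator, and a Poisson-like variance term (a sketch under these assumptions, not the MAGIC analysis):

```python
def mse_threshold(sig_scores, bkg_scores, n_sig, n_bkg, thresholds):
    # For each candidate cut t, estimate the MSE of the naive signal-count
    # estimator N_hat = (number of events passing the cut):
    #   bias = n_bkg * fpr(t) - n_sig * (1 - tpr(t))
    #   var  ~ expected passing count (Poisson counting approximation)
    # and return the cut with the smallest bias^2 + var.
    best_t, best_mse = None, float("inf")
    for t in thresholds:
        tpr = sum(s > t for s in sig_scores) / len(sig_scores)
        fpr = sum(b > t for b in bkg_scores) / len(bkg_scores)
        bias = n_bkg * fpr - n_sig * (1.0 - tpr)
        var = n_sig * tpr + n_bkg * fpr
        mse = bias ** 2 + var
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t

# Well-separated toy scores: the middle cut keeps all signal and no background.
t_star = mse_threshold([0.7, 0.8, 0.9], [0.1, 0.2, 0.3], 100, 1000, [0.0, 0.5, 0.95])
```

A loose cut inflates the variance and background bias, a tight cut throws away signal; the MSE criterion trades the two off explicitly instead of fixing misclassification costs in advance.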

16.
The calibration method has been widely discussed in the recent literature on survey sampling, and calibration estimators are routinely computed by many survey organizations. The calibration technique was introduced in [12] to estimate linear parameters such as the mean or total. Recently, some authors have applied the calibration technique to estimate the finite population distribution function and quantiles. The computationally simpler method in [14] is built by means of constraints that require the use of a fixed value t0. The precision of the resulting calibration estimator depends on the selected point t0. In the present paper, we study the problem of determining the optimal value t0 that gives the best estimation under simple random sampling without replacement. A limited simulation study shows that the improvement of this optimal calibrated estimator over possible alternatives can be substantial.

17.
Histogram and kernel estimators are usually regarded as the two main classical data-based nonparametric tools for estimating the underlying density function of a given data set. In this paper we integrate them and define a histogram-kernel error based on the integrated squared error between the histogram and the binned kernel density estimator, and then explore its asymptotic properties. As shown in this paper, the histogram-kernel error depends only on the choice of bin width and on the data, for a given prior kernel density. The asymptotically optimal bin width is derived by minimizing the mean histogram-kernel error. By comparison with Scott's optimal bin width formula for a histogram, a new method is proposed to construct a data-based histogram without knowledge of the underlying density function. A Monte Carlo study verifies the usefulness of this method for different kinds of density functions and sample sizes.
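Scott's normal-reference rule, the baseline the proposed bin width is compared against, takes only a few lines (illustrative sketch):

```python
import statistics

def scott_bin_width(data):
    # Scott's normal-reference rule for the histogram bin width:
    # h = 3.49 * s * n^(-1/3), with s the sample standard deviation.
    return 3.49 * statistics.stdev(data) * len(data) ** (-1.0 / 3.0)

# For the 8 points 0..7, s = sqrt(6) and n^(-1/3) = 1/2.
h = scott_bin_width(list(range(8)))
```

Scott's rule plugs a normal density into the asymptotic mean integrated squared error; the paper instead minimizes the mean histogram-kernel error, which needs no assumption on the underlying density.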

18.
It is shown that the subsampling methodology can be used to develop unit root tests when the noise sequence is heavy-tailed with infinite variance. Using least-squares residuals, we construct processes which approximately satisfy the null hypothesis and then, using subsampling, we approximate the null distribution of test statistics. We establish the asymptotic validity of this method and demonstrate its applicability in finite samples by means of a simulation study and a data example.

19.
Kernelized support vector machines (SVMs) belong to the most widely used classification methods. However, in contrast to linear SVMs, the computation time required to train such a machine becomes a bottleneck when facing large data sets. In order to mitigate this shortcoming of kernel SVMs, many approximate training algorithms have been developed. While most of these methods claim to be much faster than the state-of-the-art solver LIBSVM, a thorough comparative study is missing. We aim to fill this gap. We choose several well-known approximate SVM solvers and compare their performance on a number of large benchmark data sets. Our focus is to analyze the trade-off between prediction error and runtime for different learning and accuracy parameter settings. This includes simple subsampling of the data, the poor man's approach to handling large-scale problems. We employ model-based multi-objective optimization, which allows us to tune the parameters of the learning machine and the solver over the full range of accuracy/runtime trade-offs. We analyze (differences between) solvers by studying and comparing the Pareto fronts formed by the two objectives, classification error and training time. Unsurprisingly, given more runtime most solvers are able to find more accurate solutions, i.e., achieve a higher prediction accuracy. It turns out that LIBSVM with subsampling of the data is a strong baseline. Some solvers systematically outperform others, which allows us to give concrete recommendations of when to use which solver.

20.
In this paper we discuss the problem of estimating the common mean of a bivariate normal population based on paired data as well as data on one of the marginals. Two double sampling schemes with the second stage sampling being either a simple random sampling (SRS) or a ranked set sampling (RSS) are considered. Two common mean estimators are proposed. It is found that under normality, the proposed RSS common mean estimator is always superior to the proposed SRS common mean estimator and other existing estimators such as the RSS regression estimator proposed by Yu and Lam (1997, Biometrics, 53, 1070–1080). The problem of estimating the mean Reid Vapor Pressure (RVP) of regular gasoline based on field and laboratory data is considered.
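As a toy illustration of the ranked-set-sampling mechanic underlying the RSS estimators above (judgment ranking is replaced by exact ranking, and this is one balanced cycle, not the authors' common-mean estimator):

```python
import random

def rss_mean(population, k, rng):
    # One cycle of balanced ranked set sampling: draw k independent sets of
    # k units, rank each set (here by the value itself, standing in for
    # judgment ranking), and measure only the i-th smallest of the i-th set.
    measured = []
    for i in range(k):
        ranked = sorted(rng.sample(population, k))
        measured.append(ranked[i])
    return sum(measured) / k

est = rss_mean(list(range(100)), 5, random.Random(0))
```

Each order statistic is measured exactly once per cycle, so the k measurements spread over the distribution; this stratification-by-ranking is what makes the RSS mean more efficient than an SRS mean of the same size when ranking is cheap.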
