Similar Literature
20 similar documents found (search time: 31 ms)
1.
We consider a high-dimensional partially linear model and propose a method that simultaneously performs variable selection and estimation of the parameters of interest. The Dantzig selector is applied to the linear part and to the derivatives of each order of the nonparametric part, yielding estimators of both the parametric and nonparametric components, with the parametric estimator being sparse. Non-asymptotic theoretical bounds for the estimators are established. Finally, a simulation study examines the finite-sample properties.

2.
Clustering is one of the most widely used procedures in the analysis of microarray data, for example with the goal of discovering cancer subtypes based on observed heterogeneity of genetic marks between different tissues. It is well known that in such high-dimensional settings, the existence of many noise variables can overwhelm the few signals embedded in the high-dimensional space. We propose a novel Bayesian approach based on a Dirichlet process with a sparsity prior that simultaneously performs variable selection and clustering, and also discovers variables that distinguish only a subset of the cluster components. Unlike previous Bayesian formulations, we use the Dirichlet process (DP) both for clustering samples and for regularizing the high-dimensional mean/variance structure. To meet the computational challenge posed by this double use of the DP, we propose a sequential sampling scheme embedded within Markov chain Monte Carlo (MCMC) updates that improves on naive implementations of existing algorithms for DP mixture models. Our method is demonstrated in a simulation study and illustrated with the leukemia gene expression dataset.
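The double use of the DP above rests on its stick-breaking representation; as a hedged illustration of that building block only (not the paper's sampler), here is a truncated stick-breaking draw of DP mixture weights in numpy:

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """Truncated stick-breaking construction of Dirichlet process weights."""
    betas = rng.beta(1.0, alpha, size=n_atoms)            # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                              # mixture weights

rng = np.random.default_rng(0)
w = stick_breaking_weights(alpha=1.0, n_atoms=50, rng=rng)
print(w[:5], w.sum())   # weights decay quickly; total mass is close to 1
```

The truncation level (50 atoms) is an illustrative choice; in an MCMC sampler it would be set adaptively or replaced by a slice/retrospective scheme.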

3.
High-dimensional data have frequently been collected in many scientific areas, including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental to high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of the data may grow exponentially with the sample size. Such data have been called ultrahigh-dimensional in the literature. This work presents a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening under specific models and on the motivation for model-free feature screening procedures.

4.
Gaussian graphical models represent the underlying graph structure of conditional dependence between random variables, which can be determined using their partial correlation or precision matrix. In a high-dimensional setting, the precision matrix is estimated by penalized likelihood, where the penalization term controls the amount of sparsity in the precision matrix and fully characterizes the complexity and structure of the graph. The most commonly used penalization term is the L1 norm of the precision matrix scaled by the regularization parameter, which determines the trade-off between sparsity of the graph and fit to the data. In this article, we propose several procedures to select the regularization parameter in the estimation of graphical models that focus on reliably recovering the appropriate network structure of the graph. We conduct an extensive simulation study to show that the proposed methods produce useful results for different network topologies. The approaches are also applied in a high-dimensional case study of gene expression data with the aim of discovering the genes relevant to colon cancer. Using these data, we find graph structures that are verified to display significant biological gene associations. Supplementary material is available online.
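For comparison, scikit-learn's `GraphicalLassoCV` implements one standard likelihood-based cross-validation rule for choosing the regularization parameter; a minimal sketch on a chain-graph example (the selection criteria proposed in the article differ and target structure recovery):

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
# True precision matrix of a 5-node chain graph (tridiagonal, positive definite)
prec = np.eye(5) + np.diag(0.4 * np.ones(4), 1) + np.diag(0.4 * np.ones(4), -1)
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(prec), size=500)

model = GraphicalLassoCV().fit(X)            # cross-validated regularization parameter
edges = np.abs(model.precision_) > 1e-4      # estimated adjacency structure
print(model.alpha_, edges.sum())
```

The thresholding of `precision_` into an adjacency matrix is an illustrative post-processing step, not part of the library's API.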

5.
Trace regression models are widely used in applications involving panel data, images, genomic microarrays, etc., where high-dimensional covariates are often involved. However, the existing research involving high-dimensional covariates focuses mainly on the conditional mean model. In this paper, we extend the trace regression model to the quantile trace regression model, where the parameter matrix is simultaneously low-rank and row (column) sparse. The convergence rate of the penalized estimator is derived under mild conditions. Simulations, as well as a real data application, are carried out for illustration.

6.
The semilinear in-slide models (SLIMs) have been shown to be effective for normalizing microarray data [J. Fan, P. Tam, G. Vande Woude, Y. Ren, Normalization and analysis of cDNA micro-arrays using within-array replications applied to neuroblastoma cell response to a cytokine, Proceedings of the National Academy of Sciences (2004) 1135-1140]. Using a backfitting method, [J. Fan, H. Peng, T. Huang, Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency, Journal of the American Statistical Association, 471 (2005) 781-798] proposed profile least squares (PLS) estimation for the parametric and nonparametric components, but general asymptotic properties of their estimator were not developed. In this paper, we consider a new approach, two-stage estimation, which enables us to establish asymptotic normality for both the parametric and nonparametric component estimators. We further propose a plug-in bandwidth selector based on the asymptotic normality of the nonparametric component estimator. The proposed method also allows for modeling the aggregated-SLIMs case, where we show explicitly that taking the aggregated information into account improves both the parametric and nonparametric component estimators under the two-stage approach. Simulation studies illustrate the finite-sample performance of the proposed procedures.

7.
We propose in this article a unified approach to functional estimation problems based on possibly censored data. The general framework we define handles, for instance, density and hazard rate estimation based on randomly right-censored data, as well as regression. Given a collection of histograms, our estimation procedure consists of selecting the best histogram in that collection from the data, by minimizing a penalized least-squares-type criterion. For a general collection of histograms, we obtain nonasymptotic oracle-type inequalities. We then consider the collection of histograms built on partitions into dyadic intervals, a choice inspired by an approximation result due to DeVore and Yu. In that case, our estimator is also adaptive in the minimax sense over a wide range of smoothness classes that contain functions of inhomogeneous smoothness. Moreover, its computational complexity is only linear in the size of the sample.

8.
We provide in this paper a fully adaptive penalized procedure for selecting a covariance function among a collection of models, based on i.i.d. replications of the process observed at fixed points. To this end, we generalize the results of [3] and propose a data-driven penalty that yields an oracle inequality for the estimator. We show that this method extends the work of Baraud [1] to the matrix regression model.

9.
For high-dimensional supervised learning problems, often using problem-specific assumptions can lead to greater accuracy. For problems with grouped covariates, which are believed to have sparse effects both on a group and within group level, we introduce a regularized model for linear regression with ℓ1 and ℓ2 penalties. We discuss the sparsity and other regularization properties of the optimal fit for this model, and show that it has the desired effect of group-wise and within group sparsity. We propose an algorithm to fit the model via accelerated generalized gradient descent, and extend this model and algorithm to convex loss functions. We also demonstrate the efficacy of our model and the efficiency of our algorithm on simulated data. This article has online supplementary material.
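The combined ℓ1/ℓ2 penalty has a closed-form proximal operator, which is the core step of an accelerated generalized gradient algorithm of the kind described above: soft-threshold coordinate-wise, then shrink each group toward zero. A minimal numpy sketch (the function name and tuning values are illustrative):

```python
import numpy as np

def sgl_prox(beta, groups, lam1, lam2, step):
    """One proximal step for the penalty lam1*||b||_1 + lam2*sum_g ||b_g||_2."""
    # l1 part: coordinate-wise soft-thresholding
    b = np.sign(beta) * np.maximum(np.abs(beta) - step * lam1, 0.0)
    out = np.empty_like(b)
    for g in groups:                          # l2 part: group-wise shrinkage,
        norm = np.linalg.norm(b[g])           # which can zero out whole groups
        scale = max(1.0 - step * lam2 / norm, 0.0) if norm > 0 else 0.0
        out[g] = scale * b[g]
    return out

beta = np.array([0.05, -0.02, 2.0, 1.5, -0.3])
groups = [np.array([0, 1]), np.array([2, 3, 4])]
print(sgl_prox(beta, groups, lam1=0.1, lam2=0.1, step=1.0))
# the small first group is eliminated entirely; the second is kept but shrunk
```

In a full solver this step would be iterated with a gradient step on the loss and Nesterov acceleration.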

10.
We study the properties of the Lasso in the high-dimensional partially linear model where the number of variables in the linear part can be greater than the sample size. We use truncated series expansion based on polynomial splines to approximate the nonparametric component in this model. Under a sparsity assumption on the regression coefficients of the linear component and some regularity conditions, we derive the oracle inequalities for the prediction risk and the estimation error. We also provide sufficient conditions under which the Lasso estimator is selection consistent for the variables in the linear part of the model. In addition, we derive the rate of convergence of the estimator of the nonparametric function. We conduct simulation studies to evaluate the finite sample performance of variable selection and nonparametric function estimation.

11.
We propose a semiparametric Wald statistic to test the validity of logistic regression models based on case-control data. The test statistic is constructed using a semiparametric ROC curve estimator and a nonparametric ROC curve estimator. The statistic has an asymptotic chi-squared distribution and is an alternative to the Kolmogorov-Smirnov-type statistic proposed by Qin and Zhang in 1997, the chi-squared-type statistic proposed by Zhang in 1999, and the information matrix test statistic proposed by Zhang in 2001. The statistic is easy to compute in the sense that it requires none of the following: using a bootstrap method to find its critical values, partitioning the sample data, or inverting a high-dimensional matrix. We present some results on simulation and on the analysis of two real examples. Moreover, we discuss how to extend our statistic to a family of statistics and how to construct its Kolmogorov-Smirnov counterpart. This work was supported by the 11.5 Natural Scientific Plan (Grant No. 2006BAD09A04) and the Nanjing University Start Fund (Grant No. 020822410110).

12.

This paper is concerned with error density estimation in the high-dimensional sparse linear model, where the number of variables may be larger than the sample size. An improved two-stage refitted cross-validation procedure based on a random-splitting technique is used to obtain the residuals of the model, and then the traditional kernel density method is applied to estimate the error density. Under suitable sparsity conditions, large-sample properties of the estimator, including consistency, asymptotic normality, and the law of the iterated logarithm, are obtained. In particular, we give the relationship between the sparsity and the convergence rate of the kernel density estimator. The simulation results show that our error density estimator performs well. A real data example is presented to illustrate our methods.

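As a rough illustration of the two-stage idea (a single random split; the paper's improved refitted cross-validation procedure differs in detail): select variables on one half of the data, refit by least squares on the other half, and kernel-smooth the resulting residuals:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 300                              # p > n: high-dimensional sparse model
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)

# Stage 1: variable selection on one random half of the data
perm = rng.permutation(n)
half, other = perm[: n // 2], perm[n // 2:]
sel = np.flatnonzero(LassoCV(cv=5).fit(X[half], y[half]).coef_)

# Stage 2: least-squares refit on the other half; residuals feed a kernel density
ols = LinearRegression().fit(X[other][:, sel], y[other])
resid = y[other] - ols.predict(X[other][:, sel])
density = gaussian_kde(resid)                # traditional kernel density estimator
print(sel[:5], float(density(np.array([0.0]))[0]))
```

The full refitted cross-validation procedure would swap the roles of the two halves and combine the residuals; this sketch shows one direction only.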

13.
The possible existence of a change point in a sequence of temporally ordered functional data demands careful attention in its statistical analysis. Introducing a dynamic estimator of the covariance kernel, we propose a new methodology for testing for a change in the mean of temporally ordered functional data. Although a similar estimator has been used for covariance estimation in finite dimensions, we introduce it here for independent and weakly dependent functional data for the first time. From this viewpoint, the proposed covariance-kernel estimator is the more natural one when the sequence of functional data may possess a change point. We prove that the proposed test statistics are asymptotically pivotal under the null hypothesis and consistent under the alternative. Our testing procedures are shown to outperform existing ones in terms of power and to provide satisfactory results when applied to real data.

14.
Under the linear mixed-effects model, the analysis of variance (ANOVA) estimator and the spectral decomposition (SD) estimator play a very important role in constructing exact tests and generalized P-value pivotal quantities. Although the two estimators are based on different approaches, they share many desirable properties, such as unbiasedness and closed-form expressions. Using previously obtained results on the spectral decomposition of the covariance matrix, this paper reveals the relationship between the ANOVA and SD estimators in the general linear mixed-effects model with balanced data, and gives necessary and sufficient conditions for the equivalence of the two estimators under two covariance structures: the nested structure and the multi-way classification random-effects structure.

15.
Lin and Zhang (J. Roy. Statist. Soc. Ser. B 61 (1999) 381) proposed the generalized additive mixed model (GAMM) as a framework for the analysis of correlated data, where normally distributed random effects are used to account for correlation in the data, and proposed using double penalized quasi-likelihood (DPQL) to estimate the nonparametric functions in the model and marginal likelihood to estimate the smoothing parameters and variance components simultaneously. However, the normal distributional assumption for the random effects may not be realistic in many applications, and it is unclear how violation of this assumption affects ensuing inferences for GAMMs. For a particular class of GAMMs, we propose a conditional estimation procedure built on a conditional likelihood for the response given a sufficient statistic for the random effect, treating the random effect as a nuisance parameter; the procedure should thus be robust to its distribution. In extensive simulation studies, we assess the performance of this estimator under a range of conditions and use it as a basis for comparison with DPQL to evaluate the impact of violation of the normality assumption. The procedure is illustrated with an application to data from the Multicenter AIDS Cohort Study (MACS).

16.
In a high-dimensional linear regression model, we propose a new procedure for testing statistical significance of a subset of regression coefficients. Specifically, we employ the partial covariances between the response variable and the tested covariates to obtain a test statistic. The resulting test is applicable even if the predictor dimension is much larger than the sample size. Under the null hypothesis, together with boundedness and moment conditions on the predictors, we show that the proposed test statistic is asymptotically standard normal, which is further supported by Monte Carlo experiments. A similar test can be extended to generalized linear models. The practical usefulness of the test is illustrated via an empirical example on paid search advertising.

17.
The analysis of finite mixture models for exponential repeated data is considered. The mixture components correspond to different unknown groups of the statistical units. Dependency and variability of repeated data are taken into account through random effects. For each component, an exponential mixed model is thus defined. When considering parameter estimation in this mixture of exponential mixed models, the EM-algorithm cannot be directly used since the marginal distribution of each mixture component cannot be analytically derived. In this paper, we propose two parameter estimation methods. The first one uses a linearisation specific to the exponential distribution hypothesis within each component. The second approach uses a Metropolis–Hastings algorithm as a building block of a general MCEM-algorithm.

18.

In this article, we propose two classes of semiparametric single-index mixture regression models for model-based clustering. Unlike many semiparametric/nonparametric mixture regression models that can be applied only to low-dimensional predictors, the new semiparametric models can easily incorporate high-dimensional predictors into the nonparametric components. The proposed models are very general, and many recently proposed semiparametric/nonparametric mixture regression models are special cases of the new models. Backfitting estimates and the corresponding modified EM algorithms are proposed to achieve optimal convergence rates for both the parametric and nonparametric parts. We establish identifiability results for the two proposed models and investigate the asymptotic properties of the proposed estimation procedures. Simulation studies demonstrate the finite-sample performance of the proposed models, and two real data applications using the new models reveal some interesting findings.


19.
Random forests are a commonly used tool for classification and for ranking candidate predictors based on the so-called variable importance measures. These measures attribute scores to the variables reflecting their importance. A drawback of variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example those based on hypothesis testing, have been developed to address this problem. The existing testing approaches require the repeated computation of random forests. While for low-dimensional settings those approaches might be computationally tractable, for high-dimensional settings, typically including thousands of candidate predictors, computing time is enormous. In this article a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance, which is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues using simulation studies, which are based on real data from high-dimensional binary classification settings. The new approach controls the type I error and has at least comparable power at a substantially smaller computation time in the studies. Thus, it might be used as a computationally fast alternative to existing procedures for high-dimensional data settings where many variables do not carry any information. The new approach is implemented in the R package vita.
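A simplified sketch of the mirrored-null heuristic underlying such fast importance tests: non-positive importances can come only from uninformative variables, so they and their mirror images serve as a null distribution, with no refitting of forests. The vita package uses a modified, cross-validation-inspired importance; this sketch uses plain scikit-learn permutation importance and illustrative variable names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 300, 100                              # only the first 2 variables carry signal
X = rng.standard_normal((n, p))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(n) > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
imp = permutation_importance(rf, Xte, yte, n_repeats=5,
                             random_state=0).importances_mean

# Mirrored null: non-positive importances stem from uninformative variables
null = np.concatenate([imp[imp <= 0], -imp[imp <= 0]])
pvals = np.array([(null >= v).mean() for v in imp])
print(np.flatnonzero(pvals < 0.05))          # variables flagged as important
```

No forest is recomputed to form the null, which is the source of the computational savings described above.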

20.
Patilea and Rolin (Ann Stat 34(2):925–938, 2006) proposed a product-limit estimator of the survival function for twice censored data. In this article, based on a modified self-consistent (MSC) approach, we propose an alternative estimator, the MSC estimator. The asymptotic properties of the MSC estimator are derived. A simulation study is conducted to compare the performance between the two estimators. Simulation results indicate that the MSC estimator outperforms the product-limit estimator and its advantage over the product-limit estimator can be very significant when right censoring is heavy.
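For context, the classical one-sided analogue of these estimators is the Kaplan-Meier product-limit estimator for right-censored data; a minimal numpy sketch (this is not the twice-censored Patilea-Rolin estimator, nor the MSC estimator, and it ignores ties for simplicity):

```python
import numpy as np

def product_limit(times, events):
    """Kaplan-Meier product-limit estimator for right-censored data
    (events: 1 = observed failure, 0 = right-censored)."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    at_risk = len(t) - np.arange(len(t))          # subjects still at risk
    factors = np.where(d == 1, 1.0 - 1.0 / at_risk, 1.0)
    return t, np.cumprod(factors)                 # survival curve at event times

times = np.array([2.0, 3.0, 4.0, 5.0, 8.0])
events = np.array([1, 0, 1, 1, 1])
t, S = product_limit(times, events)
print(S)  # [0.8, 0.8, 0.5333..., 0.2666..., 0.0]
```

Each censored observation leaves the survival curve unchanged but reduces the risk set, which is exactly how censoring enters the product-limit form.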
