Similar articles (20 results)
1.
We present a very fast algorithm for general matrix factorization of a data matrix for use in the statistical analysis of high-dimensional data via latent factors. Such data are prevalent across many application areas and generate an ever-increasing demand for methods of dimension reduction in order to undertake the statistical analysis of interest. Our algorithm uses a gradient-based approach which can be used with an arbitrary loss function provided the latter is differentiable. The speed and effectiveness of our algorithm for dimension reduction are demonstrated in the context of supervised classification of some real high-dimensional data sets from the bioinformatics literature.
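As a hedged illustration of the kind of gradient-based factorization the abstract describes (not the authors' actual algorithm), the NumPy sketch below factorizes a data matrix X ≈ AB by gradient descent on a plain squared-error loss; the function name, rank, learning rate, and iteration count are all illustrative choices.

```python
import numpy as np

def fit_factors(X, rank=2, lr=1e-3, n_iter=5000, seed=0):
    """Factorize X (n x p) as A @ B with A (n x rank) and B (rank x p)
    by gradient descent on the squared-error loss 0.5 * ||X - AB||_F^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = 0.01 * rng.standard_normal((n, rank))
    B = 0.01 * rng.standard_normal((rank, p))
    for _ in range(n_iter):
        R = A @ B - X              # residual
        grad_A = R @ B.T           # gradient with respect to A
        grad_B = A.T @ R           # gradient with respect to B
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# toy usage: a 100 x 50 matrix of true rank 2 plus noise
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 50))
X += 0.1 * rng.standard_normal(X.shape)
A, B = fit_factors(X, rank=2)
print(np.linalg.norm(X - A @ B) / np.linalg.norm(X))   # relative reconstruction error
```

With a differentiable but non-quadratic loss, only the residual-to-gradient step would change, which is the flexibility the abstract emphasizes.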

2.
We propose fast and scalable statistical methods for the analysis of hundreds or thousands of high-dimensional vectors observed at multiple visits. The proposed inferential methods do not require loading the entire dataset at once in the computer memory and instead use only sequential access to data. This allows deployment of our methodology on low-resource computers where computations can be done in minutes on extremely large datasets. Our methods are motivated by and applied to a study where hundreds of subjects were scanned using Magnetic Resonance Imaging (MRI) at two visits roughly five years apart. The original data possess over ten billion measurements. The approach can be applied to any type of study where data can be unfolded into a long vector including densely observed functions and images. Supplemental materials are provided with source code for simulations, some technical details and proofs, and additional imaging results of the brain study.
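A minimal sketch of the sequential-access idea (not the authors' inferential method): per-subject files are read one at a time and only running sums are kept in memory, so the full data set never has to be loaded. The file layout and function name are assumptions made for illustration.

```python
import numpy as np

def streaming_visit_difference(subject_files):
    """Accumulate the mean visit-1-to-visit-2 difference over a long
    measurement vector, reading one subject file at a time."""
    total, n = None, 0
    for path in subject_files:
        visits = np.load(path)            # assumed shape (2, n_voxels): one row per visit
        diff = visits[1] - visits[0]
        total = diff if total is None else total + diff
        n += 1
    return total / n

# toy usage with synthetic "subjects" written to disk
import os, tempfile
tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
files = []
for i in range(5):
    path = os.path.join(tmp, f"subj{i}.npy")
    np.save(path, rng.standard_normal((2, 1000)))
    files.append(path)
print(streaming_visit_difference(files)[:5])
```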

3.
We show how to obtain a fast component-by-component construction algorithm for higher order polynomial lattice rules. Such rules are useful for multivariate quadrature of high-dimensional smooth functions over the unit cube as they achieve the near optimal order of convergence. The main problem addressed in this paper is to find an efficient way of computing the worst-case error. A general algorithm is presented and explicit expressions for base 2 are given. To obtain an efficient component-by-component construction algorithm we exploit the structure of the underlying cyclic group. We compare our new higher order multivariate quadrature rules to existing quadrature rules based on higher order digital nets by computing their worst-case error. These numerical results show that the higher order polynomial lattice rules improve upon the known constructions of quasi-Monte Carlo rules based on higher order digital nets.

4.
Many problems in genomics are related to variable selection where high-dimensional genomic data are treated as covariates. Such genomic covariates often have certain structures and can be represented as vertices of an undirected graph. Biological processes also vary as functions depending upon some biological state, such as time. High-dimensional variable selection where the covariates are graph-structured and the underlying model is nonparametric presents an important but largely unaddressed statistical challenge. Motivated by the problem of regression-based motif discovery, we consider the problem of variable selection for high-dimensional nonparametric varying-coefficient models and introduce a sparse structured shrinkage (SSS) estimator based on basis function expansions and a novel smoothed penalty function. We present an efficient algorithm for computing the SSS estimator. Results on model selection consistency and estimation bounds are derived. Moreover, finite-sample performances are studied via simulations, and the effects of high-dimensionality and structural information of the covariates are especially highlighted. We apply our method to the motif-finding problem using a yeast cell-cycle gene expression dataset and word counts in genes' promoter sequences. Our results demonstrate that the proposed method can result in better variable selection and prediction for high-dimensional regression when the underlying model is nonparametric and covariates are structured. Supplemental materials for the article are available online.
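The sketch below illustrates only the basis-expansion idea behind varying-coefficient selection: each coefficient function is expanded on a small basis in the state variable, and a sparse regression is fitted to the expanded design. It substitutes a plain lasso for the paper's structured, smoothed penalty and uses a crude polynomial basis; every name, dimension, and tuning value is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 20                       # 20 candidate covariates, 2 truly active
t = rng.uniform(0, 1, n)             # "biological state", e.g. time
X = rng.standard_normal((n, p))

# true varying coefficients: beta_0(t) = 2t, beta_1(t) = 1 - t, the rest are zero
y = X[:, 0] * (2 * t) + X[:, 1] * (1 - t) + 0.1 * rng.standard_normal(n)

# expand each coefficient on the polynomial basis phi(t) = (1, t, t^2)
basis = np.column_stack([np.ones(n), t, t**2])           # n x 3
Z = np.hstack([X[:, [j]] * basis for j in range(p)])     # n x (3p) expanded design

fit = Lasso(alpha=0.05).fit(Z, y)
coef = fit.coef_.reshape(p, 3)
selected = np.where(np.abs(coef).sum(axis=1) > 1e-6)[0]
print("selected covariates:", selected)                  # ideally close to {0, 1}
```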

5.
Inference for spatial generalized linear mixed models (SGLMMs) for high-dimensional non-Gaussian spatial data is computationally intensive. The computational challenge is due to the high-dimensional random effects and because Markov chain Monte Carlo (MCMC) algorithms for these models tend to be slow mixing. Moreover, spatial confounding inflates the variance of fixed effect (regression coefficient) estimates. Our approach addresses both the computational and confounding issues by replacing the high-dimensional spatial random effects with a reduced-dimensional representation based on random projections. Standard MCMC algorithms mix well and the reduced-dimensional setting speeds up computations per iteration. We show, via simulated examples, that Bayesian inference for this reduced-dimensional approach works well in terms of both inference and prediction; our methods also compare favorably to existing "reduced-rank" approaches. We also apply our methods to two real-world data examples, one on bird count data and the other on classifying rock types. Supplementary material for this article is available online.
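A rough sketch of the random-projection idea only (not the full SGLMM machinery): the n-dimensional spatial random effect is replaced by an m-dimensional representation obtained from a randomized range finder applied to the spatial covariance, so that MCMC would only need to update the m-dimensional coefficients. The covariance form, range parameter, and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 30                          # n spatial locations, m-dimensional reduction

# spatial locations and an exponential covariance for the latent field
coords = rng.uniform(0, 1, (n, 2))
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
C = np.exp(-d / 0.2)

# randomized range finder: project C onto m random directions and orthonormalize
Omega = rng.standard_normal((n, m))
Q, _ = np.linalg.qr(C @ Omega)           # n x m basis for the dominant range of C

# the n-dimensional random effect w is approximated by Q @ delta with delta in R^m
delta = rng.standard_normal(m)
w_approx = Q @ delta

# how much of the covariance the projection captures
C_approx = Q @ Q.T @ C
print("relative approximation error:", np.linalg.norm(C - C_approx) / np.linalg.norm(C))
```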

6.
Highly structured generalised response models, such as generalised linear mixed models and generalised linear models for time series regression, have become an indispensable vehicle for data analysis and inference in many areas of application. However, their use in practice is hindered by high-dimensional intractable integrals. Quasi-Monte Carlo (QMC) is a dynamic research area in the general problem of high-dimensional numerical integration, although its potential for statistical applications is yet to be fully explored. We survey recent research in QMC, particularly lattice rules, and report on its application to highly structured generalised response models. New challenges for QMC are identified and new methodologies are developed. QMC methods are seen to provide significant improvements compared with ordinary Monte Carlo methods.
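As a toy illustration of a lattice-rule QMC estimate versus ordinary Monte Carlo (a simplified sketch, not the methodology surveyed in the paper), the code below uses a Korobov-type rank-1 lattice with an arbitrarily chosen multiplier; a production rule would use a carefully constructed generating vector.

```python
import numpy as np

def korobov_lattice(n, s, a):
    """Rank-1 lattice points x_i = frac(i * z / n), i = 0..n-1, with the
    Korobov generating vector z = (1, a, a^2, ..., a^(s-1)) mod n."""
    z = np.array([pow(a, j, n) for j in range(s)])
    i = np.arange(n)[:, None]
    return (i * z % n) / n

s, n = 6, 1021                                           # dimension and (prime) point count
f = lambda x: np.exp(x.sum(axis=1)) / (np.e - 1) ** s    # integrates to 1 over [0,1]^s

x_qmc = korobov_lattice(n, s, a=76)                      # multiplier chosen ad hoc
x_mc = np.random.default_rng(0).uniform(size=(n, s))

print("QMC estimate:", f(x_qmc).mean())
print("MC  estimate:", f(x_mc).mean())
print("exact value : 1.0")
```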

7.
We introduce graphical time series models for the analysis of dynamic relationships among variables in multivariate time series. The modelling approach is based on the notion of strong Granger causality and can be applied to time series with non-linear dependences. The models are derived from ordinary time series models by imposing constraints that are encoded by mixed graphs. In these graphs each component series is represented by a single vertex; directed edges indicate possible Granger-causal relationships between variables, while undirected edges are used to map the contemporaneous dependence structure. We introduce various notions of Granger-causal Markov properties and discuss the relationships among them and to other Markov properties that can be applied in this context. Examples of graphical time series models include nonlinear autoregressive models and multivariate ARCH models.
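The sketch below is not the mixed-graph machinery of the paper; it only illustrates the underlying notion of Granger causality with a plain bivariate test from statsmodels on simulated data in which one series drives the other with a one-step lag.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
T = 500
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.3 * y[t - 1] + 0.6 * x[t - 1] + rng.standard_normal()   # x drives y

# tests whether the second column (x) Granger-causes the first column (y);
# the function prints F- and chi-square tests for each lag up to maxlag
data = np.column_stack([y, x])
res = grangercausalitytests(data, maxlag=2)
```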

8.
The grand tour and projection pursuit are two methods for exploring multivariate data. We show how to combine them into a dynamic graphical tool for exploratory data analysis, called a projection pursuit guided tour. This tool assists in clustering data when clusters are oddly shaped and in finding general low-dimensional structure in high-dimensional, and in particular, sparse data. An example shows that the method, which is projection-based, can be quite powerful in situations that may cause grief for methods based on kernel smoothing. The projection pursuit guided tour is also useful for comparing and developing projection pursuit indexes and illustrating some types of asymptotic results.

9.
We propose a method for defining and measuring spatial contagion between two financial markets via conditional copulas. Some theoretical results on monotonicity and asymptotic properties of Gaussian copulas with respect to conditioning are presented. Next, we combine the spatial contagion approach with time series models. We investigate which model from a large family of multivariate GARCH is the best tool for modelling spatial contagion. In an empirical study, we show that among models designed for general fit, a two-step model fitting procedure reduces the ability to describe the contagion effect. This is a feature of copula-GARCH models.

10.
We propose a flexible class of models based on scale mixture of uniform distributions to construct shrinkage priors for covariance matrix estimation. This new class of priors enjoys a number of advantages over the traditional scale mixture of normal priors, including its simplicity and flexibility in characterizing the prior density. We also exhibit a simple, easy-to-implement Gibbs sampler for posterior simulation, which leads to efficient estimation in high-dimensional problems. We first discuss the theory and computational details of this new approach and then extend the basic model to a new class of multivariate conditional autoregressive models for analyzing multivariate areal data. The proposed spatial model flexibly characterizes both the spatial and the outcome correlation structures at an appealing computational cost. Examples consisting of both synthetic and real-world data show the utility of this new framework in terms of robust estimation as well as improved predictive performance. Supplementary materials are available online.

11.
Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available, when the data are very high in dimension. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimensional low sample size (HDLSS) data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires the estimation of the eigenvalues of the covariance matrix for the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of Type I error, in the important case where there are a few very large eigenvalues. This article addresses this critical challenge using a novel likelihood-based soft-thresholding approach to estimate these eigenvalues, which leads to a much improved SigClust. Major improvements in SigClust performance are shown by both mathematical analysis, based on the new notion of theoretical cluster index (TCI), and extensive simulation studies. Applications to some cancer genomic data further demonstrate the usefulness of these improvements.
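A simplified SigClust-style computation is sketched below: the 2-means cluster index of the data is compared against its distribution under a single-Gaussian null whose covariance eigenvalues are estimated from the data. This sketch uses the raw sample eigenvalues rather than the paper's likelihood-based soft thresholding; all settings are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_index(X):
    """2-means cluster index: within-cluster sum of squares over total sum of squares."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    return km.inertia_ / total_ss

def sigclust_pvalue(X, n_sim=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ci_obs = cluster_index(X)
    # null: one Gaussian whose covariance eigenvalues are the sample eigenvalues
    # (the paper instead soft-thresholds these eigenvalues before simulating)
    eig = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0, None)
    ci_null = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal((n, d)) * np.sqrt(eig)   # diagonal null covariance
        ci_null[b] = cluster_index(Z)                     # index is rotation-invariant
    return ci_obs, (ci_null <= ci_obs).mean()

# toy example: two well-separated Gaussian clusters should give a small p-value
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((40, 10)), rng.standard_normal((40, 10)) + 3.0])
print(sigclust_pvalue(X, n_sim=50))
```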

12.
13.
Frequentist standard errors are a measure of uncertainty of an estimator and the basis for statistical inferences. Frequentist standard errors can also be derived for Bayes estimators. However, except in special cases, the computation of the standard error of Bayesian estimators requires bootstrapping, which in combination with Markov chain Monte Carlo can be highly time consuming. We discuss an alternative approach, based on importance sampling, for computing frequentist standard errors of Bayesian estimators. Through several numerical examples we show that our approach can be much more computationally efficient than the standard bootstrap.
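A minimal sketch of the baseline bootstrap idea, assuming a toy conjugate normal model in which the Bayes estimator is available in closed form; in the MCMC settings the abstract targets, each bootstrap replicate would require a full posterior simulation, which is what motivates the importance-sampling alternative.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=50)        # observed data

def posterior_mean(y, prior_mean=0.0, prior_var=10.0, sigma2=4.0):
    """Bayes estimator of the mean in a conjugate normal model with known variance."""
    n = len(y)
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    return post_var * (prior_mean / prior_var + y.sum() / sigma2)

# frequentist standard error of the Bayes estimator via the nonparametric bootstrap
B = 2000
boot = np.array([posterior_mean(rng.choice(y, size=len(y), replace=True))
                 for _ in range(B)])
print("Bayes estimate:", posterior_mean(y))
print("bootstrap SE  :", boot.std(ddof=1))
```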

14.
We consider in this paper efficient ways to generate multi-stage scenario trees. A general modified K-means clustering method is first presented to generate scenario trees with a general structure; this method takes the time dependency of the simulated paths into account. Based on the traditional and modified K-means analyses, the moment matching of multi-stage scenario trees is formulated as a linear programming (LP) problem. By combining simulation, clustering, non-linear time series modelling and moment matching, we propose a sequential generation method and a new hybrid approach that generates the whole multi-stage tree in one pass. The advantages of these new methods are: a vector autoregressive and multivariate generalized autoregressive conditional heteroscedasticity (VAR-MGARCH) model is adopted to properly reflect the inter-stage dependency and the time-varying volatilities of the data process; the LP-based moment matching technique allows the scenario tree generation problem to be solved more efficiently and the tree scale to be further controlled; and, at the same time, the statistical properties of the random data process are maintained properly. More importantly, our new LP-based methods guarantee that at least two branches emanate from each non-leaf node, overcoming a drawback of related approaches. We carry out a series of numerical experiments and apply the scenario tree generation methods to a portfolio management problem; the results demonstrate the practicality, efficiency and advantages of our new approaches over other models and methods.
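The sketch below illustrates only the sequential K-means clustering step of scenario tree generation: simulated paths are clustered stage by stage within each node to create its children, with node values given by cluster centres and probabilities by cluster shares. It simulates paths from a plain AR(1) instead of a VAR-MGARCH model and omits the LP-based moment matching; the branching structure and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# simulate return paths (a plain AR(1) stands in for the VAR-MGARCH simulation)
n_paths, n_stages = 2000, 3
paths = np.zeros((n_paths, n_stages))
for t in range(n_stages):
    prev = paths[:, t - 1] if t > 0 else 0.0
    paths[:, t] = 0.3 * prev + 0.02 * rng.standard_normal(n_paths)

branching = [3, 3, 2]                      # children per node at each stage
nodes = {(): np.arange(n_paths)}           # node id (tuple of branch choices) -> path indices
tree = {}
for t, k in enumerate(branching):
    new_nodes = {}
    for node, idx in nodes.items():
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(paths[idx, t][:, None])
        for c in range(k):
            child_idx = idx[km.labels_ == c]
            child = node + (c,)
            new_nodes[child] = child_idx
            # node value = cluster centre, probability = share of paths in the cluster
            tree[child] = (km.cluster_centers_[c, 0], len(child_idx) / n_paths)
    nodes = new_nodes

for node, (value, prob) in sorted(tree.items()):
    print(node, round(value, 4), round(prob, 3))
```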

15.
This article proposes the generalized discrete autoregressive moving-average (GDARMA) model as a parsimonious and universally applicable approach for stationary univariate or multivariate time series. The GDARMA model can be applied to any type of quantitative time series. It allows moment properties to be computed in a unified way, and it exhibits the autocorrelation structure of the traditional ARMA model. This great flexibility is obtained by using data-specific variation operators, which is illustrated for the most common types of time series data, such as counts, integers, reals, and compositional data. The practical potential of the GDARMA approach is demonstrated by considering a time series of integers regarding votes for a change of the interest rate, and a time series of compositional data regarding television market shares.

16.
Latent or unobserved phenomena pose a significant difficulty in data analysis as they induce complicated and confounding dependencies among a collection of observed variables. Factor analysis is a prominent multivariate statistical modeling approach that addresses this challenge by identifying the effects of (a small number of) latent variables on a set of observed variables. However, the latent variables in a factor model are purely mathematical objects that are derived from the observed phenomena, and they do not have any interpretation associated with them. A natural approach for attributing semantic information to the latent variables in a factor model is to obtain measurements of some additional, plausibly useful covariates that may be related to the original set of observed variables, and to associate these auxiliary covariates with the latent variables. In this paper, we describe a systematic approach for identifying such associations. Our method is based on solving computationally tractable convex optimization problems, and it can be viewed as a generalization of the minimum-trace factor analysis procedure for fitting factor models via convex optimization. We analyze the theoretical consistency of our approach in a high-dimensional setting as well as its utility in practice via experimental demonstrations with real data.

17.
Random forests are a commonly used tool for classification and for ranking candidate predictors based on so-called variable importance measures, which attribute scores to the variables reflecting their importance. A drawback of variable importance measures is that there is no natural cutoff for discriminating between important and unimportant variables. Several approaches, for example approaches based on hypothesis testing, have been developed to address this problem. The existing testing approaches require the repeated computation of random forests. While those approaches might be computationally tractable in low-dimensional settings, in high-dimensional settings, which typically include thousands of candidate predictors, the computing time is enormous. In this article a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance, which is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues using simulation studies based on real data from high-dimensional binary classification settings. In these studies, the new approach controls the type I error and has at least comparable power at a substantially lower computational cost. Thus, it might be used as a computationally fast alternative to existing procedures for high-dimensional data settings where many variables do not carry any information. The new approach is implemented in the R package vita.
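A rough sketch of a heuristic importance test of this flavor (not necessarily the exact procedure implemented in vita): permutation importances are computed on held-out data, and a null distribution is formed by mirroring the non-positive importances around zero, which avoids refitting the forest many times. All data dimensions and settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p, p_signal = 400, 200, 5                 # many candidate predictors, few informative
X = rng.standard_normal((n, p))
y = (X[:, :p_signal].sum(axis=1) + 0.5 * rng.standard_normal(n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# permutation importance on held-out data (a stand-in for the paper's
# cross-validation-inspired modification of the importance measure)
imp = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0).importances_mean

# heuristic null distribution built from the non-positive importances
# (variables assumed to carry no information), mirrored around zero
null = np.concatenate([imp[imp <= 0], -imp[imp <= 0]])
pvals = np.array([(null >= v).mean() for v in imp])
# the signal variables 0..4 should appear, possibly with a few false positives
print("variables with p < 0.05:", np.where(pvals < 0.05)[0])
```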

18.
The purpose of this article is to review the findings of Professor Fujikoshi, which are primarily in multivariate analysis. He derived many asymptotic expansions for multivariate statistics, including MANOVA tests, dimensionality tests and latent roots, under normality and nonnormality. He has made a large contribution to the study of the theoretical accuracy of asymptotic expansions by deriving explicit error bounds. He has also contributed substantially to the important problem of variable selection, introducing "no additional information" hypotheses in several multivariate models and applying model selection criteria. More recently he has been tackling high-dimensional statistical problems. He has also worked on other topics in multivariate analysis, such as power comparisons for classes of tests and monotone transformations with improved approximations.

19.
The L1-median is a robust estimator of multivariate location with good statistical properties. Several algorithms for computing the L1-median are available; problem-specific algorithms can be used, as can general optimization routines. The aim is to compare different algorithms with respect to their precision and runtime. This is possible because all considered algorithms have been implemented in a standardized manner in the open-source environment R. In most situations, the algorithm based on the optimization routine NLM (non-linear minimization) clearly outperforms the other approaches. Its low computation time makes applications to large and high-dimensional data feasible.
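A minimal sketch of computing the L1-median with a general-purpose optimizer (SciPy's BFGS standing in for R's NLM routine discussed above); the data, starting value, and function name are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def l1_median(X, x0=None):
    """L1-median: the point minimizing the sum of Euclidean distances
    to the rows of X, found with a general-purpose optimizer."""
    if x0 is None:
        x0 = np.median(X, axis=0)            # coordinatewise median as starting value
    obj = lambda m: np.linalg.norm(X - m, axis=1).sum()
    return minimize(obj, x0, method="BFGS").x

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X[:10] += 20                                 # a few gross outliers
print("L1-median   :", np.round(l1_median(X), 3))
print("sample mean :", np.round(X.mean(axis=0), 3))   # pulled toward the outliers
```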

20.
A high-precision direct (HPD) symplectic method based on phase error for Hamiltonian systems
Hamiltonian systems are an important class of dynamical systems. Symplectic algorithms (such as generating-function methods, symplectic Runge-Kutta (SRK) methods, symplectic partitioned Runge-Kutta (SPRK) methods and multistep methods) are designed for Hamiltonian systems so as to preserve either the symplectic structure of phase space or the Hamiltonian function. In the time domain, however, a symplectic algorithm has the same numerical accuracy as a Runge-Kutta method of the same order; that is, symplectic algorithms also accumulate phase error during the computation, so the time-domain accuracy of the solution is limited, and after long-time computation the results in the time domain can become unrecognizable. To improve the time-domain accuracy of symplectic algorithms, the precise integration technique is introduced into the symplectic difference scheme, yielding an HPD-symplectic method based on phase error. The method satisfies the requirements of a symplectic scheme and therefore retains the desirable property of preserving the symplectic structure of the Hamiltonian system during discretization. At the same time, by refining the time step it greatly reduces the phase error of the symplectic algorithm and substantially improves the numerical accuracy of the solution in the time domain, nearly reaching machine precision with errors of order O(10^-13). For systems mixing high and low frequencies and for stiff systems, conventional symplectic algorithms can hardly simulate both the high- and low-frequency components accurately at a large step size; by computing with a refined time step, the HPD-symplectic method achieves accurate simulation of mixed high- and low-frequency components at large step sizes without extra computational cost. Numerical results verify the effectiveness and reliability of the method.
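The sketch below is not the HPD-symplectic scheme itself; it only illustrates the underlying point that refining the time step of a symplectic integrator reduces the phase (time-domain) error while the energy behaviour stays bounded, using symplectic Euler on a harmonic oscillator. Step sizes and substep counts are illustrative.

```python
import numpy as np

def symplectic_euler(q, p, h, n_sub):
    """Advance the harmonic oscillator H = (p^2 + q^2)/2 over one output
    step h using n_sub symplectic-Euler substeps of size h / n_sub."""
    dt = h / n_sub
    for _ in range(n_sub):
        p -= dt * q          # kick:  dp/dt = -dH/dq = -q
        q += dt * p          # drift: dq/dt =  dH/dp =  p
    return q, p

h, n_steps = 0.5, 200        # coarse output step and number of output steps
for n_sub in (1, 100):
    q, p = 1.0, 0.0          # exact solution: q(t) = cos(t), p(t) = -sin(t)
    for _ in range(n_steps):
        q, p = symplectic_euler(q, p, h, n_sub)
    t = n_steps * h
    print(f"substeps={n_sub:4d}  q error={abs(q - np.cos(t)):.2e}  "
          f"energy error={abs(0.5 * (p * p + q * q) - 0.5):.2e}")
```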
