Similar literature
1.
Estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy, and medicine. Although trees are estimated, their uncertainties are generally discarded in statistical models for tree-valued data. Here, we explicitly model the multivariate uncertainty of tree estimates. We consider both the cases where uncertainty information arises extrinsically (through covariate information) and intrinsically (through the tree estimates themselves). The latter case is applicable to any procedure for tree estimation, and thus has broad relevance to the entire field of phylogenetics. The importance of accounting for tree uncertainty in tree space is demonstrated in two case studies. In the first instance, differences between gene trees are small relative to their uncertainties, while in the second, the differences are relatively large. Our main goal is visualization of tree uncertainty, and we demonstrate advantages of our method with respect to reproducibility, speed, and preservation of topological differences compared to visualization based on multidimensional scaling. The proposal highlights that phylogenetic trees are estimated in an extremely high-dimensional space, resulting in uncertainty information that cannot be discarded. Most importantly, it is a method that allows biologists to diagnose whether differences between gene trees are biologically meaningful or due to uncertainty in estimation.

2.
Three-dimensional data arrays (collections of individual data matrices) are increasingly prevalent in modern data and pose unique challenges to pattern extraction and visualization. This article introduces a biclustering technique for exploration and pattern detection in such complex structured data. The proposed framework couples the popular plaid model together with tools from functional data analysis to guide the estimation of bicluster responses over the array. We present an efficient algorithm that first detects biclusters that exhibit strong deviations for some data matrices, and then estimates their responses over the entire data array. Altogether, the framework is useful to home in on and display underlying structure and its evolution over conditions/time. The methods are scalable to large datasets, and can accommodate a variety of dynamic patterns. The proposed techniques are illustrated on gene expression data and bilateral trade networks. Supplementary materials are available online.

3.
Time series data with periodic trends, such as daily temperatures or sales of seasonal products, fluctuate between highs and lows throughout the year. Generalized least squares estimators are often computed for such time series data because these estimators have minimum variance among all linear unbiased estimators. However, the generalized least squares solution can require extremely demanding computation when the dataset is large. This paper studies an efficient algorithm for generalized least squares estimation in periodic trended regression with autoregressive errors. We develop an algorithm that can substantially simplify generalized least squares computation by reducing large sets of data to smaller sets. This is accomplished by constructing a structured matrix for dimension reduction. Simulations show that the new computation methods using our algorithm can drastically reduce computing time. Our algorithm can be easily adapted to big data that show periodic trends, which are often pertinent to economics, environmental studies, and engineering practices.
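To make the setting concrete, here is a minimal sketch of generalized least squares for a periodic trend with AR(1) errors. It is not the dimension-reduction algorithm of the paper, which the abstract does not specify; it uses the standard Prais-Winsten whitening transform instead. The function names, the single-harmonic design (intercept, cosine, sine), and the assumption that the autoregressive coefficient `phi` is known are all illustrative choices.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def design_row(t, period):
    """Regressors at time t: intercept plus one harmonic (hypothetical design)."""
    angle = 2.0 * math.pi * t / period
    return [1.0, math.cos(angle), math.sin(angle)]


def gls_ar1(y, period, phi):
    """GLS estimate for y_t = x_t' beta + e_t with e_t = phi * e_{t-1} + w_t.

    Assumes |phi| < 1 and that phi is known. The data are whitened with the
    Prais-Winsten transform, then ordinary least squares is applied.
    """
    n = len(y)
    k = 3
    rows = [design_row(t, period) for t in range(n)]
    scale = math.sqrt(1.0 - phi * phi)
    # First observation is rescaled; later ones are quasi-differenced.
    wy = [scale * y[0]] + [y[t] - phi * y[t - 1] for t in range(1, n)]
    wx = [[scale * v for v in rows[0]]]
    wx += [[rows[t][j] - phi * rows[t - 1][j] for j in range(k)] for t in range(1, n)]
    # Normal equations of the whitened regression.
    xtx = [[sum(r[i] * r[j] for r in wx) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(wx, wy)) for i in range(k)]
    return solve_linear(xtx, xty)
```

Forming the full error covariance matrix would cost memory quadratic in the series length, whereas the whitening above touches each observation once. This is one common way to keep GLS cheap for AR errors, shown here only for orientation.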

4.
This article introduces a classification tree algorithm that can simultaneously reduce tree size, improve class prediction, and enhance data visualization. We accomplish this by fitting a bivariate linear discriminant model to the data in each node. Standard algorithms can produce fairly large tree structures because they employ a very simple node model, wherein the entire partition associated with a node is assigned to one class. We reduce the size of our trees by letting the discriminant models share part of the data complexity. Being themselves classifiers, the discriminant models can also help to improve prediction accuracy. Finally, because the discriminant models use only two predictor variables at a time, their effects are easily visualized by means of two-dimensional plots. Our algorithm does not simply fit discriminant models to the terminal nodes of a pruned tree, as this does not reduce the size of the tree. Instead, discriminant modeling is carried out in all phases of tree growth and the misclassification costs of the node models are explicitly used to prune the tree. Our algorithm is also distinct from the “linear combination split” algorithms that partition the data space with arbitrarily oriented hyperplanes. We use axis-orthogonal splits to preserve the interpretability of the tree structures. An extensive empirical study with real datasets shows that, in general, our algorithm has better prediction power than many other tree or nontree algorithms.

5.
In this article, we propose a new framework for matrix factorization based on principal component analysis (PCA) where sparsity is imposed. The structure to impose sparsity is defined in terms of groups of correlated variables found in correlation matrices or maps. The framework is based on three new contributions: an algorithm to identify the groups of variables in correlation maps, a visualization for the resulting groups, and a matrix factorization. Together with a method to compute correlation maps with minimum noise level, referred to as missing-data for exploratory data analysis (MEDA), these three contributions constitute a complete matrix factorization framework. Two real examples are used to illustrate the approach and compare it with PCA, sparse PCA, and structured sparse PCA. Supplementary materials for this article are available online.

6.
We propose a new variational Bayes (VB) estimator for high-dimensional copulas with discrete, or a combination of discrete and continuous, margins. The method is based on a variational approximation to a tractable augmented posterior and is faster than previous likelihood-based approaches. We use it to estimate drawable vine copulas for univariate and multivariate Markov ordinal and mixed time series. These have dimension rT, where T is the number of observations and r is the number of series, and are difficult to estimate using previous methods. The vine pair-copulas are carefully selected to allow for heteroscedasticity, which is a feature of most ordinal time series data. When combined with flexible margins, the resulting time series models also allow for other common features of ordinal data, such as zero inflation, multiple modes, and under- or overdispersion. Using six example series, we illustrate both the flexibility of the time series copula models and the efficacy of the VB estimator for copulas of up to 792 dimensions and 60 parameters. This far exceeds the size and complexity of copula models for discrete data that can be estimated using previous methods. An online appendix and MATLAB code implementing the method are available as supplementary materials.

7.
Summary: An increasingly important problem in exploratory data analysis and visualization is that of scale; more and more data sets are much too large to analyze using traditional techniques, either in terms of the number of variables or the number of records. One approach to addressing this problem is the development and use of multiresolution strategies, where we represent the data at different levels of abstraction or detail through aggregation and summarization. In this paper we present an overview of our recent and current activities in the development of a multiresolution exploratory visualization environment for large-scale multivariate data. We have developed visualization, interaction, and data management techniques for effectively dealing with data sets that contain millions of records and/or hundreds of dimensions, and propose methods for applying similar approaches to extend the system to handle nominal as well as ordinal data.

8.
Our randomized additive preconditioners are readily available and regularly facilitate the solution of linear systems of equations and eigen-solving for a very large class of input matrices. We study the generation of such preconditioners and their impact on the rank and the condition number of a matrix. We also propose some techniques for their refinement and two alternative versions of randomized preprocessing. Our analysis and experiments show the power of our approach even where we employ weak randomization, that is, generate sparse and structured preconditioners defined by a small number of random parameters.

9.
This article presents a method for visualization of multivariate functions. The method is based on a tree structure, called the level set tree, built from separated parts of level sets of a function. The method is applied for visualization of estimates of multivariate density functions. With different graphical representations of level set trees we may visualize the number and location of modes, excess masses associated with the modes, and certain shape characteristics of the estimate. Simulation examples are presented where projecting data to two dimensions does not help to reveal the modes of the density, but with the help of level set trees one may detect the modes. I argue that level set trees provide a useful method for exploratory data analysis.
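The basic ingredient of a level set tree, the separated parts of a level set, can be illustrated in one dimension. The sketch below is not the article's method: it only finds the connected components of the set where a function is at least a given level on a grid, which is what the nodes of such a tree correspond to. The two-component normal mixture, the grid, and all function names are assumptions chosen for illustration.

```python
import math


def mixture_density(x):
    """Equal-weight mixture of N(-2, 0.5^2) and N(2, 0.5^2) (illustrative density)."""
    def normal_pdf(value, mu, sigma):
        z = (value - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)


def level_set_components(values, level):
    """Return (start, end) grid-index intervals on which values >= level.

    Each interval is one connected component of the level set on the grid.
    """
    components = []
    start = None
    for i, v in enumerate(values):
        if v >= level:
            if start is None:
                start = i          # a new component begins here
        elif start is not None:
            components.append((start, i - 1))
            start = None
    if start is not None:          # component running to the end of the grid
        components.append((start, len(values) - 1))
    return components


# Evaluate the density on a regular grid over [-5, 5].
grid = [-5.0 + 0.01 * i for i in range(1001)]
density = [mixture_density(x) for x in grid]

# A low level gives one connected set; a higher level splits it in two,
# one component per mode. A level set tree records such splits.
low = level_set_components(density, 0.0001)
high = level_set_components(density, 0.1)
print(len(low), len(high))
```

In higher dimensions the components can no longer be read off as intervals, which is where a tree representation becomes useful. The one-dimensional case is shown only because its components are easy to check by eye.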

10.
Single-index models have found applications in econometrics and biometrics, where multidimensional regression models are often encountered. This article proposes a nonparametric estimation approach that combines wavelet methods for nonequispaced designs with Bayesian models. We consider a wavelet series expansion of the unknown regression function and set prior distributions for the wavelet coefficients and the other model parameters. To ensure model identifiability, the direction parameter is represented via its polar coordinates. We employ ad hoc hierarchical mixture priors that perform shrinkage on wavelet coefficients and use Markov chain Monte Carlo methods for a posteriori inference. We investigate an independence-type Metropolis-Hastings algorithm to produce samples for the direction parameter. Our method leads to simultaneous estimates of the link function and of the index parameters. We present results on both simulated and real data, where we look at comparisons with other methods.

11.
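For the three-dimensional, multi-container loading optimization problem with heterogeneous cargo under seven practical constraints, a heuristic search algorithm based on “blocks” and “spaces” is proposed. The algorithm adopts a tree search strategy: given the available space, it evaluates the candidate cargo blocks at each search step and selects the best block, continuing until no space is available or no loadable cargo remains. Computational results on open standard benchmark data show that the algorithm outperforms existing comparable studies in both time efficiency and volume utilization. In addition, a 3D visualization software tool for container loading layout optimization was developed on the .NET platform; it has been adopted by logistics companies, which confirms the practicality of the algorithm.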

12.
Certain practical and theoretical challenges surround the estimation of finite mixture models. One such challenge is how to determine the number of components when this is not assumed a priori. Available methods in the literature are primarily numerical and lack any substantial visualization component. Traditional numerical methods include the calculation of information criteria and bootstrapping approaches; however, such methods have known technical issues regarding the necessary regularity conditions for testing the number of components. The ability to visualize an appropriate number of components for a finite mixture model could serve to supplement the results from traditional methods or provide visual evidence when results from such methods are inconclusive. Our research fills this gap through development of a visualization tool, which we call a mixturegram. This tool is easy to implement and provides a quick way for researchers to assess the number of components for their hypothesized mixture model. Mixtures of univariate or multivariate data can be assessed. We validate our visualization assessments by comparing with results from information criteria and an ad hoc selection criterion based on calculations used for the mixturegram. We also construct the mixturegram for two datasets.

13.
We consider efficient ways to generate multi-stage scenario trees. A general modified K-means clustering method is first presented to generate a scenario tree with a general structure. This method takes the time dependency of the simulated path into account. Based on the traditional and modified K-means analyses, the moment matching of multi-stage scenario trees is described as a linear programming (LP) problem. By simultaneously utilizing simulation, clustering, non-linear time series, and moment matching techniques, we propose a sequential generation method and a new hybrid approach that generates the whole multi-stage tree at once. The advantages of these new methods are as follows: the vector autoregressive and multivariate generalized autoregressive conditional heteroscedasticity (VAR-MGARCH) model is adopted to properly reflect the inter-stage dependency and the time-varying volatilities of the data process; the LP-based moment matching technique ensures that the scenario tree generation problem can be solved more efficiently and that the tree scale can be further controlled; and, at the same time, the statistical properties of the random data process are properly maintained. More importantly, our new LP methods guarantee that at least two branches are derived from each non-leaf node, and thus overcome a drawback of related papers. We carry out a series of numerical experiments and apply the scenario tree generation methods to a portfolio management problem, which demonstrate the practicality, efficiency, and advantages of our new approaches over other models or methods.

14.
Regression density estimation is the problem of flexibly estimating a response distribution as a function of covariates. An important approach to regression density estimation uses finite mixture models and our article considers flexible mixtures of heteroscedastic regression (MHR) models where the response distribution is a normal mixture, with the component means, variances, and mixture weights all varying as a function of covariates. Our article develops fast variational approximation (VA) methods for inference. Our motivation is that alternative computationally intensive Markov chain Monte Carlo (MCMC) methods for fitting mixture models are difficult to apply when it is desired to fit models repeatedly in exploratory analysis and model choice. Our article makes three contributions. First, a VA for MHR models is described where the variational lower bound is in closed form. Second, the basic approximation can be improved by using stochastic approximation (SA) methods to perturb the initial solution to attain higher accuracy. Third, the advantages of our approach for model choice and evaluation compared with MCMC-based approaches are illustrated. These advantages are particularly compelling for time series data where repeated refitting for one-step-ahead prediction in model choice and diagnostics and in rolling-window computations is very common. Supplementary materials for the article are available online.

15.
An optimal resilience evaluation of financial portfolios requires plausible hypotheses about the multiple interconnections between the macroeconomic variables and the risk parameters. In this article, we propose a graphical model for reconstructing the causal structure that links the multiple macroeconomic variables and the assessed risk parameters; we call this structure a stress testing network. In this model, the relationships between the macroeconomic variables and the risk parameter define a “relational graph” among their time-series, where related time-series are connected by an edge. Our proposal is based on temporal causal models but, unlike them, incorporates specific conditions in the structure that correspond to intrinsic characteristics of this type of network. Using the proposed model, and given the high-dimensional nature of the problem, we use regularization methods to efficiently detect causality in the time-series and reconstruct the underlying causal structure. In addition, we illustrate the use of the model on credit risk data of a portfolio. Finally, we discuss its uses and practical benefits in stress testing.

16.
Tree-structured models have been widely used because they function as interpretable prediction models that offer easy data visualization. A number of tree algorithms have been developed for univariate response data and can be extended to analyze multivariate response data. We propose a tree algorithm by combining the merits of a tree-based model and a mixed-effects model for longitudinal data. We alleviate variable selection bias through residual analysis, which is used to solve problems that exhaustive search approaches suffer from, such as undue preference to split variables with more possible splits, expensive computational cost, and end-cut preference. Most importantly, our tree algorithm discovers trends over time on each of the subspaces from recursive partitioning, while other tree algorithms predict responses. We investigate the performance of our algorithm with both simulation and real data studies. We also develop an R package melt that can be used conveniently and freely. Additional results are provided as online supplementary material.

17.
We investigate a class of time discretization schemes called “ETD Runge–Kutta methods,” where the linear terms of an ordinary differential equation are treated exactly, while the other terms are numerically integrated by a one-step method. These schemes, proposed by previous authors, can be regarded as modified Runge–Kutta methods whose coefficients are matrices instead of scalars. From this viewpoint, we reexamine the notion of consistency, convergence and order to provide a mathematical foundation for new methods. Applying the rooted tree analysis, expansion theorems of both the strict and numerical solutions are proved, and two types of order conditions are defined. Several classes of formulas with up to four stages that satisfy the conditions are constructed, and it is shown that the power series of matrices, employed as their coefficients, can be determined using the order conditions.

18.
Multidimensional multivariate data have been studied in different areas for quite some time. Commonly, the analysis goal is not to look into individual records but to understand the distribution of the records at large and to find clusters of records that exhibit correlations between dimensions or variables. We propose a visualization method that operates on density rather than individual records. To not restrict our search for clusters, we compute density in the given multidimensional space. Clusters are formed by areas of high density. We present an approach that automatically computes a hierarchical tree of high density clusters. For visualization purposes, we propose a method to project the multidimensional clusters to a 2D or 3D layout. The projection method uses an optimized star coordinates layout. The optimization procedure minimizes the overlap of projected clusters and maximally maintains the cluster shapes, compactness, and distribution. The star coordinate visualization allows for an interactive analysis of the distribution of clusters and comprehension of the relations between clusters and the original dimensions. Clusters are visualized using nested sequences of density level sets, leading to a quantitative understanding of information content, patterns, and relationships.
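As a point of reference for the projection step, the following sketch shows a plain star coordinates projection to 2D. It assumes unit-length axis vectors spaced evenly around a circle and data already scaled per dimension; the optimized layout described in the abstract, which adjusts the axes to reduce cluster overlap, is not reproduced here. The function names are hypothetical.

```python
import math


def star_axes(num_dims):
    """Unit axis vectors spaced evenly around the circle, one per data dimension."""
    return [
        (math.cos(2.0 * math.pi * k / num_dims), math.sin(2.0 * math.pi * k / num_dims))
        for k in range(num_dims)
    ]


def star_project(point, axes):
    """Project a point to 2D as the sum of its coordinates times the axis vectors."""
    x = sum(value * ax for value, (ax, _) in zip(point, axes))
    y = sum(value * ay for value, (_, ay) in zip(point, axes))
    return (x, y)


# Example with four dimensions: the axes point right, up, left, and down.
axes = star_axes(4)
projected = [star_project(p, axes) for p in ([1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1])]
```

Because the projection is linear, distinct points can land on the same 2D location (here, any point with equal coordinates maps to the origin). This is why the choice of axis vectors matters for keeping projected clusters apart.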

19.
Many problems in genomics are related to variable selection where high-dimensional genomic data are treated as covariates. Such genomic covariates often have certain structures and can be represented as vertices of an undirected graph. Biological processes also vary as functions depending upon some biological state, such as time. High-dimensional variable selection where covariates are graph-structured and the underlying model is nonparametric presents an important but largely unaddressed statistical challenge. Motivated by the problem of regression-based motif discovery, we consider the problem of variable selection for high-dimensional nonparametric varying-coefficient models and introduce a sparse structured shrinkage (SSS) estimator based on basis function expansions and a novel smoothed penalty function. We present an efficient algorithm for computing the SSS estimator. Results on model selection consistency and estimation bounds are derived. Moreover, finite-sample performances are studied via simulations, and the effects of high-dimensionality and structural information of the covariates are especially highlighted. We apply our method to a motif-finding problem using a yeast cell-cycle gene expression dataset and word counts in genes’ promoter sequences. Our results demonstrate that the proposed method can result in better variable selection and prediction for high-dimensional regression when the underlying model is nonparametric and covariates are structured. Supplemental materials for the article are available online.

