Similar Documents
1.
This article develops a generalization of the scatterplot matrix based on the recognition that most datasets include both categorical and quantitative information. Traditional grids of scatterplots often obscure important features of the data when one or more variables are categorical but coded as numerical. The generalized pairs plot offers a range of displays of paired combinations of categorical and quantitative variables. A mosaic plot, fluctuation diagram, or faceted bar chart may be used to display two categorical variables. A side-by-side boxplot, stripplot, faceted histogram, or density plot helps visualize a categorical and a quantitative variable. A traditional scatterplot is suitable for displaying a pair of numerical variables, but options also support density contours or annotating summary statistics, such as the correlation and the number of missing values. By combining these, the generalized pairs plot may help reveal structure in multivariate data that might otherwise go unnoticed during exploratory data analysis. Two R packages, gpairs and GGally, provide implementations of the generalized pairs plot. Supplementary materials for this article are available online on the journal web site.
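The core pairing logic described in this abstract — choosing a panel type from the types of the two variables — can be sketched in a few lines. This is a hedged illustration in Python, not the gpairs/GGally API; the function and variable names are invented for the example:

```python
def panel_type(x_is_categorical, y_is_categorical):
    """Choose a display for one off-diagonal panel of a generalized
    pairs plot, following the pairing logic in the abstract."""
    if x_is_categorical and y_is_categorical:
        return "mosaic"        # or fluctuation diagram / faceted bar chart
    if x_is_categorical or y_is_categorical:
        return "boxplot"       # or stripplot / faceted histogram / density
    return "scatterplot"       # optionally with density contours

# Build the grid of panel choices for a toy mixed-type dataset.
var_types = {"species": True, "sepal_len": False, "petal_len": False}
names = list(var_types)
grid = {(a, b): panel_type(var_types[a], var_types[b])
        for a in names for b in names if a != b}
print(grid[("species", "sepal_len")])   # boxplot
print(grid[("sepal_len", "petal_len")]) # scatterplot
```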

2.
The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics, they remain a severely underutilized visualization tool in modern data analysis. This article introduces superheat, a new R package that provides an extremely flexible and customizable platform for visualizing complex datasets. Superheat produces attractive and extendable heatmaps to which the user can add a response variable as a scatterplot, model results as boxplots, correlation information as barplots, and more. The goal of this article is twofold: (1) to demonstrate the potential of the heatmap as a core visualization method for a range of data types, and (2) to highlight the customizability and ease of implementation of the superheat R package for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package are explored via three reproducible case studies, each based on publicly available data sources.

3.
Discussion     
This article proposes a new hybrid visualization technique that integrates a frequency-based model with a generalized parallel coordinate plot (GPCP), thereby mitigating the visual clutter of GPCPs. In the new technique, GPCP profile lines (or curves) with similar frequencies are detected and rendered with a color intensity corresponding to their frequency. The technique may also be employed to enhance a family of related visualization tools, such as the Andrews plot and the scatterplot matrix. Besides reducing visual clutter in multivariate data visualization, the technique is computationally feasible, easy to implement, and has useful mathematical and statistical properties. Its reliability and accuracy are demonstrated through extensive experiments on challenging simulated and real datasets that are too high-dimensional and large to be explored with GPCPs or frequency-based techniques alone.

The datasets for pollen, OUT5D, and California housing are available in the online supplements.

4.
Abstract

This article first illustrates the use of mosaic displays for the analysis of multiway contingency tables. We then introduce several extensions of mosaic displays designed to integrate graphical methods for categorical data with those used for quantitative data. The scatterplot matrix shows all pairwise (bivariate marginal) views of a set of variables in a coherent display. One analog for categorical data is a matrix of mosaic displays showing some aspect of the bivariate relation between all pairs of variables. The simplest case shows the bivariate marginal relation for each pair of variables. Another case shows the conditional relation between each pair, with all other variables partialled out. For quantitative data, this represents (a) a visualization of the conditional independence relations studied by graphical models, and (b) a generalization of partial residual plots. The conditioning plot, or coplot, shows a collection of partial views of several quantitative variables, conditioned by the values of one or more other variables. A direct analog of the coplot for categorical data is an array of mosaic plots of the dependence among two or more variables, stratified by the values of one or more given variables. Each such panel then shows the partial associations among the foreground variables; the collection of such plots shows how these associations change as the given variables vary.

5.
A new variable selection algorithm is developed for clustering based on mode association. In conventional mixture-model-based clustering, each mixture component is treated as one cluster and the separation between clusters is usually measured by the ratio of between- and within-component dispersion. In this article, we allow one cluster to contain several components, depending on whether they merge into one mode. The extent of separation between clusters is quantified using critical points on the ridgeline between two modes, which reflects the exact geometry of the density function. The computational foundation consists of the recently developed modal expectation–maximization (MEM) algorithm, which finds the modes of a Gaussian mixture density, and the ridgeline expectation–maximization (REM) algorithm, which finds the ridgeline passing through the critical points of the mixed density of two unimodal clusters. Forward selection is used to find a subset of variables that maximizes an aggregated index of pairwise cluster separability. Theoretical analysis of the procedure is provided. We experiment with both simulated and real datasets and compare with several state-of-the-art variable selection algorithms. Supplemental materials including an R package, datasets, and appendices for proofs are available online.
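The key idea that several mixture components can merge into a single cluster when they share a mode can be illustrated with a crude grid search in place of the MEM algorithm. This is a toy one-dimensional sketch, not the paper's method; the mixture parameters are invented:

```python
import math

def mixture_density(x, comps):
    # comps: list of (weight, mean, sd) for a 1-D Gaussian mixture
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in comps)

def grid_modes(comps, lo, hi, n=2001):
    # crude stand-in for mode finding: local maxima of the density on a fine grid
    xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    ys = [mixture_density(x, comps) for x in xs]
    return [xs[i] for i in range(1, n - 1) if ys[i] > ys[i - 1] and ys[i] > ys[i + 1]]

# Two nearby components merge into one mode; a distant one stays separate,
# so three components yield two clusters under mode association.
comps = [(0.3, 0.0, 1.0), (0.3, 1.0, 1.0), (0.4, 6.0, 1.0)]
modes = grid_modes(comps, -4.0, 10.0)
print(len(modes))  # 2
```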

6.
A simple measure of similarity for the construction of the market graph is proposed. The measure is based on the probability of the coincidence of the signs of the stock returns. This measure is robust, has a simple interpretation, is easy to calculate, and can be used as a measure of similarity between any number of random variables. For the case of pairwise similarity, the connection of this measure to Fechner's sign correlation is noted. The properties of the proposed measure of pairwise similarity are studied in comparison with the classic Pearson correlation. The simple measure of pairwise similarity is applied (in parallel with the classic correlation) to the study of Russian and Swedish market graphs. The new measure of similarity for more than two random variables is introduced and applied to a further, deeper analysis of the Russian and Swedish markets. Some interesting phenomena for the cliques and independent sets of the obtained market graphs are observed.
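The pairwise version of this measure is simple enough to state as code: estimate the probability that two return series have the same sign on a given day. A minimal sketch with invented toy data, not the authors' implementation (here a zero return is treated as non-positive):

```python
def sign_similarity(x, y):
    """Fraction of periods on which sign(x_t) == sign(y_t), an estimate of
    the probability that the two returns move in the same direction.
    Fechner's sign correlation is then 2*p - 1."""
    agree = sum(1 for a, b in zip(x, y) if (a > 0) == (b > 0))
    return agree / len(x)

r1 = [0.01, -0.02, 0.005, 0.03, -0.01]
r2 = [0.02, -0.01, -0.004, 0.01, -0.02]
print(sign_similarity(r1, r2))  # 0.8: signs agree on 4 of 5 days
```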

7.
We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme involves a novel symmetric use of the group-lasso norm and follows naturally from a particular parameterization of the model. Supplementary materials for this article are available online.

8.
A penalized approach is proposed for performing large numbers of parallel nonparametric analyses of either of two types: restricted likelihood ratio tests of a parametric regression model versus a general smooth alternative, and nonparametric regression. Compared with naïvely performing each analysis in turn, our techniques reduce computation time dramatically. Viewing the large collection of scatterplot smooths produced by our methods as functional data, we develop a clustering approach to summarize and visualize these results. Our approach is applicable to ultra-high-dimensional data, particularly data acquired by neuroimaging; we illustrate it with an analysis of developmental trajectories of functional connectivity at each of approximately 70,000 brain locations. Supplementary materials, including an appendix and an R package, are available online.

9.
In this article, we propose a new framework for matrix factorization based on principal component analysis (PCA) where sparsity is imposed. The structure to impose sparsity is defined in terms of groups of correlated variables found in correlation matrices or maps. The framework is based on three new contributions: an algorithm to identify the groups of variables in correlation maps, a visualization for the resulting groups, and a matrix factorization. Together with a method to compute correlation maps with minimum noise level, referred to as missing-data for exploratory data analysis (MEDA), these three contributions constitute a complete matrix factorization framework. Two real examples are used to illustrate the approach and compare it with PCA, sparse PCA, and structured sparse PCA. Supplementary materials for this article are available online.
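The grouping step — finding groups of correlated variables in a correlation map — can be sketched by thresholding the map and taking connected components. This is a simple stand-in to convey the idea, not the paper's MEDA-based algorithm; the threshold and toy correlations are invented:

```python
def correlated_groups(corr, names, thresh=0.7):
    """Group variables whose |correlation| exceeds thresh, via connected
    components of the thresholded correlation map."""
    adj = {n: {m for m in names if m != n and abs(corr[(n, m)]) >= thresh}
           for n in names}
    seen, groups = set(), []
    for n in names:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:                      # depth-first search over the graph
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

names = ["x1", "x2", "x3", "x4"]
vals = {("x1", "x2"): 0.9, ("x1", "x3"): 0.1, ("x1", "x4"): 0.0,
        ("x2", "x3"): 0.2, ("x2", "x4"): 0.1, ("x3", "x4"): -0.8}
corr = {**vals, **{(b, a): v for (a, b), v in vals.items()}}
print(correlated_groups(corr, names))  # [['x1', 'x2'], ['x3', 'x4']]
```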

10.
This paper presents an overview of methods for the analysis of data structured in blocks of variables or in groups of individuals. More specifically, regularized generalized canonical correlation analysis (RGCCA), which is a unifying approach for multiblock data analysis, is extended to be also a unifying tool for multigroup data analysis. The versatility and usefulness of our approach is illustrated on two real datasets.

11.
We propose using graph theoretic results to develop an infrastructure that tracks movement from a display of one set of variables to another. The illustrative example throughout is the real-time morphing of one scatterplot into another. Hurley and Oldford (J Comput Graph Stat 2010) made extensive use of the graph having variables as nodes and edges indicating a paired relationship between them. The present paper introduces several new graphs derivable from this one whose traversals can be described as particular movements through high-dimensional spaces. These are connected to known results in graph theory, and the graph-theoretic results are applied to the problem of visualizing high-dimensional data.

12.
Three-dimensional data arrays (collections of individual data matrices) are increasingly prevalent in modern data and pose unique challenges to pattern extraction and visualization. This article introduces a biclustering technique for exploration and pattern detection in such complex structured data. The proposed framework couples the popular plaid model together with tools from functional data analysis to guide the estimation of bicluster responses over the array. We present an efficient algorithm that first detects biclusters that exhibit strong deviations for some data matrices, and then estimates their responses over the entire data array. Altogether, the framework is useful to home in on and display underlying structure and its evolution over conditions/time. The methods are scalable to large datasets, and can accommodate a variety of dynamic patterns. The proposed techniques are illustrated on gene expression data and bilateral trade networks. Supplementary materials are available online.

13.
Penalized model-based clustering achieves the goal of reducing the number of variables during clustering, but identifying which variables are informative for the clustering then becomes a new problem. Existing work on this problem includes the pairwise-penalty model, which handles the case of equal variances across clusters. This article studies variable selection under heteroscedasticity, proposes two new models for heteroscedastic data, and provides their solutions and algorithms. Simulation results show that both new models perform better on heteroscedastic data.

14.
We propose a Bayesian approach for inference in the multivariate probit model, taking into account the association structure between binary observations. We model the association through the correlation matrix of the latent Gaussian variables. Conditional independence is imposed by setting some off-diagonal elements of the inverse correlation matrix to zero and this sparsity structure is modeled using a decomposable graphical model. We propose an efficient Markov chain Monte Carlo algorithm relying on a parameter expansion scheme to sample from the resulting posterior distribution. This algorithm updates the correlation matrix within a simple Gibbs sampling framework and allows us to infer the correlation structure from the data, generalizing methods used for inference in decomposable Gaussian graphical models to multivariate binary observations. We demonstrate the performance of this model and of the Markov chain Monte Carlo algorithm on simulated and real datasets. This article has online supplementary materials.

15.
Many graphical methods for displaying multivariate data consist of arrangements of multiple displays of one or two variables; scatterplot matrices and parallel coordinates plots are two such methods. In principle these methods generalize to arbitrary numbers of variables but become difficult to interpret for even moderate numbers of variables. This article demonstrates that the impact of high dimensions is much less severe when the component displays are clustered together according to some index of merit. Effectively, this clustering reduces the dimensionality and makes interpretation easier. For scatterplot matrices and parallel coordinates plots clustering of component displays is achieved by finding suitable permutations of the variables. I discuss algorithms based on cluster analysis for finding permutations, and present examples using various indices of merit.
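One simple way to find a permutation that places high-merit pairs in adjacent panels is a greedy nearest-neighbour ordering. This is a sketch of the general idea under invented toy data, not the cluster-analysis algorithms the article discusses:

```python
def greedy_order(merit, names):
    """Order variables so that adjacent displays (scatterplot panels,
    parallel-coordinate axes) score highly on a pairwise index of merit."""
    # seed with the single highest-merit ordered pair
    best = max(((a, b) for a in names for b in names if a != b),
               key=lambda p: merit[p])
    order = list(best)
    remaining = [n for n in names if n not in order]
    while remaining:
        # greedily extend with the best partner for the current endpoint
        nxt = max(remaining, key=lambda n: merit[(order[-1], n)])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# absolute correlations as the index of merit (toy, made symmetric)
names = ["a", "b", "c"]
vals = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.4}
merit = {**vals, **{(y, x): v for (x, y), v in vals.items()}}
print(greedy_order(merit, names))  # ['a', 'b', 'c']
```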

16.
Abstract

Trellis display is a framework for the visualization of data. Its most prominent aspect is an overall visual design, reminiscent of a garden trelliswork, in which panels are laid out into rows, columns, and pages. On each panel of the trellis, a subset of the data is graphed by a display method such as a scatterplot, curve plot, boxplot, 3-D wireframe, normal quantile plot, or dot plot. Each panel shows the relationship of certain variables conditional on the values of other variables. A number of display methods employed in the visual design of Trellis display enable it to succeed in uncovering the structure of data even when the structure is quite complicated. For example, Trellis display provides a powerful mechanism for understanding interactions in studies of how a response depends on explanatory variables. Three examples demonstrate this; in each case, we make important discoveries not appreciated in the original analyses. Several control methods are also essential to Trellis display. A control method is a technique for specifying information so that a display can be drawn. The control methods of Trellis display form a basic conceptual framework that can be used in designing software. We have demonstrated the viability of the control methods by implementing them in the S/S-PLUS system for graphics and data analysis, but they can be implemented in any software system with a basic capability for drawing graphs.
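The conditioning step at the heart of a Trellis layout — one panel per value of the given variable(s) — reduces to a group-by over the records. A minimal Python sketch of that step only, with invented field names; it is not the S/S-PLUS control-method API:

```python
from itertools import groupby

def trellis_panels(rows, given, x, y):
    """Split records into panels conditional on a 'given' variable;
    each panel holds the (x, y) pairs that would be graphed on it."""
    rows = sorted(rows, key=lambda r: r[given])  # groupby needs sorted input
    return {k: [(r[x], r[y]) for r in g]
            for k, g in groupby(rows, key=lambda r: r[given])}

data = [
    {"site": "A", "dose": 1, "yield": 2.1},
    {"site": "B", "dose": 1, "yield": 1.7},
    {"site": "A", "dose": 2, "yield": 2.9},
]
panels = trellis_panels(data, given="site", x="dose", y="yield")
print(sorted(panels))  # ['A', 'B'] — one panel per site
print(panels["A"])     # [(1, 2.1), (2, 2.9)]
```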

17.
Abstract

We present a method for graphically displaying regression data with Bernoulli responses. The method, which is based on the use of grayscale graphics to visualize contributions to a likelihood function, provides an analog of a scatterplot for logistic regression, as well as probit analysis. Furthermore, the method may be used in place of a traditional scatterplot in situations where such plots are often used.

18.
Fitting hierarchical Bayesian models to spatially correlated datasets using Markov chain Monte Carlo (MCMC) techniques is computationally expensive. Complicated covariance structures of the underlying spatial processes, together with high-dimensional parameter space, mean that the number of calculations required grows cubically with the number of spatial locations at each MCMC iteration. This necessitates efficient model parameterizations that hasten the convergence and improve the mixing of the associated algorithms. We consider partially centred parameterizations (PCPs) which lie on a continuum between what are known as the centered (CP) and noncentered parameterizations (NCP). By introducing a weight matrix we remove the conditional posterior correlation between the fixed and the random effects, and hence construct a PCP which achieves immediate convergence for a three-stage model, based on multiple Gaussian processes with known covariance parameters. When the covariance parameters are unknown we dynamically update the parameterization within the sampler. The PCP outperforms both the CP and the NCP and leads to a fully automated algorithm, as demonstrated in two simulation examples. The effectiveness of the spatially varying PCP is illustrated with a practical dataset of nitrogen dioxide concentration levels. Supplemental materials consisting of appendices, datasets, and computer code to reproduce the results are available online.

19.
The central objective of this paper is to develop a transparent, consistent, self-contained, and stable country risk rating model, closely approximating the country risk ratings provided by Standard and Poor's (S&P). The model should be non-recursive, i.e., it should not rely on the previous years' S&P ratings. The set of variables selected here includes not only economic-financial but also political variables. We propose a new model based on the novel combinatorial-logical technique of Logical Analysis of Data (which derives a new rating system only from the qualitative information representing pairwise comparisons of country riskiness). We also develop a method for deriving a rating system with any desired level of granularity. The accuracy of the proposed model's predictions, measured by its correlation coefficients with the S&P ratings, and confirmed by k-fold cross-validation, exceeds 95%. The stability of the constructed non-recursive model is shown in three ways: by the correlation of its predictions with those of other agencies (Moody's and The Institutional Investor), by predicting 1999 ratings using the non-recursive model derived from the 1998 dataset applied to the 1999 data, and by successfully predicting the ratings of several previously non-rated countries. This study provides new insights into the importance of variables: it supports the necessity of including political variables (in particular, "political stability") in the analysis alongside economic ones, and identifies "financial depth and efficiency" as a new critical factor in assessing country risk.

20.
A simple and yet powerful method is presented to estimate nonlinearly and nonparametrically the components of additive models using wavelets. The estimator enjoys the good statistical and computational properties of the Waveshrink scatterplot smoother, and it can be efficiently computed using the block coordinate relaxation optimization technique. A rule for the automatic selection of the smoothing parameters, suitable for data mining of large datasets, is derived. The wavelet-based method is then extended to estimate generalized additive models. A primal-dual log-barrier interior point algorithm is proposed to solve the corresponding convex programming problem. Based on an asymptotic analysis, a rule for selecting the smoothing parameters is derived, enabling the estimator to be fully automated in practice. We illustrate the finite-sample properties with a Gaussian and a Poisson simulation.
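The core operation behind a Waveshrink-style smoother is soft thresholding of the wavelet detail coefficients: shrink each coefficient toward zero by the smoothing parameter and kill small ones entirely. A minimal sketch of that step alone, with an illustrative threshold (the paper derives automatic selection rules):

```python
def soft_threshold(coeffs, lam):
    """Soft-threshold wavelet detail coefficients: values within lam of
    zero are set to zero; larger values are shrunk toward zero by lam."""
    return [(c - lam) if c > lam
            else (c + lam) if c < -lam
            else 0.0
            for c in coeffs]

print(soft_threshold([3.0, -0.2, 0.5, -2.0], 0.5))  # [2.5, 0.0, 0.0, -1.5]
```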
