20 similar documents found.
1.
《Journal of computational and graphical statistics》2013,22(2):444-472
We discuss methodology for multidimensional scaling (MDS) and its implementation in two software systems, GGvis and XGvis. MDS is a visualization technique for proximity data, that is, data in the form of N × N dissimilarity matrices. MDS constructs maps (“configurations,” “embeddings”) in R^k by interpreting the dissimilarities as distances. Two frequent sources of dissimilarities are high-dimensional data and graphs. When the dissimilarities are distances between high-dimensional objects, MDS acts as an (often nonlinear) dimension-reduction technique. When the dissimilarities are shortest-path distances in a graph, MDS acts as a graph layout technique. MDS has received recent attention in machine learning, motivated by image databases (“Isomap”). MDS is also of interest in view of the popularity of “kernelizing” approaches inspired by support vector machines (SVMs; “kernel PCA”). This article discusses the following general topics: (1) the stability and multiplicity of MDS solutions; (2) the analysis of structure within and between subsets of objects with missing-value schemes in dissimilarity matrices; (3) gradient descent for optimizing general MDS loss functions (“Strain” and “Stress”); and (4) a unification of classical (Strain-based) and distance (Stress-based) MDS. Particular topics include the following: (1) blending of automatic optimization with interactive displacement of configuration points to assist in the search for global optima; (2) forming groups of objects with interactive brushing to create patterned missing values in MDS loss functions; (3) optimizing MDS loss functions for large numbers of objects relative to a small set of anchor points (“external unfolding”); and (4) a nonmetric version of classical MDS. We show applications to the mapping of computer usage data, to the dimension reduction of marketing segmentation data, to the layout of mathematical graphs and social networks, and finally to the spatial reconstruction of molecules.
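A minimal NumPy sketch of the classical (Strain-based) MDS step the abstract refers to — double-centering the squared dissimilarities and taking an eigendecomposition. This is a textbook construction, not the GGvis/XGvis implementation:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Strain-based) MDS: double-center the squared
    dissimilarities and embed with the top-k eigenvectors."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # implied inner products
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]          # pick the k largest
    scale = np.sqrt(np.clip(vals[idx], 0.0, None))
    return vecs[:, idx] * scale               # N x k configuration

# Collinear points: their distances are reproduced exactly in k = 1.
X = np.array([[0.0], [1.0], [3.0]])
D = np.abs(X - X.T)
Y = classical_mds(D, k=1)
```

For Euclidean dissimilarities the recovered configuration reproduces the input distances up to rotation and reflection.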
2.
《Journal of computational and graphical statistics》2013,22(4):788-806
Many graphical methods for displaying multivariate data consist of arrangements of multiple displays of one or two variables; scatterplot matrices and parallel coordinates plots are two such methods. In principle these methods generalize to arbitrary numbers of variables, but they become difficult to interpret for even moderate numbers of variables. This article demonstrates that the impact of high dimensions is much less severe when the component displays are clustered together according to some index of merit. Effectively, this clustering reduces the dimensionality and makes interpretation easier. For scatterplot matrices and parallel coordinates plots, clustering of component displays is achieved by finding suitable permutations of the variables. I discuss algorithms based on cluster analysis for finding permutations, and present examples using various indices of merit.
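One way to realize the cluster-analysis idea is hierarchical clustering on 1 − |correlation|; a sketch, where the specific linkage and index of merit are illustrative assumptions rather than the article's algorithms:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def cluster_order(X):
    """Permutation of the columns of X that places highly correlated
    variables next to each other: hierarchical clustering with
    1 - |correlation| as the dissimilarity between variables."""
    C = np.corrcoef(X, rowvar=False)
    D = 1.0 - np.abs(C)
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return leaves_list(Z)        # variable indices in display order
```

The returned permutation can then order the rows/columns of a scatterplot matrix or the axes of a parallel coordinates plot.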
3.
For Gaussian process models, likelihood-based methods are often difficult to use with large irregularly spaced spatial datasets, because exact calculations of the likelihood for n observations require O(n^3) operations and O(n^2) memory. Various approximation methods have been developed to address the computational difficulties. In this article, we propose new, unbiased estimating equations (EE) based on score equation approximations that are both computationally and statistically efficient. We replace the inverse covariance matrix that appears in the score equations by a sparse matrix to approximate the quadratic forms, then set the resulting quadratic forms equal to their expected values to obtain unbiased EE. The sparse matrix is constructed by a sparse inverse Cholesky approach to approximate the inverse covariance matrix. The statistical efficiency of the resulting unbiased EE is evaluated both in theory and by numerical studies. Our methods are applied to nearly 90,000 satellite-based measurements of water vapor levels over a region in the Southeast Pacific Ocean.
4.
《Journal of computational and graphical statistics》2013,22(3):529-546
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.
5.
《Journal of computational and graphical statistics》2013,22(4):829-852
This article presents techniques for constructing classifiers that combine statistical information from training data with tangent approximations to known transformations; it demonstrates the techniques by applying them to a face recognition task. Our approach is to build Bayes classifiers with approximate class-conditional probability densities for measured data. The high dimension of the measurements in modern classification problems such as speech or image recognition makes inferring probability densities from feasibly sized training datasets difficult. We address the difficulty by imposing severely simplifying assumptions and exploiting a priori information about transformations to which classification should be invariant. For the face recognition task, we used a five-parameter group of such transformations consisting of rotation, shifts, and scalings. On the face recognition task, a classifier based on our techniques has an error rate that is 20% lower than that of the best algorithm in a reference software distribution.
6.
This article briefly introduces the principles and theory of multidimensional scaling (MDS) and of weighted multidimensional scaling, and applies MDS to analyze the perceived closeness of kinship relations. The analysis shows that 15 kinship relations fall into five broad classes, revealing how people mentally classify their relatives. These classes can be used to interpret certain relationships among relatives, to understand some social phenomena in interpersonal relations, and to help ease certain conflicts between people.
7.
Daniel A. Keim 《Journal of computational and graphical statistics》2013,22(1):58-77
An important goal of visualization technology is to support the exploration and analysis of very large amounts of data. This article describes a set of pixel-oriented visualization techniques that use each pixel of the display to visualize one data value and therefore allow the visualization of the largest amount of data possible. Most of the techniques have been specifically designed for visualizing and querying large databases. The techniques may be divided into query-independent techniques that directly visualize the data (or a certain portion of it) and query-dependent techniques that visualize the data in the context of a specific query. Examples of query-independent techniques are the screen-filling curve and recursive pattern techniques. The screen-filling curve techniques are based on the well-known Morton and Peano–Hilbert curve algorithms, and the recursive pattern technique is based on a generic recursive scheme that generalizes a wide range of pixel-oriented arrangements for visualizing large datasets. Examples of query-dependent techniques are the snake-spiral and snake-axes techniques, which visualize distances with respect to a database query and arrange the most relevant data items in the center of the display. In addition to describing the basic ideas of our techniques, we provide example visualizations generated by the various techniques, which demonstrate their usefulness and show some of their advantages and disadvantages.
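The Morton curve mentioned above maps two-dimensional pixel coordinates to a one-dimensional order by interleaving coordinate bits; a minimal sketch of the index computation (the layout details of the actual screen-filling techniques are omitted):

```python
def morton_index(x, y, bits=16):
    """Z-order (Morton) index of an integer pixel coordinate:
    interleave the bits of x and y so that items close in the
    one-dimensional data order land on nearby pixels."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z
```

Sorting pixels by `morton_index` traverses the display in the familiar recursive Z pattern.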
8.
《International Journal of Approximate Reasoning》2014,55(7):1487-1501
Real-life data associated with experimental outcomes are not always real-valued. In particular, opinions, perceptions, ratings, etc., are often vague in nature, especially when they come from human valuations. Fuzzy numbers have been considered extensively as a convenient tool to express such vague data. In analyzing fuzzy data from a statistical perspective, one finds two key obstacles, namely, the nonlinearity associated with the usual arithmetic for fuzzy data and the lack of suitable models and limit results for the distribution of fuzzy-valued statistics. These obstacles can frequently be bypassed by using an appropriate metric between fuzzy data, the notion of a random fuzzy set, and a bootstrapped central limit theorem for general space-valued random elements. This paper reviews these ideas and a methodology for the statistical analysis of fuzzy number data that has been developed over recent years.
9.
The asymptotic distribution of branching-type recursions (Ln) of the form Ln = A·Ln−1 + B·L′n−1 is investigated in the two-dimensional case. Here L′n−1 is an independent copy of Ln−1, and A, B are random matrices jointly independent of (Ln−1, L′n−1). The asymptotics of Ln after normalization are derived by a contraction method. The limiting distribution is characterized by a fixed-point equation. The assumptions of the convergence theorem are checked in some examples using eigenvalue decompositions and computer algebra.
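A toy simulation of such a branching recursion. The starting value and the choice of A, B as scaled random rotations are illustrative assumptions that satisfy a contraction condition; they are not taken from the paper:

```python
import numpy as np

def rand_rot(rng):
    """A random 2 x 2 rotation matrix."""
    t = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def sample_L(n, rng):
    """One draw of L_n from L_n = A L_{n-1} + B L'_{n-1}, where L' is
    an independent copy and fresh A, B are drawn at every node of the
    binary recursion tree (cost grows like 2^n, so keep n small)."""
    if n == 0:
        return np.array([1.0, 0.0])          # toy starting value
    A = rand_rot(rng) / np.sqrt(2.0)
    B = rand_rot(rng) / np.sqrt(2.0)
    return A @ sample_L(n - 1, rng) + B @ sample_L(n - 1, rng)
```

Repeated draws of `sample_L(n, rng)` for growing n give an empirical view of the limiting distribution the fixed-point equation characterizes.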
10.
《Journal of computational and graphical statistics》2013,22(1):101-120
In high-dimensional classification problems, one is often interested in finding a few important discriminant directions in order to reduce the dimensionality. Fisher's linear discriminant analysis (LDA) is a commonly used method. Although LDA is guaranteed to find the best directions when each class has a Gaussian density with a common covariance matrix, it can fail if the class densities are more general. Using a likelihood-based interpretation of Fisher's LDA criterion, we develop a general method for finding important discriminant directions without assuming the class densities belong to any particular parametric family. We also show that our method can be easily integrated with projection pursuit density estimation to produce a powerful procedure for (reduced-rank) nonparametric discriminant analysis.
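For reference, the Gaussian common-covariance case that LDA handles exactly can be sketched as follows; this is the textbook construction, not the article's nonparametric method:

```python
import numpy as np

def lda_direction(X, y):
    """First Fisher discriminant direction: leading eigenvector of
    W^{-1} B, where W and B are the within-class and between-class
    scatter matrices."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    W = sum((y == c).sum() * np.cov(X[y == c].T, bias=True)
            for c in classes)
    B = sum((y == c).sum() * np.outer(X[y == c].mean(axis=0) - mu,
                                      X[y == c].mean(axis=0) - mu)
            for c in classes)
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    return w / np.linalg.norm(w)
```

In the two-class case this reduces to the familiar direction W^{-1}(m1 − m0).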
11.
Parallel Variational Bayes for Large Datasets With an Application to Generalized Linear Mixed Models
Minh-Ngoc Tran David J. Nott Anthony Y. C. Kuk Robert Kohn 《Journal of computational and graphical statistics》2016,25(2):626-646
This article develops a hybrid variational Bayes (VB) algorithm that combines the mean-field and stochastic linear regression fixed-form VB methods. The new estimation algorithm can be used to approximate any posterior without relying on conjugate priors. We propose a divide-and-recombine strategy for the analysis of large datasets, which partitions a large dataset into smaller subsets and then combines the variational distributions that have been learned in parallel on each separate subset using the hybrid VB algorithm. We also describe an efficient model selection strategy using cross-validation, which is straightforward to implement as a by-product of the parallel run. The proposed method is applied to fitting generalized linear mixed models. The computational efficiency of the parallel and hybrid VB algorithm is demonstrated on several simulated and real datasets. Supplementary material for this article is available online.
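For Gaussian approximations, one standard recombination rule multiplies the subset posteriors and divides out the surplus copies of the prior; a sketch under that assumption, not the paper's exact hybrid VB update:

```python
import numpy as np

def combine_gaussians(mus, Sigmas, mu0, Sigma0):
    """Combine K Gaussian variational posteriors q_i = N(mu_i, Sigma_i)
    learned on disjoint data subsets under the prior N(mu0, Sigma0):
    multiply the subset posteriors and divide out the K-1 surplus
    copies of the prior, working in natural (precision) parameters."""
    K = len(mus)
    P0 = np.linalg.inv(Sigma0)
    P = sum(np.linalg.inv(S) for S in Sigmas) - (K - 1) * P0
    h = sum(np.linalg.inv(S) @ m for S, m in zip(Sigmas, mus)) \
        - (K - 1) * (P0 @ mu0)
    Sigma = np.linalg.inv(P)
    return Sigma @ h, Sigma
```

When the subset posteriors are exact Gaussians, this recombination reproduces the full-data posterior exactly.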
12.
Y.Y. Wu C.K. Chan L.X. Zhou 《Journal of Computational and Applied Mathematics》2011,235(13):3768-3774
Large eddy simulation (LES) using a dynamic eddy-viscosity subgrid-scale stress model and a fast-chemistry combustion model, without accounting for finite-rate chemical kinetics, is applied to study the ignition and propagation of a turbulent premixed V-flame. A progress-variable c-equation is applied to describe the flame front propagation. The equations are solved two-dimensionally by a projection-based fractional-step method for low-Mach-number flows. The flow field with a stabilizing rod, without reaction, is first obtained as the initial field, and ignition occurs just upstream of the stabilizing rod. The shape of the flame is affected by the velocity field; as the flame propagates, the vortices fade and move to locations along the flame front. The LES-computed time-averaged velocity agrees well with experimental data.
13.
Qihua Wang 《Journal of multivariate analysis》2003,85(2):234-252
Consider partial linear models of the form Y = X^τ β + g(T) + e, with Y measured with error and both the p-variate explanatory variable X and T measured exactly. Let Ỹ denote the error-prone surrogate variable for Y. Let the primary data set contain independent observations on (Ỹ, X, T) and the validation data set contain independent observations on (Y, Ỹ, X, T), where the exact observations on Y may be obtained by some expensive or difficult procedure for only a small subset of the subjects enrolled in the study. In this paper, without specifying any structural equations or distributional assumptions for Y given (Ỹ, X, T), a semiparametric dimension-reduction technique is employed to obtain estimators of β and g(·) based on the least-squares method and the kernel method with the primary and validation data. The proposed estimators of β are proved to be asymptotically normal, and the estimator of g(·) is proved to be weakly consistent with an optimal convergence rate.
14.
Let Um be an m×m Haar unitary matrix and U[m,n] its n×n truncation. In this paper a large deviation principle is proven for the empirical eigenvalue density of U[m,n] as m/n→λ and n→∞. The rate function and the limit distribution are given explicitly. U[m,n] is the random matrix model of quq, where u is a Haar unitary in a finite von Neumann algebra, q is a certain projection, and they are free. The limit distribution coincides with the Brown measure of the operator quq.
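A quick way to look at such spectra numerically: sample a Haar unitary via QR of a complex Ginibre matrix and truncate. This is a standard simulation recipe, independent of the paper's large deviation analysis:

```python
import numpy as np

def haar_unitary(m, rng):
    """Sample an m x m Haar unitary: QR of a complex Ginibre matrix,
    with the column-phase correction that makes the law exactly Haar."""
    Z = (rng.normal(size=(m, m)) + 1j * rng.normal(size=(m, m))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    d = np.diag(R)
    return Q * (d / np.abs(d))      # rescale each column's phase

rng = np.random.default_rng(0)
m, n = 40, 20
U = haar_unitary(m, rng)[:n, :n]    # the truncation U[m, n]
eigs = np.linalg.eigvals(U)         # eigenvalues lie inside the unit disk
```

Plotting `eigs` for growing m, n with m/n fixed gives an empirical view of the limiting eigenvalue density.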
15.
16.
Rajarshi Guhaniyogi Shaan Qamar David B. Dunson 《Journal of computational and graphical statistics》2018,27(3):657-672
We propose a conditional density filtering (C-DF) algorithm for efficient online Bayesian inference. C-DF adapts MCMC sampling to the online setting, sampling from approximations to conditional posterior distributions obtained by propagating surrogate conditional sufficient statistics (functions of the data and parameter estimates) as new data arrive. These quantities eliminate the need to store or process the entire dataset at once and offer a number of desirable features, often including reduced memory requirements and runtime and improved mixing, along with state-of-the-art parameter inference and prediction. These improvements are demonstrated through several illustrative examples, including an application to high-dimensional compressed regression. In cases where the dimension of the model parameter does not grow with time, we also establish sufficient conditions under which C-DF samples converge asymptotically to the target posterior distribution as sampling proceeds and more data arrive. Supplementary materials for C-DF are available online.
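The flavor of propagating sufficient statistics can be seen in a conjugate Gaussian linear model, where X'X and X'y summarize all past batches; a sketch in that spirit (this toy model and its names are assumptions, not the C-DF algorithm itself):

```python
import numpy as np

class OnlineBayesLinReg:
    """Toy sufficient-statistics filter for Bayesian linear regression:
    (X'X, X'y) are updated as batches arrive, so the conditional
    posterior of the coefficients can be sampled without ever
    revisiting old data."""
    def __init__(self, p, tau2=1.0, sigma2=1.0):
        self.XtX = np.eye(p) / tau2        # prior precision folded in
        self.Xty = np.zeros(p)
        self.sigma2 = sigma2

    def update(self, X, y):
        """Absorb a new batch into the sufficient statistics."""
        self.XtX += X.T @ X
        self.Xty += X.T @ y

    def draw(self, rng):
        """One draw from the conditional posterior of the coefficients."""
        mean = np.linalg.solve(self.XtX, self.Xty)
        cov = self.sigma2 * np.linalg.inv(self.XtX)
        return rng.multivariate_normal(mean, cov)
```

Memory stays O(p^2) no matter how many observations have streamed past.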
17.
Stuart Lipsitz Garrett Fitzmaurice Debajyoti Sinha Nathanael Hevelone Jim Hu Louis L. Nguyen 《Journal of computational and graphical statistics》2017,26(3):734-737
Medical studies increasingly involve a large sample of independent clusters, where the cluster sizes are also large. Our motivating example from the 2010 Nationwide Inpatient Sample (NIS) has 8,001,068 patients and 1049 clusters, with an average cluster size of 7627. Consistent parameter estimates can be obtained by naively assuming independence, but these are inefficient when the intra-cluster correlation (ICC) is high. Efficient generalized estimating equations (GEE) incorporate the ICC and sum over all pairs of observations within a cluster when estimating it. For the 2010 NIS there are 92.6 billion pairs of observations, making summation over pairs computationally prohibitive. We propose a one-step GEE estimator that (1) matches the asymptotic efficiency of the fully iterated GEE; (2) uses a simpler formula to estimate the ICC that avoids summing over all pairs; and (3) completely avoids matrix multiplications and inversions. These three features make the proposed estimator much less computationally intensive, especially with large cluster sizes. A unique contribution of this article is that it expresses the GEE estimating equations incorporating the ICC as a simple sum of vectors and scalars.
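The "avoid summing over all pairs" device rests on the algebraic identity sum_{j != k} r_j r_k = (sum_j r_j)^2 − sum_j r_j^2, applied within each cluster; a sketch:

```python
import numpy as np

def pair_sum(residuals_by_cluster):
    """Sum of r_j * r_k over all ordered pairs j != k within each
    cluster, computed from per-cluster totals via
        sum_{j != k} r_j r_k = (sum_j r_j)^2 - sum_j r_j^2,
    i.e. in O(n) time instead of enumerating the O(n^2) pairs."""
    total = 0.0
    for r in residuals_by_cluster:
        s = r.sum()
        total += s * s - (r ** 2).sum()
    return total
```

With average cluster sizes in the thousands, replacing pair enumeration by two per-cluster sums is what makes the computation feasible.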
18.
《Optimization》2012,61(2):401-421
We study the efficient set X_E of a multiple objective linear program by using its projection into the linear space L spanned by the independent criteria. We show that in the orthogonal complement of L the efficient points form a polyhedron, while in L an efficiency-equivalent polyhedron for the projection P(X_E) of X_E can be constructed by algorithms of the outer and inner approximation types. These algorithms can also be used for generating all extreme points of P(X_E). Application to optimization over the efficient set of a multiple objective linear program is considered.
19.
Two open problems concern the global existence of weak solutions to the Cauchy problem for the non-isentropic gas dynamics system: one is the small-initial-data problem allowing vacuum, and the other is the arbitrarily-large-initial-data problem. By introducing a scaling framework, this paper proves that the two problems are equivalent: for vanishing-viscosity solutions, a uniform BV estimate for the small-initial-data problem with vacuum implies the global existence of weak solutions for arbitrarily large initial data. The scaling framework also applies to most systems of hyperbolic conservation laws with a physical background.
20.
On the distribution of the length of the longest increasing subsequence of random permutations
Jinho Baik Percy Deift Kurt Johansson 《Journal of the American Mathematical Society》1999,12(4):1119-1178
The authors consider the length, l_N(π), of the longest increasing subsequence of a random permutation π of N numbers. The main result in this paper is a proof that the distribution function for l_N(π), suitably centered and scaled, converges to the Tracy–Widom distribution of the largest eigenvalue of a random GUE matrix. The authors also prove convergence of moments. The proof is based on the steepest descent method for Riemann–Hilbert problems, introduced by Deift and Zhou in 1993 in the context of integrable systems. The applicability of the Riemann–Hilbert technique depends, in turn, on the determinantal formula of Gessel for the Poissonization of the distribution function of l_N(π).
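The statistic itself is inexpensive to compute; a patience-sorting sketch for the length of the longest increasing subsequence (the standard O(N log N) algorithm, unrelated to the paper's asymptotic analysis):

```python
import bisect

def lis_length(perm):
    """Length of the longest increasing subsequence by patience
    sorting: tails[i] holds the smallest possible last element of an
    increasing subsequence of length i + 1."""
    tails = []
    for x in perm:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)      # extend the longest subsequence found
        else:
            tails[i] = x         # improve the tail for this length
    return len(tails)
```

Computing `lis_length` over many random permutations and rescaling gives a quick empirical check of the Tracy–Widom limit.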