Similar Documents
20 similar documents found (search time: 78 ms)
1.
Fitting semiparametric clustering models to dissimilarity data   (Total citations: 1; self-citations: 0; citations by others: 1)
The cluster analysis problem of partitioning a set of objects from dissimilarity data is handled here with the statistical model-based approach of fitting the “closest” classification matrix to the observed dissimilarities. A classification matrix represents a clustering structure expressed in terms of dissimilarities. Cluster analysis lacks widely used methodologies for directly partitioning a set of objects from dissimilarity data. In real applications, a hierarchical clustering algorithm is applied to the dissimilarities and a partition is subsequently chosen by visual inspection of the dendrogram. Alternatively, a “tandem analysis” is used: a multidimensional scaling (MDS) algorithm is applied first, and a partitioning algorithm such as k-means is then run on the dimensions produced by the MDS. However, neither hierarchical clustering algorithms nor tandem analysis is specifically designed to solve the statistical problem of fitting the closest partition to the observed dissimilarities. This lack of appropriate methodologies motivates this paper: in particular, the introduction and study of three new object partitioning models for dissimilarity data, their least-squares estimation, and three new fast algorithms.
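The “tandem analysis” described above can be sketched in a few lines. The following is a minimal illustration (not the paper's proposed models), using Torgerson's classical MDS and a bare-bones k-means; all function names are our own:

```python
import numpy as np

def classical_mds(D, k=2):
    """Torgerson's classical MDS: embed an n x n dissimilarity matrix D
    into k dimensions via double centering and eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]                # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

def kmeans(X, k, iters=50, seed=0):
    """A bare-bones Lloyd's k-means, enough for the tandem step."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(0) if (lab == j).any() else C[j]
                      for j in range(k)])
    return lab

# Two well-separated point clouds, observed only through their distances.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, .1, (10, 2)), rng.normal(5, .1, (10, 2))])
D = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
labels = kmeans(classical_mds(D, 2), 2)
```

This two-stage pipeline recovers the obvious two-group structure here, but, as the abstract notes, it does not directly optimize the fit of a partition to the observed dissimilarities.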

2.
This paper studies the behavior of teams competing in national soccer leagues. The dissimilarities between teams are measured using the match results at each round, and that information feeds a multidimensional scaling (MDS) algorithm for visualizing the teams’ performance. Data characterizing four European leagues during the 2014–2015 season are adopted and processed using three distinct approaches. In the first, one dissimilarity matrix and one MDS map are generated per round. Procrustes analysis is then applied to linearly transform the MDS charts for maximum superposition and to build one global MDS representation for the whole season. In the second approach, all data are combined into one dissimilarity matrix, leading to a single global MDS chart. In the third approach, the results per round are used to generate time series for all teams. The time series are then compared, generating a dissimilarity matrix and the corresponding MDS map. In all cases, the points on the maps represent the teams’ state up to a given round. The set of points corresponding to each team forms a locus representative of its performance over time.
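The Procrustes superposition step in the first approach can be sketched as follows; this is a generic orthogonal Procrustes alignment (rotation/reflection only), not the authors' exact pipeline:

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal Procrustes: rotate/reflect configuration B onto A
    (both n x k, assumed centered), minimizing ||A - B R||_F."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    return B @ (U @ Vt)

# Two MDS runs can return the same configuration up to rotation/reflection.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2)); A -= A.mean(0)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = A @ R                       # a rotated copy of the same configuration
resid = np.linalg.norm(A - procrustes_align(A, B))
```

After alignment, per-round maps can be overlaid so that each team's points trace a comparable locus across rounds.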

3.
The Quadratic Assignment Problem (QAP) is known as one of the most difficult problems in combinatorial optimization. It is used to model many practical problems, including various layout problems. The main topic of this paper is to provide methods for checking whether a particular instance of the QAP is a layout problem. An instance is a layout problem if the distances between the objects can be reconstructed in the plane and/or in 3-dimensional space. A new mixed integer programming model is suggested for the case in which the distances between the objects are assumed to be rectilinear. If the distances are Euclidean, then the well-known multidimensional scaling (MDS) method from statistics is suggested for reconstruction purposes. The well-known difficulty of the QAP makes it a popular and suitable experimental field for many algorithmic ideas, including artificial intelligence methods. Such results are sometimes published as layout problems. The reconstruction methods can be used to decide whether the topic of a paper is a layout problem or only a general QAP. The question of what the OR community should expect from AI-based algorithms is also addressed.
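The Euclidean reconstruction check can be sketched via Torgerson double centering: the number of significantly positive eigenvalues of the centered matrix is the minimal Euclidean embedding dimension, so a planar instance shows exactly two. A minimal illustration (function name ours):

```python
import numpy as np

def embedding_dimension(D, tol=1e-8):
    """Smallest Euclidean embedding dimension of a distance matrix D,
    read off as the number of significantly positive eigenvalues of
    the double-centered matrix B = -J D^2 J / 2 (Torgerson)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w = np.linalg.eigvalsh(B)
    return int(np.sum(w > tol * max(w.max(), 1.0)))

rng = np.random.default_rng(2)
pts2 = rng.normal(size=(12, 2))            # a genuinely planar instance
pts3 = rng.normal(size=(12, 3))            # a genuinely 3-D instance
D2 = np.linalg.norm(pts2[:, None] - pts2[None], axis=-1)
D3 = np.linalg.norm(pts3[:, None] - pts3[None], axis=-1)
```

A QAP instance whose distance matrix passes this test with dimension 2 or 3 is a candidate layout problem in the paper's sense.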

4.
Abstract

Projection pursuit describes a procedure for searching high-dimensional data for “interesting” low-dimensional projections via the optimization of a criterion function called the projection pursuit index. By empirically examining the optimization process for several projection pursuit indexes, we observed differences in the types of structure that maximized each index. We were especially curious about differences between two indexes based on expansions in terms of orthogonal polynomials: the Legendre index and the Hermite index. Being fast to compute, these indexes are ideally suited for dynamic graphics implementations.

Both Legendre and Hermite indexes are weighted L2 distances between the density of the projected data and a standard normal density. A general form for this type of index is introduced that encompasses both indexes. The form clarifies the effects of the weight function on the index's sensitivity to differences from normality, highlighting some conceptual problems with the Legendre and Hermite indexes. A new index, called the Natural Hermite index, which alleviates some of these problems, is introduced.

A polynomial expansion of the data density reduces the form of the index to a sum of squares of the coefficients used in the expansion. This drew our attention to examining these coefficients as indexes in their own right. We found that the first two coefficients, and the lowest-order indexes produced by them, are the most useful ones for practical data exploration because they respond to structure that can be analytically identified, and because they have “long-sighted” vision that enables them to “see” large structure from a distance. Complementing this low-order behavior, the higher-order indexes are “short-sighted.” They are able to see intricate structure, but only when they are close to it.

We also show some practical uses of projection pursuit using the polynomial indexes, including a discovery of previously unseen structure in a set of telephone usage data, and two cautionary examples which illustrate that the structure found is not always meaningful.
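As a loose illustration of the projection pursuit idea, the following sketch grid-searches directions in the plane and scores each projection with a crude moment-based non-normality index; this is a stand-in, not the Legendre, Hermite, or Natural Hermite index discussed above, and all names are ours:

```python
import numpy as np

def moment_index(y):
    """Crude non-normality index for a 1-D projection: squared skewness
    plus squared excess kurtosis (zero for exact normality)."""
    z = (y - y.mean()) / y.std()
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4) - 3.0
    return skew ** 2 + kurt ** 2

def pursue(X, n_angles=180):
    """Grid-search projection pursuit over directions in the plane."""
    angles = np.linspace(0, np.pi, n_angles, endpoint=False)
    return max((np.array([np.cos(t), np.sin(t)]) for t in angles),
               key=lambda a: moment_index(X @ a))

# Bimodal structure along the x-axis, plain Gaussian noise on y.
rng = np.random.default_rng(3)
X = np.column_stack([np.concatenate([rng.normal(-3, 1, 200),
                                     rng.normal(3, 1, 200)]),
                     rng.normal(0, 1, 400)])
a = pursue(X)
```

The chosen direction points (up to noise) along the bimodal axis, the kind of “structure” a pursuit index is meant to reward.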

5.
This paper presents three methods for analyzing multiple dissimilarity matrices. First, a graphical representation is found that gives an overall view of all the dissimilarity matrices; next, a new dissimilarity matrix is defined, which can be regarded as a compromise among all the original dissimilarity matrices; finally, a MIMU method is proposed. The coordinate plots obtained by the above methods are also compared.

6.
A dimension reduction method based on the “Nonlinear Level set Learning” (NLL) approach is presented for the pointwise prediction of functions which have been sparsely sampled. Leveraging geometric information provided by the Implicit Function Theorem, the proposed algorithm effectively reduces the input dimension to the theoretical lower bound with minor accuracy loss, providing a one-dimensional representation of the function which can be used for regression and sensitivity analysis. Experiments and applications are presented which compare this modified NLL with the original NLL and the Active Subspaces (AS) method. While accommodating sparse input data, the proposed algorithm is shown to train quickly and provide a much more accurate and informative reduction than either AS or the original NLL on two example functions with high-dimensional domains, as well as two state-dependent quantities depending on the solutions to parametric differential equations.
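The Active Subspaces (AS) baseline mentioned above can be sketched briefly: eigen-decompose the sampled gradient outer-product matrix and keep the leading eigenvectors as the reduced input directions. A minimal sketch (function names ours), using a ridge function whose gradients are known analytically:

```python
import numpy as np

def active_subspace(grad_f, X):
    """Active Subspaces sketch: eigendecompose C ~ E[grad f grad f^T]
    estimated from sampled gradients; leading eigenvectors span the
    directions along which f varies most."""
    G = np.array([grad_f(x) for x in X])
    C = G.T @ G / len(X)
    w, V = np.linalg.eigh(C)
    order = np.argsort(w)[::-1]
    return w[order], V[:, order]

# f(x) = sin(w.x) varies only along w, so the active subspace is span{w}.
w_true = np.array([3.0, 4.0, 0.0, 0.0]) / 5.0
grad = lambda x: np.cos(w_true @ x) * w_true     # gradient of sin(w.x)
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
eigvals, eigvecs = active_subspace(grad, X)
```

For such a ridge function the gradient matrix is exactly rank one, so the spectrum collapses after the first eigenvalue and the leading eigenvector recovers the hidden direction.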

7.
A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice.

An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection; for example, the Addcl1 RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables.

We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
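The synthetic-data step behind the Addcl1 dissimilarity can be illustrated with a common permutation variant: sample from the product of the empirical marginals by permuting each column independently, which keeps every marginal but destroys dependence between variables. This is a sketch of the reference-distribution construction only, not the full RF dissimilarity:

```python
import numpy as np

def addcl1_synthetic(X, seed=0):
    """Addcl1-style reference data: permute each column independently,
    preserving every marginal while breaking dependence between
    variables. An RF trained to separate X from this synthetic sample
    yields the unsupervised RF dissimilarity."""
    rng = np.random.default_rng(seed)
    return np.column_stack([rng.permutation(X[:, j])
                            for j in range(X.shape[1])])

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
X[:, 1] = X[:, 0] * 2            # two perfectly dependent columns
S = addcl1_synthetic(X)
```

Observed rows keep their joint structure; synthetic rows look like the same variables drawn independently, so any classifier that separates the two has learned the dependence structure.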

8.
Abstract

We are interested in the exploratory analysis of large collections of complex objects. As an example, we are studying a large collection of digital images that has nearly 30,000 members. We regard each image in the collection as an individual observation. To facilitate our study we construct an index of the images in the collection. The index uses a small copy of each image (an icon or a “thumbnail”) to represent the full-size version. A large number of these thumbnails are laid out in a workstation window. We can interactively arrange and rearrange the thumbnails within the window. For example, we can sort the thumbnails by the values of a function computed from them or by the values of data associated with each of them. By the use of specialized equipment (a single-frame video disk recorder/player), we can instantly access any individual full-size image in the collection as a video image. We regard our software as an early development of statistical exploratory tools for studying collections of images and other complex objects in the same way we routinely study batches of numbers. We expect that the concept of a visual index will extend to other collections of complex objects besides images, for example, time series, functions, and text.

9.
The aim of spatial classification is to position the units on a spatial network and simultaneously to give a set of structured classes of these units “compatible” with the network. We introduce the basic definitions needed: compatibility between a classification structure and a tessellation, (m,k)-networks as a case of tessellation, convex, maximal and connected subsets in such networks, spatial pyramids and spatial hierarchies. Just as Robinsonian dissimilarities induced by indexed pyramids generalize ultrametrics induced by indexed hierarchies, we show that a new kind of dissimilarity, called “Yadidean,” induced by spatial pyramids generalizes Robinsonian dissimilarities. We focus on spatial pyramids where each class is convex for a grid, and we show that there are several one-to-one correspondences with different kinds of Yadidean dissimilarities. These new results also produce, as special cases, several one-to-one correspondences between spatial hierarchies (resp. standard indexed pyramids) and Yadidean ultrametrics (resp. Robinsonian dissimilarities). Qualities of spatial pyramids and of their supremum under a given dissimilarity are considered. We give a constructive algorithm for convex spatial pyramids, illustrated by an example. We show finally, by a simple example, that spatial pyramids on symbolic data can produce a geometrical representation of conceptual lattices of “symbolic objects”.
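The Robinson property that Yadidean dissimilarities generalize can be checked directly: in a compatible order, entries must never decrease when moving away from the diagonal along any row. A minimal sketch (function name ours):

```python
import numpy as np

def is_robinsonian(D, tol=1e-12):
    """Check the Robinson property of a dissimilarity matrix in its
    given order: along each row, entries are non-decreasing moving
    away from the diagonal in either direction."""
    n = D.shape[0]
    for i in range(n):
        if np.any(np.diff(D[i, i:]) < -tol):      # rightwards from diagonal
            return False
        if np.any(np.diff(D[i, :i + 1]) > tol):   # leftwards towards diagonal
            return False
    return True

# Line distances are Robinsonian in the natural order; permuting the
# order breaks the property.
D_good = np.abs(np.subtract.outer(np.arange(5), np.arange(5))).astype(float)
perm = [0, 2, 1, 3, 4]
D_bad = D_good[np.ix_(perm, perm)]
```

Indexed pyramids induce exactly the dissimilarities passing this test in some order of the units; the spatial pyramids above extend this to grid-compatible orders.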

10.
The Graph Level Order Unary Degree Sequence (GLOUDS) is a new succinct data structure for directed graphs that are “tree-like,” in the sense that the number of “additional” edges (w.r.t. a spanning tree) is not too high. The algorithmic idea is to represent a BFS spanning tree of the graph (consisting of n nodes) with a well-known succinct data structure for trees, named LOUDS, and enhance it with additional information that accounts for the non-tree edges. In practical tests, our data structure performs well for graphs containing up to m=5n edges, while still having competitive running times for listing adjacent nodes.
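The LOUDS component that GLOUDS builds on can be sketched as a BFS bitvector: visit nodes in level order, emitting one '1' per child followed by a '0' for each node (real implementations add o(n)-bit rank/select support for navigation, omitted here):

```python
from collections import deque

def louds_encode(children, root=0):
    """LOUDS bitstring for an ordered tree: the conventional "10"
    super-root, then, in BFS order, '1' * degree(v) + '0' per node.
    Returns the bitstring and the BFS node order."""
    bits, order = "10", []
    q = deque([root])
    while q:
        v = q.popleft()
        order.append(v)
        bits += "1" * len(children[v]) + "0"
        q.extend(children[v])
    return bits, order

# A 5-node tree: 0 -> {1, 2}, 1 -> {3, 4}.
tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
bits, bfs = louds_encode(tree)
```

The encoding uses 2n + 1 bits for an n-node tree, and rank/select on the bitvector supports parent/child navigation in constant time in the full structure.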

11.
Displaying the component-wise between-group differences in high-dimensional datasets is problematic because widely used plots such as Bland–Altman and volcano plots do not show what they are colloquially believed to show. Thus, it is difficult for the experimentalist to grasp why the between-group difference of one component is “significant” while that of another component is not. Here, we propose a type of “effect plot” that displays between-group differences in relation to the respective underlying variability for every component of a high-dimensional dataset. We use synthetic data to show that such a plot captures the essence of what determines “significance” for between-group differences in each component, and we provide guidance on the interpretation of the plot. Supplementary online materials contain the code and data for this article and include simple R functions to produce an effect plot from suitable datasets.
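One quantity such a plot can display, the between-group mean difference relative to the pooled variability of each component, can be sketched as follows (a generic standardized difference, not necessarily the authors' exact statistic; function name ours):

```python
import numpy as np

def effect_sizes(A, B):
    """Per-component standardized between-group difference:
    mean difference divided by the pooled standard deviation,
    computed for every column of two group matrices A and B."""
    nA, nB = len(A), len(B)
    diff = A.mean(0) - B.mean(0)
    pooled = np.sqrt(((nA - 1) * A.var(0, ddof=1) +
                      (nB - 1) * B.var(0, ddof=1)) / (nA + nB - 2))
    return diff / pooled

# Synthetic data: 100 components, only component 0 truly shifted.
rng = np.random.default_rng(6)
A = rng.normal(0, 1, (50, 100))
B = rng.normal(0, 1, (50, 100))
B[:, 0] += 2.0
d = effect_sizes(A, B)
```

Plotting each component's mean difference against its underlying variability makes it visible why the shifted component is “significant” while the others are not.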

12.
Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to specify, for each of J variables with missing values, a univariate conditional distribution given all other variables, and then to draw imputations by iterating over the J conditional distributions. Such fully conditional imputation strategies have the theoretical drawback that the conditional distributions may be incompatible. When the missingness pattern is monotone, a theoretically valid approach is to specify, for each variable with missing values, a conditional distribution given the variables with fewer or the same number of missing values and sequentially draw from these distributions. In this article, we propose the “multiple imputation by ordered monotone blocks” approach, which combines these two basic approaches by decomposing any missingness pattern into a collection of smaller “constructed” monotone missingness patterns, and iterating. We apply this strategy to impute the missing data in the AVRP interim data. Supplemental materials, including all source code and a synthetic example dataset, are available online.
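The monotonicity notion underlying ordered monotone blocks can be checked as follows: order variables by their number of missing values and verify that each observation's missingness forms a suffix of that order. This is a simplified sketch of the pattern test only; the paper decomposes general patterns into monotone blocks rather than testing a single one:

```python
import numpy as np

def is_monotone(M):
    """True if the missingness pattern (True = missing) is monotone
    once variables are ordered by missing count: in every row, after
    the first missing value, all later (more-missing) variables are
    missing too."""
    Ms = M[:, np.argsort(M.sum(0))]      # fewest missing values first
    return not np.any(np.diff(Ms.astype(int), axis=1) < 0)

M_mono = np.array([[0, 0, 0],
                   [0, 0, 1],
                   [0, 1, 1],
                   [1, 1, 1]], dtype=bool)       # a staircase pattern
M_not = np.array([[0, 1, 0],
                  [1, 0, 1]], dtype=bool)        # no monotone ordering
```

When this test passes, the sequential conditional-draw strategy described above is directly applicable; otherwise the pattern must first be decomposed.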

13.
Methods for analyzing or learning from “fuzzy data” have attracted increasing attention in recent years. In many cases, however, existing methods (for precise, non-fuzzy data) are extended to the fuzzy case in an ad-hoc manner, and without carefully considering the interpretation of a fuzzy set when being used for modeling data. Distinguishing between an ontic and an epistemic interpretation of fuzzy set-valued data, and focusing on the latter, we argue that a “fuzzification” of learning algorithms based on an application of the generic extension principle is not appropriate. In fact, the extension principle fails to properly exploit the inductive bias underlying statistical and machine learning methods, although this bias, at least in principle, offers a means for “disambiguating” the fuzzy data. Alternatively, we therefore propose a method which is based on the generalization of loss functions in empirical risk minimization, and which performs model identification and data disambiguation simultaneously. Elaborating on the fuzzification of specific types of losses, we establish connections to well-known loss functions in regression and classification. We compare our approach with related methods and illustrate its use in logistic regression for binary classification.

14.
The joint optimization of fidelity and commensurability (JOFC) manifold matching methodology embeds an omnibus dissimilarity matrix consisting of multiple dissimilarities on the same set of objects. One approach to this embedding optimizes the preservation of fidelity to each individual dissimilarity matrix together with commensurability of each given observation across modalities via iterative majorization of a raw stress error criterion by successive Guttman transforms. In this article, we exploit the special structure inherent to JOFC to exactly and efficiently compute the successive Guttman transforms, and as a result we are able to greatly speed up the JOFC procedure for both in-sample and out-of-sample embedding. We demonstrate the scalability of our implementation on both real and simulated data examples.
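The Guttman transform at the core of raw-stress majorization can be sketched for plain MDS with unit weights; the article's contribution, exploiting the JOFC omnibus structure to compute these transforms exactly and efficiently, is not reproduced here:

```python
import numpy as np

def stress(X, Delta):
    """Raw stress: sum of squared residuals between target
    dissimilarities Delta and configuration distances."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    return ((Delta - D)[np.triu_indices(len(X), 1)] ** 2).sum()

def guttman(X, Delta):
    """One Guttman transform (unit weights): the SMACOF majorization
    update X+ = B(X) X / n that never increases raw stress."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        B = -np.where(D > 0, Delta / D, 0.0)
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(1))
    return B @ X / n

# Recover a planar configuration from its distances by iterating.
rng = np.random.default_rng(7)
truth = rng.normal(size=(10, 2))
Delta = np.linalg.norm(truth[:, None] - truth[None], axis=-1)
X = rng.normal(size=(10, 2))            # random starting configuration
s0 = stress(X, Delta)
for _ in range(50):
    X = guttman(X, Delta)
s1 = stress(X, Delta)
```

Each transform is guaranteed not to increase the stress; JOFC applies the same update to an omnibus matrix, which is what the structured speed-up targets.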

15.
16.
In this paper, we propose a Ward-like hierarchical clustering algorithm that includes spatial/geographical constraints. Two dissimilarity matrices \(D_0\) and \(D_1\) are given as input, along with a mixing parameter \(\alpha \in [0,1]\). The dissimilarities can be non-Euclidean and the weights of the observations can be non-uniform. The first matrix gives the dissimilarities in the “feature space” and the second matrix gives the dissimilarities in the “constraint space”. The criterion minimized at each stage is a convex combination of the homogeneity criterion calculated with \(D_0\) and the homogeneity criterion calculated with \(D_1\). The idea is then to determine a value of \(\alpha \) that increases the spatial contiguity without deteriorating too much the quality of the solution based on the variables of interest, i.e., those of the feature space. This procedure is illustrated on a real dataset using the R package ClustGeo.
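The convex combination of homogeneity criteria can be sketched using pseudo-inertias computed from the dissimilarity matrices alone (uniform observation weights assumed; function names ours):

```python
import numpy as np

def pseudo_inertia(D, idx):
    """Within-cluster pseudo-inertia from a dissimilarity matrix only
    (uniform weights): sum of squared dissimilarities over twice the
    cluster size."""
    sub = D[np.ix_(idx, idx)]
    return (sub ** 2).sum() / (2 * len(idx))

def mixed_criterion(D0, D1, partition, alpha):
    """Convex combination of the feature-space (D0) and
    constraint-space (D1) homogeneity criteria for a partition."""
    h0 = sum(pseudo_inertia(D0, c) for c in partition)
    h1 = sum(pseudo_inertia(D1, c) for c in partition)
    return (1 - alpha) * h0 + alpha * h1

# Four units whose feature pairing and geographic pairing disagree.
feat = np.array([0., 0., 10., 10.])       # features pair {0,1} and {2,3}
geo = np.array([0., 10., 0., 10.])        # geography pairs {0,2} and {1,3}
D0 = np.abs(np.subtract.outer(feat, feat))
D1 = np.abs(np.subtract.outer(geo, geo))
P_feat = [[0, 1], [2, 3]]
P_geo = [[0, 2], [1, 3]]
```

At \(\alpha = 0\) the feature partition wins; at \(\alpha = 1\) the geographic one does; intermediate values trade feature homogeneity for spatial contiguity, which is exactly the tuning problem the procedure addresses.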

17.
Estimating the number of clusters is one of the most difficult problems in cluster analysis. Most previous approaches require knowing the data matrix and may not work when only a Euclidean distance matrix is available. Other approaches suffer from the curse of dimensionality and work poorly in high dimensions. In this article, we develop a new statistic, called the GUD statistic, based on the idea of the Gap method but using the determinant of the pooled within-group scatter matrix instead of the within-cluster sum of squared distances. Some theory is developed to show that this statistic can work well when only the Euclidean distance matrix is known. More generally, the statistic can even work for any dissimilarity matrix that satisfies certain properties. We also propose a modification for high-dimensional datasets, called the R-GUD statistic, which gives a robust estimate in high-dimensional settings. Simulations show that our method needs less information but is generally more accurate and robust than the other methods considered in the study, especially in many difficult settings.
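The determinant-based quantity behind the GUD statistic can be sketched from a data matrix; the paper's point is that it is also computable when only a Euclidean distance matrix is available, which this sketch does not show:

```python
import numpy as np

def log_det_pooled_scatter(X, labels):
    """Log-determinant of the pooled within-group scatter matrix
    W = sum over groups of the centered cross-product, the quantity
    the GUD statistic builds on in place of the within-cluster sum of
    squared distances."""
    p = X.shape[1]
    W = np.zeros((p, p))
    for g in np.unique(labels):
        Z = X[labels == g] - X[labels == g].mean(0)
        W += Z.T @ Z
    return np.linalg.slogdet(W)[1]

# Two well-separated clusters: the correct 2-group partition yields a
# far smaller pooled within-scatter determinant than lumping them.
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(8, 1, (30, 3))])
two_groups = np.repeat([0, 1], 30)
one_group = np.zeros(60, dtype=int)
```

A Gap-style comparison of this quantity against a reference distribution over candidate k is the shape of the resulting estimator.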

18.
For many systems characterized as “complex” the patterns exhibited on different scales differ markedly from one another. For example, the biomass distribution in a human body “looks very different” depending on the scale at which one examines it. Conversely, the patterns at different scales in “simple” systems (e.g., gases, mountains, crystals) vary little from one scale to another. Accordingly, the degrees of self-dissimilarity between the patterns of a system at various scales constitute a complexity “signature” of that system. Here we present a novel quantification of self-dissimilarity. This signature can, if desired, incorporate a novel information-theoretic measure of the distance between probability distributions that we derive here. Whatever distance measure is chosen, our quantification of self-dissimilarity can be measured for many kinds of real-world data. This allows comparisons of the complexity signatures of wholly different kinds of systems (e.g., systems involving information density in a digital computer vs. species densities in a rain forest vs. capital density in an economy, etc.). Moreover, in contrast to many other suggested complexity measures, evaluating the self-dissimilarity of a system does not require one to already have a model of the system. These facts may allow self-dissimilarity signatures to be used as the underlying observational variables of an eventual overarching theory relating all complex systems. To illustrate self-dissimilarity, we present several numerical experiments. In particular, we show that the underlying structure of the logistic map is picked out by the self-dissimilarity signature of time series produced by that map. © 2007 Wiley Periodicals, Inc. Complexity 12: 77–85, 2007

19.
Quasi-ultrametric multi-way dissimilarities and their respective sets of k-balls extend the fundamental bijection in classification between ultrametric pairwise dissimilarities and indexed hierarchies. We show that nonempty Galois closed subsets of a finite entity set coincide with k-balls of some quasi-ultrametric multi-way dissimilarity. This result relates the order theoretic Galois connection based clustering approach to the dissimilarity based one. Moreover, it provides an effective way to specify easy-to-interpret cluster systems, from complex data sets, as well as to derive informative attribute implications.

20.
Inference for spatial generalized linear mixed models (SGLMMs) for high-dimensional non-Gaussian spatial data is computationally intensive. The computational challenge is due to the high-dimensional random effects and to the fact that Markov chain Monte Carlo (MCMC) algorithms for these models tend to be slow mixing. Moreover, spatial confounding inflates the variance of fixed-effect (regression coefficient) estimates. Our approach addresses both the computational and confounding issues by replacing the high-dimensional spatial random effects with a reduced-dimensional representation based on random projections. Standard MCMC algorithms mix well, and the reduced-dimensional setting speeds up computation per iteration. We show, via simulated examples, that Bayesian inference for this reduced-dimensional approach works well in terms of both inference and prediction; our methods also compare favorably to existing “reduced-rank” approaches. We also apply our methods to two real-world data examples, one on bird count data and the other on classifying rock types. Supplementary material for this article is available online.
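The random-projection idea can be illustrated with a generic randomized range finder applied to a spatial covariance matrix; this is a standard sketch of reduced-rank approximation, not the authors' full SGLMM algorithm:

```python
import numpy as np

def randomized_lowrank(C, m, seed=0):
    """Randomized range finder: project C onto m random directions,
    orthonormalize, and compress. Returns Q (n x m) and the small
    core matrix so that C ~ Q @ core @ Q.T, a reduced-dimensional
    stand-in for high-dimensional spatial random effects."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(C @ rng.normal(size=(C.shape[0], m)))
    return Q, Q.T @ C @ Q

# A smooth exponential-kernel covariance on a 1-D grid decays quickly
# in rank, so a small m already captures it well.
s = np.linspace(0, 1, 200)
C = np.exp(-np.abs(s[:, None] - s[None]) / 0.5)
Q, core = randomized_lowrank(C, 20)
err = np.linalg.norm(C - Q @ core @ Q.T) / np.linalg.norm(C)
```

Working with the 20-dimensional representation instead of the full 200-dimensional one is what makes each MCMC iteration cheap while the projection preserves most of the spatial structure.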


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号