首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We discuss methodology for multidimensional scaling (MDS) and its implementation in two software systems, GGvis and XGvis. MDS is a visualization technique for proximity data, that is, data in the form of N × N dissimilarity matrices. MDS constructs maps (“configurations,” “embeddings”) in IRk by interpreting the dissimilarities as distances. Two frequent sources of dissimilarities are high-dimensional data and graphs. When the dissimilarities are distances between high-dimensional objects, MDS acts as a (often nonlinear) dimension-reduction technique. When the dissimilarities are shortest-path distances in a graph, MDS acts as a graph layout technique. MDS has found recent attention in machine learning motivated by image databases (“Isomap”). MDS is also of interest in view of the popularity of “kernelizing” approaches inspired by Support Vector Machines (SVMs; “kernel PCA”).

This article discusses the following general topics: (1) the stability and multiplicity of MDS solutions; (2) the analysis of structure within and between subsets of objects with missing value schemes in dissimilarity matrices; (3) gradient descent for optimizing general MDS loss functions (“Strain” and “Stress”); (4) a unification of classical (Strain-based) and distance (Stress-based) MDS.

Particular topics include the following: (1) blending of automatic optimization with interactive displacement of configuration points to assist in the search for global optima; (2) forming groups of objects with interactive brushing to create patterned missing values in MDS loss functions; (3) optimizing MDS loss functions for large numbers of objects relative to a small set of anchor points (“external unfolding”); and (4) a non-metric version of classical MDS.

We show applications to the mapping of computer usage data, to the dimension reduction of marketing segmentation data, to the layout of mathematical graphs and social networks, and finally to the spatial reconstruction of molecules.  相似文献   

2.
Quasi-ultrametric multi-way dissimilarities and their respective sets of k-balls extend the fundamental bijection in classification between ultrametric pairwise dissimilarities and indexed hierarchies. We show that nonempty Galois closed subsets of a finite entity set coincide with k-balls of some quasi-ultrametric multi-way dissimilarity. This result relates the order theoretic Galois connection based clustering approach to the dissimilarity based one. Moreover, it provides an effective way to specify easy-to-interpret cluster systems, from complex data sets, as well as to derive informative attribute implications.  相似文献   

3.
The partitioning clustering is a technique to classify n objects into k disjoint clusters, and has been developed for years and widely used in many applications. In this paper, a new overlapping cluster algorithm is defined. It differs from traditional clustering algorithms in three respects. First, the new clustering is overlapping, because clusters are allowed to overlap with one another. Second, the clustering is non-exhaustive, because an object is permitted to belong to no cluster. Third, the goals considered in this research are the maximization of the average number of objects contained in a cluster and the maximization of the distances among cluster centers, while the goals in previous research are the maximization of the similarities of objects in the same clusters and the minimization of the similarities of objects in different clusters. Furthermore, the new clustering is also different from the traditional fuzzy clustering, because the object–cluster relationship in the new clustering is represented by a crisp value rather than that represented by using a fuzzy membership degree. Accordingly, a new overlapping partitioning cluster (OPC) algorithm is proposed to provide overlapping and non-exhaustive clustering of objects. Finally, several simulation and real world data sets are used to evaluate the effectiveness and the efficiency of the OPC algorithm, and the outcomes indicate that the algorithm can generate satisfactory clustering results.  相似文献   

4.
Clustering multimodal datasets can be problematic when a conventional algorithm such as k-means is applied due to its implicit assumption of Gaussian distribution of the dataset. This paper proposes a tandem clustering process for multimodal data sets. The proposed method first divides the multimodal dataset into many small pre-clusters by applying k-means or fuzzy k-means algorithm. These pre-clusters are then clustered again by agglomerative hierarchical clustering method using Kullback–Leibler divergence as an initial measure of dissimilarity. Benchmark results show that the proposed approach is not only effective at extracting the multimodal clusters but also efficient in computational time and relatively robust at the presence of outliers.  相似文献   

5.
This paper studies the behavior of teams competing within soccer national leagues. The dissimilarities between teams are measured using the match results at each round and that information feeds a multidimensional scaling (MDS) algorithm for visualizing teams’ performance. Data characterizing four European leagues during season 2014–2015 is adopted and processed using three distinct approaches. In the first, one dissimilarity matrix and one MDS map per round are generated. After, Procrustes analysis is applied to linearly transform the MDS charts for maximum superposition and to build one global MDS representation for the whole season. In the second approach, all data is combined into one dissimilarity matrix leading to a single global MDS chart. In the third approach, the results per round are used to generate time series for all teams. Then, the time series are compared, generating a dissimilarity matrix and the corresponding MDS map. In all cases, the points on the maps represent teams state up to a given round. The set of points corresponding to each team forms a locus representative of its performance versus time.  相似文献   

6.
In this paper we present a new method for clustering categorical data sets named CL.E.KMODES. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure, which is based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters. In particular, it compares each object with each mode, for every attribute that they have in common, and then chooses the most appropriate mode and its corresponding cluster for that object. Seven widely used data sets are tested to verify the robustness of the proposed method in six clustering evaluation measures.  相似文献   

7.
In the cluster analysis problem one seeks to partition a finite set of objects into disjoint groups (or clusters) such that each group contains relatively similar objects and, relatively dissimilar objects are placed in different groups. For certain classes of the problem or, under certain assumptions, the partitioning exercise can be formulated as a sequence of linear programs (LPs), each with a parametric objective function. Such LPs can be solved using the parametric linear programming procedure developed by Gass and Saaty [(Gass, S., Saaty, T. (1955), Naval Research Logistics Quarterly 2, 39–45)]. In this paper, a parametric linear programming model for solving cluster analysis problems is presented. We show how this model can be used to find optimal solutions for certain variations of the clustering problem or, in other cases, for an approximation of the general clustering problem.  相似文献   

8.
The aim of a spatial classification is to position the units on a spatial network and to give simultaneously a set of structured classes of these units “compatible” with the network. We introduce the basic needed definitions: compatibility between a classification structure and a tessellation, (m,k)-networks as a case of tessellation, convex, maximal and connected subsets in such networks, spatial pyramids and spatial hierarchies. As like Robinsonian dissimilarities induced by indexed pyramids generalize ultrametrics induced by indexed hierarchies we show that a new kind of dissimilarity called “Yadidean” induced by spatial pyramids generalize Robinsonian dissimilarities. We focus on spatial pyramids where each class is a convex for a grid, and we show that there are several one-to-one correspondences with different kinds of Yadidean dissimilarities. These new results produce also, as a special case, several one-to-one correspondences between spatial hierarchies (resp. standard indexed pyramids) and Yadidean ultrametrics (resp. Robinsonian) dissimilarities. Qualities of spatial pyramids and their supremum under a given dissimilarity are considered. We give a constructive algorithm for convex spatial pyramids illustrated by an example. We show finally by a simple example that spatial pyramids on symbolic data can produce a geometrical representation of conceptual lattices of “symbolic objects”.  相似文献   

9.
Sequential clustering aims at determining homogeneous and/or well-separated clusters within a given set of entities, one at a time, until no more such clusters can be found. We consider a bi-criterion sequential clustering problem in which the radius of a cluster (or maximum dissimilarity between an entity chosen as center and any other entity of the cluster) is chosen as a homogeneity criterion and the split of a cluster (or minimum dissimilarity between an entity in the cluster and one outside of it) is chosen as a separation criterion. An O(N 3) algorithm is proposed for determining radii and splits of all efficient clusters, which leads to an O(N 4) algorithm for bi-criterion sequential clustering with radius and split as criteria. This algorithm is illustrated on the well known Ruspini data set.  相似文献   

10.
In this paper, we propose a Ward-like hierarchical clustering algorithm including spatial/geographical constraints. Two dissimilarity matrices \(D_0\) and \(D_1\) are inputted, along with a mixing parameter \(\alpha \in [0,1]\). The dissimilarities can be non-Euclidean and the weights of the observations can be non-uniform. The first matrix gives the dissimilarities in the “feature space” and the second matrix gives the dissimilarities in the “constraint space”. The criterion minimized at each stage is a convex combination of the homogeneity criterion calculated with \(D_0\) and the homogeneity criterion calculated with \(D_1\). The idea is then to determine a value of \(\alpha \) which increases the spatial contiguity without deteriorating too much the quality of the solution based on the variables of interest i.e. those of the feature space. This procedure is illustrated on a real dataset using the R package ClustGeo.  相似文献   

11.
Sub-dominant theory provides efficient tools for clustering. However, it classically works only for ultrametrics and ad hoc extensions like Jardine and Sibson's 2-ultrametrics. In this paper we study the extension of the notion of sub-dominant to other distance models in classification accounting for overlapping clusters.We prove that a given dissimilarity admits one and only one lower-maximal quasi-ultrametric and one and only one lower-maximal weak k-ultrametric. In addition, we also prove the existence of (several) lower-maximal strongly Robinsonian dissimilarities. The construction of the lower-maximal weak k-ultrametric (for k=2) and quasi-ultrametric can be performed in polynomial time.  相似文献   

12.
A Structured Family of Clustering and Tree Construction Methods   总被引:1,自引:0,他引:1  
A cluster A is an Apresjan cluster if every pair of objects within A is more similar than either is to any object outside A. The criterion is intuitive, compelling, but often too restrictive for applications in classification. We therefore explore extensions of Apresjan clustering to a family of related hierarchical clustering methods. The extensions are shown to be closely connected with the well-known single and average linkage tree constructions. A dual family of methods for classification by splits is also presented. Splits are partitions of the set of objects into two disjoint blocks and are widely used in domains such as phylogenetics. Both the cluster and split methods give rise to progressively refined tree representations. We exploit dualities and connections between the various methods, giving polynomial time construction algorithms for most of the constructions and NP-hardness results for the rest.  相似文献   

13.
A set X is said to properly intersect a set Y if none of the sets XY, X?Y and Y?X is empty. In this paper, we consider collections of subsets such that each member of the collection properly intersects at most one other member. Such collections are hereafter called paired hierarchical collections. The two following combinatorial properties are investigated. First, any paired hierarchical collection is a set of intervals of at least one linear order defined on the ground set. Next, the maximum size of a paired hierarchical collection defined on an n-element set is . The properties of these collections are also investigated from the cluster analysis point of view. In the framework of the general bijection defined by Batbedat [Les isomorphismes HTS et HTE (après la bijection de Benzécri-Johnson), Metron 46 (1988) 47-59] and Bertrand [Set systems and dissimilarities, European J. Combin. 21 (2000) 727-743], we characterize the dissimilarities that are induced by weakly indexed paired hierarchical collections. Finally, we propose a proof of the so-called agglomerative paired hierarchical clustering (APHC) algorithm that extends the well-known AHC algorithm in order to allow that some clusters can be merged twice. An implementation and some illustrations of this algorithm and of a variant of it were presented by Chelcea et al. [A new agglomerative 2-3 hierarchical clustering algorithm, in: D. Baier, K.-D. Wernecke (Eds.), Innovations in Classification, Data Science, and Information Systems (GfKL 2003), Springer, Berlin, 2004, pp. 3-10 and Un Nouvel Algorithme de Classification Ascendante 2-3 Hiérarchique, in: Reconnaissance des Formes et Intelligence Artificielle (RFIA 2004), vol. 3, Toulouse, France, 2004, pp. 1471-1480. Available at 〈http://www.laas.fr/rfia2004/actes/ARTICLES/388.pdf〉].  相似文献   

14.
To find optimal clusters of functional objects in a lower-dimensional subspace of data, a sequential method called tandem analysis, is often used, though such a method is problematic. A new procedure is developed to find optimal clusters of functional objects and also find an optimal subspace for clustering, simultaneously. The method is based on the k-means criterion for functional data and seeks the subspace that is maximally informative about the clustering structure in the data. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized method. Analyses of artificial and real data examples demonstrate that the proposed method gives correct and interpretable results.  相似文献   

15.
《Fuzzy Sets and Systems》2007,158(19):2095-2117
Cluster analysis aims at identifying groups of similar objects, and helps to discover distribution of patterns and interesting correlations in large data sets. Especially, fuzzy clustering has been widely studied and applied in a variety of key areas and fuzzy cluster validation plays a very important role in fuzzy clustering. This paper introduces the fundamental concepts of cluster validity, and presents a review of fuzzy cluster validity indices available in the literature. We conducted extensive comparisons of the mentioned indices in conjunction with the Fuzzy C-Means clustering algorithm on a number of widely used data sets, and make a simple analysis of the experimental results.  相似文献   

16.
Traditional c-means clustering partitions a group of objects into a number of non-overlapping sets. Rough sets provide more flexible and objective representation than classical sets with hard partition and fuzzy sets with subjective membership function for a given dataset. Rough c-means clustering and its extensions were introduced and successfully applied in many real life applications in recent years. Each cluster is represented by a reasonable pair of lower and upper approximations. However, the most available algorithms pay no attention to the influence of the imbalanced spatial distribution within a cluster. The limitation of the mean iterative calculation function, with the same weight for all the data objects in a lower or upper approximation, is analyzed. A hybrid imbalanced measure of distance and density for the rough c-means clustering is defined, and a modified rough c-means clustering algorithm is presented in this paper. To evaluate the proposed algorithm, it has been applied to several real world data sets from UCI. The validity of this algorithm is demonstrated by the results of comparative experiments.  相似文献   

17.
Fuzzy c-means clustering algorithm (FCM) can provide a non-parametric and unsupervised approach to the cluster analysis of data. Several efforts of fuzzy clustering have been undertaken by Bezdek and other researchers. Earlier studies in this field have reported problems due to the setting of optimum initial condition, cluster validity measure, and high computational load. More recently, the fuzzy clustering has benefited of a synergistic approach with Genetic Algorithms (GA) that play the role of an useful optimization technique that helps to better tolerate some classical drawbacks, such as sensitivity to initialization, noise and outliers, and susceptibility to local minima. We propose a genetic-level clustering methodology able to cluster objects represented by R p spaces. The unsupervised cluster algorithm, called SFCM (Spatial Fuzzy c-Means), is based on a fuzzy clustering c-means method that searches the best fuzzy partition of the universe assuming that the evaluation of each object with respect to some features is unknown, but knowing that it belongs to circular regions of R 2 space. Next we present a Java implementation of the algorithm, which provides a complete and efficient visual interaction for the setting of the parameters involved into the system. To demonstrate the applications of SFCM, we discuss a case study where it is shown the generality of our model by treating a simple 3-way data fuzzy clustering as example of a multicriteria optimization problem.  相似文献   

18.
Cluster collections obtained within the framework of most cluster structures studied in data analysis and classification are essentially Moore families. In this paper, we propose a simple intuitive necessary and sufficient condition for some subset of objects to be a critical set of a finite Moore family. This condition is based on a new characterization of quasi-closed sets. Moreover, we provide a necessary condition for a subset containing more than k objects (k ≥ 2) to be a critical set of a k-weakly hierarchical Moore family. Finally, as a consequence of this result, we identify critical sets of some k-weakly hierarchical Moore families and thereby generalize a result earlier obtained by Domenach and Leclerc in the particular case of weak hierarchies.  相似文献   

19.
The classification of short-term hospitals into homogeneous groups has become an integral part of many systems designed to abate continuing cost inflation in the hospital industry. This paper describes one approach which was developed to identify homogeneous groups of short-term hospitals. The approach, based on hierarchical cluster analysis, defines an objective measure (called expected distinctiveness) to evaluate any group of hospitals identified by a hierarchical grouping structure or dendrogram. Using this measure, an efficient algorithm is developed which finds the hospital partition from the identified groups which maximizes total expected distinctiveness. A numerical example illustrates the application and extensions.  相似文献   

20.
Two robustness criteria are presented that are applicable to general clustering methods. Robustness and stability in cluster analysis are not only data dependent, but even cluster dependent. Robustness is in the present paper defined as a property of not only the clustering method, but also of every individual cluster in a data set. The main principles are: (a) dissimilarity measurement of an original cluster with the most similar cluster in the induced clustering obtained by adding data points, (b) the dissolution point, which is an adaptation of the breakdown point concept to single clusters, (c) isolation robustness: given a clustering method, is it possible to join, by addition of g points, arbitrarily well separated clusters?Results are derived for k-means, k-medoids (k estimated by average silhouette width), trimmed k-means, mixture models (with and without noise component, with and without estimation of the number of clusters by BIC), single and complete linkage.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号