首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Basing cluster analysis on mixture models has become a classical and powerful approach. It enables some classical criteria such as the well-known k-means criterion to be explained. To classify the rows or the columns of a contingency table, an adapted version of k-means known as Mndki2, which uses the chi-square distance, can be used. Unfortunately, this simple, effective method which can be used jointly with correspondence analysis based on the same representation of the data, cannot be associated with a mixture model in the same way as the classical k-means algorithm. In this paper we show that the Mndki2 algorithm can be viewed as an approximation of a classifying version of the EM algorithm for a mixture of multinomial distributions. A comparison of the algorithms belonging in this context are experimentally investigated using Monte Carlo simulations.  相似文献   

2.
《Journal of Complexity》2002,18(1):375-391
The process of partitioning a large set of patterns into disjoint and homogeneous clusters is fundamental in knowledge acquisition. It is called Clustering in the literature and it is applied in various fields including data mining, statistical data analysis, compression and vector quantization. The k-means is a very popular algorithm and one of the best for implementing the clustering process. The k-means has a time complexity that is dominated by the product of the number of patterns, the number of clusters, and the number of iterations. Also, it often converges to a local minimum. In this paper, we present an improvement of the k-means clustering algorithm, aiming at a better time complexity and partitioning accuracy. Our approach reduces the number of patterns that need to be examined for similarity, in each iteration, using a windowing technique. The latter is based on well known spatial data structures, namely the range tree, that allows fast range searches.  相似文献   

3.
Clustering multimodal datasets can be problematic when a conventional algorithm such as k-means is applied due to its implicit assumption of Gaussian distribution of the dataset. This paper proposes a tandem clustering process for multimodal data sets. The proposed method first divides the multimodal dataset into many small pre-clusters by applying k-means or fuzzy k-means algorithm. These pre-clusters are then clustered again by agglomerative hierarchical clustering method using Kullback–Leibler divergence as an initial measure of dissimilarity. Benchmark results show that the proposed approach is not only effective at extracting the multimodal clusters but also efficient in computational time and relatively robust at the presence of outliers.  相似文献   

4.
The paper gives a new interpretation and a possible optimization of the wellknown k-means algorithm for searching for a locally optimal partition of the set A = {a i ∈ ? n : i = 1, …, m} which consists of k disjoint nonempty subsets π1, …, π k , 1 ? k ? m. For this purpose, a new divided k-means algorithm was constructed as a limit case of the known smoothed k-means algorithm. It is shown that the algorithm constructed in this way coincides with the k-means algorithm if during the iterative procedure no data points appear in the Voronoi diagram. If in the partition obtained by applying the divided k-means algorithm there are data points lying in the Voronoi diagram, it is shown that the obtained result can be improved further.  相似文献   

5.
This paper describes the k-means range algorithm, a combination of the partitional k-means clustering algorithm with a well known spatial data structure, namely the range tree, which allows fast range searches. It offers a real-time solution for the development of distributed interactive decision aids in e-commerce since it allows the consumer to model his preferences along multiple dimensions, search for product information, and then produce the data clusters of the products retrieved to enhance his purchase decisions. This paper also discusses the implications and advantages of this approach in the development of on-line shopping environments and consumer decision aids in traditional and mobile e-commerce applications.  相似文献   

6.
Binary data represent a very special condition where both measures of distance and co-occurrence can be adopted. Euclidean distance-based non-hierarchical methods, like the k-means algorithm, or one of its versions, can be profitably used. When the number of available attributes increases the global clustering performance usually worsens. In such cases, to enhance group separability it is necessary to remove the irrelevant and redundant noisy information from the data. The present approach belongs to the category of attribute transformation strategy, and combines clustering and factorial techniques to identify attribute associations that characterize one or more homogeneous groups of statistical units. Furthermore, it provides graphical representations that facilitate the interpretation of the results.  相似文献   

7.
The k points that optimally represent a distribution (usually in terms of a squared error loss) are called the k principal points. This paper presents a computationally intensive method that automatically determines the principal points of a parametric distribution. Cluster means from the k-means algorithm are nonparametric estimators of principal points. A parametric k-means approach is introduced for estimating principal points by running the k-means algorithm on a very large simulated data set from a distribution whose parameters are estimated using maximum likelihood. Theoretical and simulation results are presented comparing the parametric k-means algorithm to the usual k-means algorithm and an example on determining sizes of gas masks is used to illustrate the parametric k-means algorithm.  相似文献   

8.
The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.  相似文献   

9.
《Computational Geometry》1999,12(1-2):125-152
The visual nature of geometry applications makes them a natural area where visualization can be an effective tool for demonstrating algorithms. In this paper we propose a new model, called Mocha, for interactive visualization of algorithms over the World Wide Web. Mocha is a distributed model with a client-server architecture that optimally partitions the software components of a typical algorithm execution and visualization system, and leverages the power of the Java language, which has become the standard for distributing interactive platform-independent applications across the Web. Mocha provides high levels of security, protects the algorithm code, places a light communication load on the Internet, and allows users with limited computing resources to access executions of computationally expensive algorithms. The user interface combines fast responsiveness with the powerful authoring capabilities of hypertext narratives.We describe the architecture of Mocha, show its advantages over previous methods, and present a prototype that can be accessed by any user with a Java-enabled Web browser. The Mocha prototype has been widely accessed over the Web, as demonstrated by the statistics that we have collected, and the Mocha model has been adopted by other research groups. Mocha is currently part of a broader system, called GeomNet, which performs distributed geometric computing over the Internet.  相似文献   

10.
In many domains, data now arrive faster than we are able to mine it. To avoid wasting these data, we must switch from the traditional “one-shot” data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are iid) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.  相似文献   

11.
Clustering is a fundamental problem in many scientific applications. Standard methods such as k-means, Gaussian mixture models, and hierarchical clustering, however, are beset by local minima, which are sometimes drastically suboptimal. Recently introduced convex relaxations of k-means and hierarchical clustering shrink cluster centroids toward one another and ensure a unique global minimizer. In this work, we present two splitting methods for solving the convex clustering problem. The first is an instance of the alternating direction method of multipliers (ADMM); the second is an instance of the alternating minimization algorithm (AMA). In contrast to previously considered algorithms, our ADMM and AMA formulations provide simple and unified frameworks for solving the convex clustering problem under the previously studied norms and open the door to potentially novel norms. We demonstrate the performance of our algorithm on both simulated and real data examples. While the differences between the two algorithms appear to be minor on the surface, complexity analysis and numerical experiments show AMA to be significantly more efficient. This article has supplementary materials available online.  相似文献   

12.
In this paper we provide evidence of the benefits of an approach which combines data mining and mathematical programming to determining the premium to charge automobile insurance policy holders in order to arrive at an optimal portfolio. An non-linear integer programming formulation is proposed to determine optimal premiums based on the insurer's need to find a balance between profitability and market share. The non-linear integer programming approach to solving this problem is used within a data mining framework which consists of three components: classifying policy holders into homogenous risk groups and predicting the claim cost of each group using k-means clustering; determining the price sensitivity (propensity to pay) of each group using neural networks; and combining the results of the first two components to determine the optimal premium to charge. We have earlier presented the results of the first two components. In this paper we present the results of the third component. Using our approach, we have been able to increase revenue without affecting termination rates and market share.  相似文献   

13.
We propose minimum volume ellipsoids (MVE) clustering as an alternative clustering technique to k-means for data clusters with ellipsoidal shapes and explore its value and practicality. MVE clustering allocates data points into clusters in a way that minimizes the geometric mean of the volumes of each cluster’s covering ellipsoids. Motivations for this approach include its scale-invariance, its ability to handle asymmetric and unequal clusters, and our ability to formulate it as a mixed-integer semidefinite programming problem that can be solved to global optimality. We present some preliminary empirical results that illustrate MVE clustering as an appropriate method for clustering data from mixtures of “ellipsoidal” distributions and compare its performance with the k-means clustering algorithm as well as the MCLUST algorithm (which is based on a maximum likelihood EM algorithm) available in the statistical package R. Research of the first author was supported in part by a Discovery Grant from NSERC and a research grant from Faculty of Mathematics, University of Waterloo. Research of the second author was supported in part by a Discovery Grant from NSERC and a PREA from Ontario, Canada.  相似文献   

14.
The family of expectation--maximization (EM) algorithms provides a general approach to fitting flexible models for large and complex data. The expectation (E) step of EM-type algorithms is time-consuming in massive data applications because it requires multiple passes through the full data. We address this problem by proposing an asynchronous and distributed generalization of the EM called the distributed EM (DEM). Using DEM, existing EM-type algorithms are easily extended to massive data settings by exploiting the divide-and-conquer technique and widely available computing power, such as grid computing. The DEM algorithm reserves two groups of computing processes called workers and managers for performing the E step and the maximization step (M step), respectively. The samples are randomly partitioned into a large number of disjoint subsets and are stored on the worker processes. The E step of DEM algorithm is performed in parallel on all the workers, and every worker communicates its results to the managers at the end of local E step. The managers perform the M step after they have received results from a γ-fraction of the workers, where γ is a fixed constant in (0, 1]. The sequence of parameter estimates generated by the DEM algorithm retains the attractive properties of EM: convergence of the sequence of parameter estimates to a local mode and linear global rate of convergence. Across diverse simulations focused on linear mixed-effects models, the DEM algorithm is significantly faster than competing EM-type algorithms while having a similar accuracy. The DEM algorithm maintains its superior empirical performance on a movie ratings database consisting of 10 million ratings. Supplementary material for this article is available online.  相似文献   

15.
A parallel asynchronous Newton algorithm for unconstrained optimization   总被引:1,自引:0,他引:1  
A new approach to the solution of unconstrained optimization problems is introduced. It is based on the exploitation of parallel computation techniques and in particular on an asynchronous communication model for the data exchange among concurrent processes. The proposed approach arises by interpreting the Newton method as being composed of a set of iterative and independent tasks that can be mapped onto a parallel computing system for the execution.Numerical experiments on the resulting algorithm have been carried out to compare parallel versions using synchronous and asynchronous communication mechanisms in order to assess the benefits of the proposed approach on a variety of parallel computing architectures. It is pointed out that the proposed asynchronous Newton algorithm is preferable for medium and large-scale problems, in the context of both distributed and shared memory architectures.This research work was partially supported by the National Research Council of Italy, within the special project Sistemi Informatici e Calcolo Parallelo, under CNR Contract No. 90.00675.PF69.  相似文献   

16.
In order to maintain load balancing in a distributed system, we should obtain workload information from all the nodes on network. This processing requiresO(v 2) communication overhead, wherev is the number of nodes. In this paper, we present a new synchronous dynamic distributed load balancing algorithm on a (v, k+1, 1)-configured network applying a symmetric balanced incomplete block design, wherev=k 2+k+1. Our algorithm needs only $O(v\sqrt v )$ communication overhead and each node receives workload information from all the nodes without redundancy. Therefore, load balancing is maintained since every link has the same amount of traffic for transferring workload information.  相似文献   

17.
In this paper, we prove that the maximum k-club problem (MkCP) defined on an undirected graph is NP-hard. We also give an integer programming formulation for this problem as well as an exact branch-and-bound algorithm and computational results on instances involving up to 200 vertices. Instances defined on very dense graphs can be solved to optimality within insignificant computing times. When k=2, the most difficult cases appear to be those where the graph density is around 0.15.  相似文献   

18.
Fitting semiparametric clustering models to dissimilarity data   总被引:1,自引:0,他引:1  
The cluster analysis problem of partitioning a set of objects from dissimilarity data is here handled with the statistical model-based approach of fitting the “closest” classification matrix to the observed dissimilarities. A classification matrix represents a clustering structure expressed in terms of dissimilarities. In cluster analysis there is a lack of methodologies widely used to directly partition a set of objects from dissimilarity data. In real applications, a hierarchical clustering algorithm is applied on dissimilarities and subsequently a partition is chosen by visual inspection of the dendrogram. Alternatively, a “tandem analysis” is used by first applying a Multidimensional Scaling (MDS) algorithm and then by using a partitioning algorithm such as k-means applied on the dimensions specified by the MDS. However, neither the hierarchical clustering algorithms nor the tandem analysis is specifically defined to solve the statistical problem of fitting the closest partition to the observed dissimilarities. This lack of appropriate methodologies motivates this paper, in particular, the introduction and the study of three new object partitioning models for dissimilarity data, their estimation via least-squares and the introduction of three new fast algorithms.  相似文献   

19.
A synchronous concurrent algorithm (SCA) is a parallel deterministic algorithm based on a network of modules and channels, computing and communicating data in parallel, and synchronised by a global clock with discrete time. Many types of algorithms, computer architectures, and mathematical models of physical and biological systems are examples of SCAs. For example, conventional digital hardware is made from components that are SCAs and many computational models possess the essential features of SCAs, including systolic arrays, neural networks, cellular automata and coupled map lattices.In this paper we formalise the general concept of an SCA equipped with a global clock in order to analyse precisely (i) specifications of their spatio-temporal behaviour; and (ii) the senses in which the algorithms are correct. We start the mathematical study of SCA computation, specification and correctness using methods based on computation on many-sorted topological algebras and equational logic. We show that specifications can be given equationally and, hence, that the correctness of SCAs can be reduced to the validity of equations in certain computable algebras. Since the idea of an SCA is general, our methods and results apply to each of the particular classes of algorithms and dynamical systems above.  相似文献   

20.
A variety of iterative clustering algorithms require an initial partition of a dataset as an input parameter. As a rule a good choice of the initial partition is essential for building a high quality final partition. In this note, we generate initial partitions by using small samples of the data. Numerical experiments with k-means like clustering algorithms are reported.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号