Similar Literature
20 similar documents found (search time: 890 ms).
1.
《Journal of Complexity》2002,18(1):375-391
The process of partitioning a large set of patterns into disjoint and homogeneous clusters is fundamental in knowledge acquisition. It is called clustering in the literature and is applied in various fields, including data mining, statistical data analysis, compression and vector quantization. k-means is a very popular algorithm and one of the best for implementing the clustering process. Its time complexity is dominated by the product of the number of patterns, the number of clusters, and the number of iterations, and it often converges to a local minimum. In this paper, we present an improvement of the k-means clustering algorithm, aiming at better time complexity and partitioning accuracy. Our approach reduces the number of patterns that need to be examined for similarity in each iteration using a windowing technique based on a well-known spatial data structure, the range tree, which allows fast range searches.
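For reference, the baseline being accelerated is the standard Lloyd iteration for k-means; the following is a minimal sketch (Python/NumPy is an assumed tooling choice, and the range-tree windowing step itself is not shown):

```python
# Minimal sketch of the standard Lloyd/k-means iteration; the assignment step's
# O(n*k) distance evaluations per iteration are the cost the windowing technique targets.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: each pattern goes to its nearest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update step: move each center to the mean of its assigned patterns
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged (possibly to a local minimum)
            break
        centers = new_centers
    return centers, labels
```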

2.
The paper gives a new interpretation and a possible optimization of the well-known k-means algorithm for searching for a locally optimal partition of the set A = {a_i ∈ R^n : i = 1, …, m} which consists of k disjoint nonempty subsets π_1, …, π_k, 1 ≤ k ≤ m. For this purpose, a new divided k-means algorithm was constructed as a limit case of the known smoothed k-means algorithm. It is shown that the algorithm constructed in this way coincides with the k-means algorithm if no data points appear in the Voronoi diagram during the iterative procedure. If the partition obtained by applying the divided k-means algorithm contains data points lying in the Voronoi diagram, it is shown that the obtained result can be improved further.

3.
Clustering multimodal datasets can be problematic when a conventional algorithm such as k-means is applied, owing to its implicit assumption of a Gaussian distribution of the dataset. This paper proposes a tandem clustering process for multimodal datasets. The proposed method first divides the multimodal dataset into many small pre-clusters by applying the k-means or fuzzy k-means algorithm. These pre-clusters are then clustered again by an agglomerative hierarchical clustering method using the Kullback–Leibler divergence as an initial measure of dissimilarity. Benchmark results show that the proposed approach is not only effective at extracting the multimodal clusters but also efficient in computational time and relatively robust in the presence of outliers.
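A hedged sketch of this two-stage idea is shown below; the pre-cluster count, the symmetrized divergence, the average linkage, and the use of scikit-learn are assumptions for illustration, not the paper's exact choices:

```python
# Tandem clustering sketch: many small k-means pre-clusters, then agglomerative
# merging driven by a symmetrized KL divergence between Gaussian summaries of
# the pre-clusters (an approximation of the approach described above).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL(N(mu0, cov0) || N(mu1, cov1)) for multivariate Gaussians."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def tandem_cluster(X, n_pre=50, n_final=3, seed=0):
    pre = KMeans(n_clusters=n_pre, n_init=10, random_state=seed).fit(X)
    d = X.shape[1]
    summaries = []
    for j in range(n_pre):
        pts = X[pre.labels_ == j]
        cov = np.cov(pts.T) if len(pts) > 1 else np.zeros((d, d))
        summaries.append((pts.mean(axis=0), cov + 1e-6 * np.eye(d)))  # regularized covariance
    # symmetric KL "distance" matrix between pre-clusters
    D = np.zeros((n_pre, n_pre))
    for a in range(n_pre):
        for b in range(a + 1, n_pre):
            D[a, b] = D[b, a] = 0.5 * (gaussian_kl(*summaries[a], *summaries[b])
                                       + gaussian_kl(*summaries[b], *summaries[a]))
    # requires scikit-learn >= 1.2 for the `metric` keyword
    merge = AgglomerativeClustering(n_clusters=n_final, metric="precomputed",
                                    linkage="average").fit(D)
    return merge.labels_[pre.labels_]  # map each point through its pre-cluster
```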

4.
The k points that optimally represent a distribution (usually in terms of a squared error loss) are called the k principal points. This paper presents a computationally intensive method that automatically determines the principal points of a parametric distribution. Cluster means from the k-means algorithm are nonparametric estimators of principal points. A parametric k-means approach is introduced for estimating principal points by running the k-means algorithm on a very large data set simulated from a distribution whose parameters are estimated using maximum likelihood. Theoretical and simulation results are presented comparing the parametric k-means algorithm to the usual k-means algorithm, and an example on determining gas mask sizes is used to illustrate the parametric k-means algorithm.
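A minimal sketch of this recipe, assuming a univariate normal model and scikit-learn's KMeans (both illustrative assumptions), might look as follows:

```python
# Parametric k-means sketch: fit the model by maximum likelihood, simulate a very
# large sample from the fitted distribution, and take the k-means cluster means
# as estimates of the k principal points.
import numpy as np
from sklearn.cluster import KMeans

def parametric_principal_points(x, k, n_sim=1_000_000, seed=0):
    mu_hat, sigma_hat = x.mean(), x.std(ddof=0)        # MLE under a normal model
    rng = np.random.default_rng(seed)
    sim = rng.normal(mu_hat, sigma_hat, size=n_sim)    # large simulated data set
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sim.reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())        # estimated principal points
```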

5.
Basing cluster analysis on mixture models has become a classical and powerful approach. It enables some classical criteria, such as the well-known k-means criterion, to be explained. To classify the rows or the columns of a contingency table, an adapted version of k-means known as Mndki2, which uses the chi-square distance, can be used. Unfortunately, this simple, effective method, which can be used jointly with correspondence analysis based on the same representation of the data, cannot be associated with a mixture model in the same way as the classical k-means algorithm. In this paper we show that the Mndki2 algorithm can be viewed as an approximation of a classifying version of the EM algorithm for a mixture of multinomial distributions. The algorithms belonging to this context are compared experimentally using Monte Carlo simulations.
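For intuition, the chi-square distance between row profiles reduces to an ordinary Euclidean distance after rescaling by the square roots of the column masses, so a simplified (unweighted) version of chi-square k-means can be sketched as below; the row weighting and other details of the actual Mndki2 algorithm are not reproduced here:

```python
# Simplified chi-square k-means for the rows of a contingency table N:
# rescale the row profiles by sqrt(column masses) and apply standard k-means.
import numpy as np
from sklearn.cluster import KMeans

def chi2_kmeans_rows(N, k, seed=0):
    P = N / N.sum()                          # correspondence matrix
    row_mass = P.sum(axis=1, keepdims=True)
    col_mass = P.sum(axis=0)
    profiles = P / row_mass                  # row profiles
    Z = profiles / np.sqrt(col_mass)         # chi-square distance == Euclidean distance on Z
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z).labels_
```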

6.
Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploiting the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load-balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operation, which hinders the scalability of the approach. This work studies a different parallel formulation of the algorithm in which the requirement of global communication is removed, while maintaining the same deterministic nature of the centralised algorithm. The proposed approach exploits a non-uniform data distribution which can either be found in real-world distributed applications or be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.
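The straightforward formulation and its per-iteration global reduction can be sketched as follows (mpi4py is an assumed tooling choice; the paper's decentralised variant removes exactly this Allreduce step):

```python
# Sketch of the straightforward data-parallel k-means: each rank assigns its
# local points, then a global reduction combines partial sums and counts.
import numpy as np
from mpi4py import MPI

def parallel_kmeans(local_X, k, n_iter=50):
    comm = MPI.COMM_WORLD
    d = local_X.shape[1]
    # rank 0 chooses initial centers and broadcasts them to all ranks
    centers = local_X[:k].copy() if comm.rank == 0 else np.empty((k, d))
    comm.Bcast(centers, root=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(local_X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        local_sum, local_cnt = np.zeros((k, d)), np.zeros(k)
        for j in range(k):
            mask = labels == j
            local_sum[j] = local_X[mask].sum(axis=0)
            local_cnt[j] = mask.sum()
        global_sum, global_cnt = np.zeros_like(local_sum), np.zeros_like(local_cnt)
        comm.Allreduce(local_sum, global_sum, op=MPI.SUM)  # the global reduction step
        comm.Allreduce(local_cnt, global_cnt, op=MPI.SUM)
        # globally empty clusters are left at the origin for simplicity
        centers = global_sum / np.maximum(global_cnt, 1)[:, None]
    return centers, labels
```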

7.
Recently, Bandeira (C R Math, 2015) introduced a new type of algorithm (the so-called probably certifiably correct algorithm) that combines fast solvers with the optimality certificates provided by convex relaxations. In this paper, we devise such an algorithm for the problem of k-means clustering. First, we prove that the semidefinite relaxation of k-means due to Peng and Wei (SIAM J Optim 18(1):186–205, 2007) is tight with high probability under a distribution of planted clusters called the stochastic ball model. Our proof follows from a new dual certificate for integral solutions of this semidefinite program. Next, we show how to test the optimality of a proposed k-means solution using this dual certificate in quasilinear time. Finally, we analyze a version of spectral clustering from Peng and Wei (SIAM J Optim 18(1):186–205, 2007) that is designed to solve k-means in the case of two clusters. In particular, we show that this quasilinear-time method typically recovers planted clusters under the stochastic ball model.
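A commonly used statement of the Peng–Wei SDP relaxation can be sketched as below; the exact normalization and the use of cvxpy are assumptions for illustration:

```python
# Sketch of the k-means SDP relaxation: with D the matrix of squared pairwise
# distances, minimize <D, Z>/2 over symmetric Z that is PSD, entrywise
# nonnegative, has rows summing to one, and trace equal to k.
import numpy as np
import cvxpy as cp

def kmeans_sdp(X, k):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) ** 2
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0,                  # positive semidefinite
                   Z >= 0,                  # entrywise nonnegative
                   cp.sum(Z, axis=1) == 1,  # each row sums to one
                   cp.trace(Z) == k]
    objective = 0.5 * cp.sum(cp.multiply(D, Z))  # equals trace(D Z)/2 for symmetric D, Z
    prob = cp.Problem(cp.Minimize(objective), constraints)
    prob.solve()
    return Z.value, prob.value
```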

8.
In this paper, we apply nonlinear techniques (Self-Organizing Maps, k-nearest neighbors and the k-means algorithm) to evaluate the official Spanish mutual fund classification. The methodology that we propose allows us to identify which mutual funds are misclassified, in the sense that their historical performances do not conform to the investment objectives established in their official category. According to this, we conclude that, on average, over 40% of mutual funds could be misclassified. We then propose an alternative classification, based on a double-step methodology, and find that it achieves a significantly lower rate of misclassification. The portfolios obtained from this alternative classification also attain better performances in terms of return/risk and include a smaller number of assets.
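One simple way to flag such mismatches (an illustration only, not the paper's double-step methodology; the feature choice and the majority-vote rule are assumptions) is sketched below:

```python
# Cluster funds on historical performance features with k-means and flag funds
# whose cluster differs from the dominant cluster of their official category.
import numpy as np
from sklearn.cluster import KMeans

def flag_misclassified(features, official_labels, k, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    flags = np.zeros(len(features), dtype=bool)
    for cat in np.unique(official_labels):
        idx = np.where(official_labels == cat)[0]
        clusters, counts = np.unique(km.labels_[idx], return_counts=True)
        majority = clusters[counts.argmax()]       # dominant cluster for this category
        flags[idx] = km.labels_[idx] != majority   # funds deviating from it
    return flags
```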

9.
Two robustness criteria are presented that are applicable to general clustering methods. Robustness and stability in cluster analysis are not only data dependent, but even cluster dependent. Robustness is defined in the present paper as a property not only of the clustering method, but also of every individual cluster in a data set. The main principles are: (a) dissimilarity measurement of an original cluster with the most similar cluster in the induced clustering obtained by adding data points, (b) the dissolution point, which is an adaptation of the breakdown point concept to single clusters, (c) isolation robustness: given a clustering method, is it possible to join, by addition of g points, arbitrarily well separated clusters? Results are derived for k-means, k-medoids (k estimated by average silhouette width), trimmed k-means, mixture models (with and without noise component, with and without estimation of the number of clusters by BIC), and single and complete linkage.

10.
We study fair center-based clustering problems. In an influential paper, Chierichetti, Kumar, Lattanzi and Vassilvitskii (NIPS 2017) consider the problem of finding a good clustering, say of women and men, such that every cluster contains an equal number of women and men. They were able to obtain a constant-factor approximation for this problem for most center-based k-clustering objectives such as k-median, k-means, and k-center. Despite considerable interest in extending this problem to multiple protected attributes (e.g. women and men, with or without citizenship), constant-factor approximations for these problems have so far remained elusive except in special cases. We settle this question in the affirmative by giving the first constant-factor approximation for a wide range of center-based k-clustering objectives.

11.
Fitting semiparametric clustering models to dissimilarity data
The cluster analysis problem of partitioning a set of objects from dissimilarity data is here handled with the statistical model-based approach of fitting the “closest” classification matrix to the observed dissimilarities. A classification matrix represents a clustering structure expressed in terms of dissimilarities. In cluster analysis there is a lack of widely used methodologies for directly partitioning a set of objects from dissimilarity data. In real applications, a hierarchical clustering algorithm is applied to the dissimilarities and a partition is subsequently chosen by visual inspection of the dendrogram. Alternatively, a “tandem analysis” is used, first applying a Multidimensional Scaling (MDS) algorithm and then a partitioning algorithm such as k-means on the dimensions specified by the MDS. However, neither the hierarchical clustering algorithms nor the tandem analysis is specifically defined to solve the statistical problem of fitting the closest partition to the observed dissimilarities. This lack of appropriate methodologies motivates this paper, in particular the introduction and study of three new object partitioning models for dissimilarity data, their estimation via least squares, and three new fast algorithms.

12.
We propose a method to automatically decompose domains in the context of semiclassical Bohmian mechanics. The algorithm is based on the approximate quantum potential method and the technique of k-means clustering. Two numerical examples, a static analysis of quantum forces for a Pearson Type IV distribution and a temporal analysis of scattering on the Eckart barrier, are presented to show the viability of the method. The first example demonstrates that the approximate quantum forces obtained with our domain decomposition technique converge as the number of domains increases. In the second example, it is demonstrated that the domains constructed from k-means clustering adapt well to the evolving wave packet, providing coverage of both the transmitted and reflected waves. We also confirm that the use of multiple domains improves the evolution of the wave packet by comparing the result with the previously obtained quantum mechanical solution. The computational cost remains manageable even with a naive implementation of time-consuming summation routines, but development of more sophisticated methodology is recommended for large-scale, multidimensional calculations.

13.
The TCLUST procedure performs robust clustering with the aim of finding clusters with different scatter structures and weights. An eigenvalue ratio constraint is imposed by TCLUST in order to achieve a wide range of clustering alternatives depending on the allowed differences among cluster scatter matrices. Moreover, this constraint avoids finding uninteresting spurious clusters. In order to guarantee the robustness of the method against the presence of outliers and background noise, the method allows for trimming a given proportion of observations, self-determined by the data. Based on this “impartial trimming”, the procedure is expected to have good robustness properties. As was done for the trimmed k-means method, this article studies the robustness properties of the TCLUST procedure in the univariate case with two clusters by means of the influence function. The conclusion is that TCLUST has robustness behavior close to that of trimmed k-means, in spite of the fact that it addresses a more general clustering approach.

14.
Some of the most popular routing protocols for wireless sensor networks require a virtual backbone for efficient communication between the sensors. Connected dominating sets (CDS) have been studied as a method of choosing nodes to be in the backbone. The traditional approach is to assume that the transmission range of each node is given and to minimize the number of nodes in the CDS representing the backbone. A recently introduced alternative strategy is based on the concept of k-bottleneck connected dominating set (k-BCDS), which, given a positive integer k, minimizes the transmission range of the nodes that ensures a CDS of size k exists in the network. This paper provides a 6-approximate distributed algorithm for the k-BCDS problem. The results of empirical evaluation of the proposed algorithm are also included.

15.
In this note we improve an algorithm from a recent paper by Bauer and Bennett for computing a function of Erdős that measures the minimal gap size f(k) in the sequence of integers at least one of whose prime factors exceeds k. This allows us to compute f(k) for larger k and to obtain new values of f(k).

16.
To find optimal clusters of functional objects in a lower-dimensional subspace of the data, a sequential method called tandem analysis is often used, though such a method is problematic. A new procedure is developed to find optimal clusters of functional objects and, simultaneously, an optimal subspace for clustering. The method is based on the k-means criterion for functional data and seeks the subspace that is maximally informative about the clustering structure in the data. An efficient alternating least-squares algorithm is described, and the proposed method is extended to a regularized method. Analyses of artificial and real data examples demonstrate that the proposed method gives correct and interpretable results.

17.
This paper presents a multiobjective analog/RF circuit sizing tool using an improved brain storm optimization (IMBSO) algorithm with the purpose of analyzing the tradeoffs between competing performance specifications of an analog/RF circuit block. A number of improvements are incorporated into the IMBSO algorithm at different steps. First, the clustering step of the IMBSO algorithm is augmented with the k-means++ seeding technique to select the initial cluster centroids when clustering with the k-means technique. As a second improvement, the proposed IMBSO algorithm makes use of the random probabilistic decision-making of the river formation dynamics scheme to select optimal cluster centroids during the population generation step. As a third improvement, an adaptive mutation operator is incorporated into the IMBSO algorithm to generate the new population. Finally, two separate constraint handling techniques are employed to handle both boundary and functional constraints during analog/RF circuit optimization. The performance of the proposed IMBSO algorithm is demonstrated by finding optimal Pareto fronts among different performance specifications of a two-stage operational amplifier circuit, a folded cascode amplifier circuit and a low noise amplifier circuit.
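The k-means++ seeding step mentioned above can be sketched as follows (a minimal illustration of the seeding rule only, not of the IMBSO algorithm; Python/NumPy is an assumed tooling choice):

```python
# k-means++ seeding: each new centroid is sampled with probability proportional
# to its squared distance from the nearest centroid chosen so far.
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # first centroid chosen uniformly
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                      # D^2 weighting
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```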

18.
We propose minimum volume ellipsoids (MVE) clustering as an alternative clustering technique to k-means for data clusters with ellipsoidal shapes and explore its value and practicality. MVE clustering allocates data points into clusters in a way that minimizes the geometric mean of the volumes of each cluster’s covering ellipsoids. Motivations for this approach include its scale-invariance, its ability to handle asymmetric and unequal clusters, and our ability to formulate it as a mixed-integer semidefinite programming problem that can be solved to global optimality. We present some preliminary empirical results that illustrate MVE clustering as an appropriate method for clustering data from mixtures of “ellipsoidal” distributions and compare its performance with the k-means clustering algorithm as well as the MCLUST algorithm (which is based on a maximum likelihood EM algorithm) available in the statistical package R. Research of the first author was supported in part by a Discovery Grant from NSERC and a research grant from Faculty of Mathematics, University of Waterloo. Research of the second author was supported in part by a Discovery Grant from NSERC and a PREA from Ontario, Canada.

19.
Packing coloring is a partitioning of the vertex set of a graph with the property that vertices in the i-th class have pairwise distance greater than i. The main result of this paper is a solution of an open problem of Goddard et al., showing that deciding whether a tree allows a packing coloring with at most k classes is NP-complete. We further discuss specific cases in which this problem allows an efficient algorithm. Namely, we show that it is decidable in polynomial time for graphs of bounded treewidth and diameter, and fixed-parameter tractable for chordal graphs. We accompany these results with several observations on a closely related variant of the packing coloring problem, where the lower bounds on the distances between vertices inside color classes are determined by an infinite nondecreasing sequence of bounded integers.

20.
Clustering is a fundamental problem in many scientific applications. Standard methods such as k-means, Gaussian mixture models, and hierarchical clustering, however, are beset by local minima, which are sometimes drastically suboptimal. Recently introduced convex relaxations of k-means and hierarchical clustering shrink cluster centroids toward one another and ensure a unique global minimizer. In this work, we present two splitting methods for solving the convex clustering problem. The first is an instance of the alternating direction method of multipliers (ADMM); the second is an instance of the alternating minimization algorithm (AMA). In contrast to previously considered algorithms, our ADMM and AMA formulations provide simple and unified frameworks for solving the convex clustering problem under the previously studied norms and open the door to potentially novel norms. We demonstrate the performance of our algorithm on both simulated and real data examples. While the differences between the two algorithms appear to be minor on the surface, complexity analysis and numerical experiments show AMA to be significantly more efficient. This article has supplementary materials available online.
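For concreteness, the convex clustering objective that both splitting methods minimize can be written down and evaluated as below (a Euclidean fusion norm and uniform weights are illustrative assumptions; the ADMM/AMA updates themselves are not shown):

```python
# Convex clustering objective: (1/2) * sum_i ||x_i - u_i||^2
#                              + gamma * sum_{i<j} w_ij * ||u_i - u_j||
# The fusion penalty shrinks the centroids u_i toward one another, giving the
# relaxed problem a unique global minimizer.
import numpy as np
from itertools import combinations

def convex_clustering_objective(X, U, gamma, weights=None):
    n = len(X)
    fit = 0.5 * np.sum((X - U) ** 2)                   # data-fidelity term
    penalty = 0.0
    for i, j in combinations(range(n), 2):
        w = 1.0 if weights is None else weights[i, j]
        penalty += w * np.linalg.norm(U[i] - U[j])     # pairwise fusion penalty
    return fit + gamma * penalty
```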
