Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Clustering is an important problem in data mining. It can be formulated as a nonsmooth, nonconvex optimization problem. For most global optimization techniques, this problem is challenging even for medium-sized data sets. In this paper, we propose an approach that allows one to apply local methods of smooth optimization to solve clustering problems. We apply an incremental approach to generate starting points for cluster centers, which enables us to deal with the nonconvexity of the problem. The hyperbolic smoothing technique is applied to handle the nonsmoothness of the clustering problems and to make it possible to apply smooth optimization algorithms to solve them. Results of numerical experiments with eleven real-world data sets and the comparison with state-of-the-art incremental clustering algorithms demonstrate that smooth optimization algorithms in combination with the incremental approach are a powerful alternative to existing clustering algorithms.
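As a rough illustration of the hyperbolic smoothing idea mentioned in this abstract, the sketch below smooths the nonsmooth function max(0, y) with a commonly used hyperbolic approximation. The function name `hyperbolic_smooth_plus`, the parameter `tau`, and the choice of this particular smoothing function are assumptions made for illustration; the abstract does not state the exact form used in the paper.

```python
import math


def hyperbolic_smooth_plus(y: float, tau: float) -> float:
    """Smooth approximation of max(0, y).

    tau > 0 controls the smoothing: the approximation is differentiable
    everywhere and approaches max(0, y) as tau tends to zero.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    return (y + math.sqrt(y * y + tau * tau)) / 2.0
```

For example, `hyperbolic_smooth_plus(0.0, 2.0)` evaluates to 1.0, while for a very small `tau` the value at y = 3 is close to 3 and the value at y = -3 is close to 0.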

2.
In the Capacitated Clustering Problem (CCP), a given set of n weighted points is to be partitioned into p clusters such that the total weight of the points in each cluster does not exceed a given cluster capacity. The objective is to find a set of p centers that minimises the total scatter of the points allocated to them. In this paper a new constructive method, a general framework to improve the performance of greedy constructive heuristics, and a problem space search procedure for the CCP are proposed. The constructive heuristic finds patterns of natural subgrouping in the input data using the concept of density of points. Elements of adaptive computation and periodic construction–deconstruction concepts are implemented within the constructive heuristic to develop a general framework for building efficient heuristics. The problem-space search procedure is based on perturbations of input data for which a controlled perturbation strategy, intensification and diversification strategies are developed. The implemented algorithms are compared with existing methods on a standard set of benchmarks and on new sets of large-sized instances. The results illustrate the strengths of our algorithms in terms of solution quality and computational efficiency.

3.
Tag SNP selection is an important problem in genetic association studies. A class of algorithms to perform this task, among them a popular tool called Tagger, can be described as searching for a minimal vertex cover of a graph. In this article this approach is contrasted with a recently introduced clustering algorithm based on the graph theoretical concept of dominant sets. To compare the performance of both procedures, comprehensive simulation studies have been performed using SNP data from the ten ENCODE regions included in the HapMap project. Quantitative traits have been simulated from additive models with a single causative SNP. Simulation results suggest that clustering always performs at least as well as Tagger, while in more than a third of the considered instances substantial improvement can be observed. Additionally, an extension of the clustering algorithm is described which can be used for larger genomic data sets.

4.
The paper advocates the use of a new fuzzy-based clustering algorithm for document categorization. Each document/datum will be represented as a fuzzy set. In this respect, the fuzzy clustering algorithm will be constrained additionally in order to cluster fuzzy sets. Then, one needs to find a metric measure in order to detect the overlap between documents and the cluster prototype (category). In this respect, we use one of the interclass probabilistic separability measures, known as the Bhattacharyya distance, which will be incorporated in the general scheme of the fuzzy c-means algorithm for measuring the overlap between fuzzy sets. This enables the introduction of fuzziness in document clustering in the sense that it allows a single document to belong to more than one category. This is in line with the multiple semantic interpretations conveyed by single words, which support multiple membership in several classes. The performance of the algorithms will be illustrated using a case study from the construction sector.
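To make the Bhattacharyya distance named in this abstract concrete, here is a minimal sketch for two discrete probability distributions. Treating the inputs as discrete distributions (nonnegative weights that sum to one) is an assumption for illustration; the paper applies the measure to fuzzy sets, and the function name `bhattacharyya_distance` is hypothetical.

```python
import math
from typing import Sequence


def bhattacharyya_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Bhattacharyya distance between two discrete distributions.

    Assumes p and q hold nonnegative weights over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Bhattacharyya coefficient: measures the overlap of the distributions.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Distributions with no overlap are infinitely far apart.
    return math.inf if bc == 0.0 else -math.log(bc)
```

Identical distributions give a distance of zero, and the distance grows as the overlap between the two distributions shrinks.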

5.
Grouping objects into different categories is a basic means of cognition. In the fields of machine learning and statistics, this subject is addressed by cluster analysis. Yet, it is still controversially discussed how to assess the reliability and quality of clusterings. In particular, it is hard to determine the optimal number of clusters inherent in the underlying data. Running different cluster algorithms and cluster validation methods usually yields different optimal clusterings. In fact, several clusterings with different numbers of clusters are plausible in many situations, as different methods are specialized for diverse structural properties. To account for the possibility of multiple plausible clusterings, we employ a multi-objective approach for collecting cluster alternatives (MOCCA) from a combination of cluster algorithms and validation measures. In an application to artificial data as well as microarray data sets, we demonstrate that exploring a Pareto set of optimal partitions rather than a single solution can identify alternative solutions that are overlooked by conventional clustering strategies. Competitive solutions are hereby ranked following an impartial criterion, while the ultimate judgement is left to the investigator.

6.
In this paper, we propose a new kernel-based fuzzy clustering algorithm which tries to find the best clustering results using optimal parameters of each kernel in each cluster. It is known that data with nonlinear relationships can be separated using one of the kernel-based fuzzy clustering methods. Two common fuzzy clustering approaches are: clustering with a single kernel and clustering with multiple kernels. While clustering with a single kernel doesn’t work well with “multiple-density” clusters, multiple kernel-based fuzzy clustering tries to find an optimal linear weighted combination of kernels with initial fixed (not necessarily the best) parameters. Our algorithm is an extension of the single kernel-based fuzzy c-means and the multiple kernel-based fuzzy clustering algorithms. In this algorithm, there is no need to give “good” parameters of each kernel and no need to give an initial “good” number of kernels. Every cluster will be characterized by a Gaussian kernel with optimal parameters. In order to show its effective clustering performance, we have compared it to other similar clustering algorithms using different databases and different clustering validity measures.

7.
In a data stream environment, most conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive in a stream and these data points unfold with time. The problems of clustering time-evolving metric data and time-evolving categorical data have each been well explored in recent years, but the problem of clustering mixed-type time-evolving data remains a challenging issue due to an awkward gap between the structure of metric and categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream, to dynamically cluster mixed-type time-evolving data. It comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects the drifting concept between the current sliding window and the previous sliding window, a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window based on the clustering result of the previous sliding window, and a visualization algorithm that analyses the relationship between the clusters at different timestamps and also visualizes the evolving trends of the clusters. The efficacy of the proposed framework is shown by experiments on synthetic and real-world datasets.

8.
Cluster analysis, the determination of natural subgroups in a data set, is an important statistical methodology that is used in many contexts. A major problem with hierarchical clustering methods used today is the tendency for classification errors to occur when the empirical data departs from the ideal conditions of compact isolated clusters. Many empirical data sets have structural imperfections that confound the identification of clusters. We use a Self Organizing Map (SOM) neural network clustering methodology and demonstrate that it is superior to the hierarchical clustering methods. The performance of the neural network and seven hierarchical clustering methods is tested on 252 data sets with various levels of imperfections that include data dispersion, outliers, irrelevant variables, and nonuniform cluster densities. The superior accuracy and robustness of the neural network can improve the effectiveness of decisions and research based on clustering messy empirical data.

9.
One of the most significant discussions in the field of machine learning today is on the clustering ensemble. The clustering ensemble combines multiple partitions generated by different clustering algorithms into a single clustering solution. Genetic algorithms are known for their high ability to solve optimization problems, especially the problem of the clustering ensemble. To date, despite the major contributions to finding consensus cluster partitions through the application of genetic algorithms, there has been little discussion of population initialization through generative mechanisms in genetic-based clustering ensemble algorithms, or of the production of cluster partitions with favorable fitness values in first-phase clustering ensembles. In this paper, a threshold fuzzy C-means algorithm, named TFCM, is proposed to solve the problem of diversity of clustering, one of the most common problems in clustering ensembles. Moreover, TFCM is able to increase the fitness of cluster partitions, such that it improves the performance of genetic-based clustering ensemble algorithms. The average fitness of the cluster partitions generated by TFCM is evaluated by three different objective functions and compared against other clustering algorithms. In addition, a simple genetic-based clustering ensemble algorithm, named SGCE, is proposed, in which cluster partitions generated by the TFCM and other clustering algorithms are used as the initial population of the SGCE. The performance of the SGCE is evaluated and compared based on the different initial populations used. The experimental results based on eleven real-world datasets demonstrate that TFCM improves the fitness of cluster partitions and that the performance of the SGCE is enhanced using initial populations generated by the TFCM.

10.
A clustering methodology based on biological visual models that imitates how humans visually cluster data by spatially associating patterns has been recently proposed. The method is based on Cellular Neural Networks and some resolution adjustments. The Cellular Neural Network rebuilds low-density areas while different resolutions find the best clustering option. The algorithm has demonstrated good performance compared to other clustering techniques. However, its main drawbacks are its inability to operate on data sets with more than two dimensions and the computational time required for the resolution adjustment mechanism. This paper proposes a new version of this clustering methodology to solve such flaws. In the new approach, a pre-processing stage is incorporated featuring a Self-Organization Map that maps complex high-dimensional relations into a reduced lattice while preserving the topological organization of the initial data set. This reduced representation is employed as the two-dimensional data set for further processing. In the new version, the resolution adjustment process is also accelerated through the use of an optimization method that combines the Hill-Climbing and the Random Search techniques. By incorporating such mechanisms rather than evaluating all possible resolutions, the optimization strategy finds the best resolution for a clustering problem by using a limited number of iterations. The proposed approach has been evaluated on several two-dimensional and high-dimensional datasets. Experimental evidence shows that the proposed algorithm performs the clustering task on complex problems while running 46% faster on average than the original method. The approach is also compared to other popular clustering techniques reported in the literature. Computational experiments demonstrate competitive results in comparison to other algorithms in terms of accuracy and robustness.

11.
Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations, high leverage points, but also data points which lie among the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivates our work, because they may provide information about anomalies like fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.

12.
Traditional c-means clustering partitions a group of objects into a number of non-overlapping sets. For a given dataset, rough sets provide a more flexible and objective representation than classical sets with hard partitions and fuzzy sets with subjective membership functions. Rough c-means clustering and its extensions have been introduced and successfully applied in many real-life applications in recent years. Each cluster is represented by a reasonable pair of lower and upper approximations. However, most available algorithms pay no attention to the influence of the imbalanced spatial distribution within a cluster. The limitation of the mean iterative calculation function, with the same weight for all the data objects in a lower or upper approximation, is analyzed. A hybrid imbalanced measure of distance and density for rough c-means clustering is defined, and a modified rough c-means clustering algorithm is presented in this paper. To evaluate the proposed algorithm, it has been applied to several real-world data sets from UCI. The validity of this algorithm is demonstrated by the results of comparative experiments.

13.
Summary: In the last decade, factorial and clustering techniques have been developed to analyze multidimensional interval data (MIDs). In classic data analysis, PCA and clustering of the most significant components are usually performed to extract cluster structure from data. The clustering of the projected data is then performed, once the noise is filtered out, in a subspace generated by a few orthogonal variables. In the framework of interval data analysis, we propose the same strategy. Several computational questions arise from this generalization. First of all, the representation of data on a factorial subspace: in classic data analysis projected points remain points, but projected MIDs do not remain MIDs. Further, the choice of a distance between the represented data: many distances between points can be computed, but few distances between convex sets of points are defined. We here propose optimized techniques for representing data by convex shapes, for computing the Hausdorff distance between convex shapes, based on an L2 norm, and for performing a hierarchical clustering of projected data.

14.
Two robustness criteria are presented that are applicable to general clustering methods. Robustness and stability in cluster analysis are not only data dependent, but even cluster dependent. Robustness is in the present paper defined as a property of not only the clustering method, but also of every individual cluster in a data set. The main principles are: (a) dissimilarity measurement of an original cluster with the most similar cluster in the induced clustering obtained by adding data points, (b) the dissolution point, which is an adaptation of the breakdown point concept to single clusters, (c) isolation robustness: given a clustering method, is it possible to join, by addition of g points, arbitrarily well separated clusters? Results are derived for k-means, k-medoids (k estimated by average silhouette width), trimmed k-means, mixture models (with and without noise component, with and without estimation of the number of clusters by BIC), single and complete linkage.

15.
In the capacitated p-median problem with single source constraint, also known as the capacitated clustering problem, a given set of n weighted points is to be partitioned into p clusters such that the total weight of the points in each cluster does not exceed a given cluster capacity. The objective is to find a set of p centres that minimizes the total scatter of points allocated to these clusters. In this paper, a (λ, μ)-interchange neighbourhood based on the concept of λ-interchange of points restricted to μ-adjacent clusters is proposed. Structural properties of centres are identified and exploited to derive special data structures for their efficient evaluations. Different search and selection strategies including the variable neighbourhood search descent with respect to μ-nearest points are investigated. The most efficient strategies are then embedded in a guided construction search metaheuristic framework based either on a periodic local search procedure or a greedy random adaptive search procedure to solve the problem. Computational experience is reported on a standard set of benchmarks. The computational experience demonstrates the competitive performance of the proposed algorithms when compared to the best-known procedures in the literature in terms of solution quality and computational requirement.

16.
In this paper we present a new method for clustering categorical data sets named CL.E.KMODES. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure, which is based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters. In particular, it compares each object with each mode, for every attribute that they have in common, and then chooses the most appropriate mode and its corresponding cluster for that object. Seven widely used data sets are tested to verify the robustness of the proposed method in six clustering evaluation measures.

17.
Given a set of moving points in R^d, we show how to cluster them in advance, using a small number of clusters, so that at any time this static clustering is competitive with the optimal k-center clustering at that time. The advantage of this approach is that it avoids updating the clustering as time passes. We also show how to maintain this static clustering efficiently under insertions and deletions. To implement this static clustering efficiently, we describe a simple technique for speeding up clustering algorithms and apply it to achieve faster clustering algorithms for several problems. In particular, we present a linear time algorithm for computing a 2-approximation to the k-center clustering of a set of n points in R^d. This slightly improves the algorithm of Feder and Greene, which runs in O(n log k) time (which is optimal in the algebraic decision tree model).
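For readers unfamiliar with 2-approximations to k-center clustering, the sketch below shows the classical farthest-first traversal (often attributed to Gonzalez), which is a well-known way to obtain such an approximation in O(nk) time. It is not the linear-time algorithm of this paper, nor the Feder and Greene algorithm; the function name `farthest_first_centers` and the choice of the first point as the initial center are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def farthest_first_centers(points: List[Point], k: int) -> Tuple[List[Point], float]:
    """Pick up to k centers greedily and return them with the covering radius."""
    if not points or k < 1:
        raise ValueError("need at least one point and k >= 1")
    # Start from an arbitrary point (here: the first one).
    centers = [points[0]]
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # The next center is the point farthest from all current centers.
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)
```

With the four points (0, 0), (1, 0), (10, 0), (11, 0) and k = 2, the traversal picks (0, 0) and then (11, 0), giving a covering radius of 1.0.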

18.
The performance of kernel-based methods, such as the support vector machine (SVM), is greatly affected by the choice of kernel function. Multiple kernel learning (MKL) is a promising family of machine learning algorithms and has attracted much attention in recent years. MKL combines multiple sub-kernels to seek better results compared to single kernel learning. In order to improve the efficiency of SVM and MKL, in this paper, the Kullback–Leibler kernel function is derived to develop SVM. The proposed method employs an improved ensemble learning framework, named KLMKB, which applies Adaboost to learn multiple kernel-based classifiers. In the experiment on hyperspectral remote sensing image classification, we employ features selected through the Optional Index Factor (OIF) to classify the satellite image. We extensively examine the performance of our approach in comparison to some relevant and state-of-the-art algorithms on a number of benchmark classification data sets and a hyperspectral remote sensing image data set. Experimental results show that our method has stable behavior and noticeable accuracy across different data sets.

19.
In this article, we present a randomized dynamic cluster algorithm for large data sets. It is based on the restricted random walk cluster algorithm by Schöll and Schöll-Paschinger that has given good results in past studies. We discuss different approaches for the clustering of dynamic data sets. In contrast to most of these methods, dynamic restricted random walk clustering is also efficient for a small percentage of changes in the data set and has the additional advantage that the updates asymptotically produce the same clusters as a reclustering with the static variant; there is thus no need for any reclustering ever. In addition, the method has a relatively low computational complexity which enables it to cluster large data sets.

20.
Clustering algorithms divide up a dataset into a set of classes/clusters, where similar data objects are assigned to the same cluster. When the boundary between clusters is ill defined, which yields situations where the same data object belongs to more than one class, the notion of fuzzy clustering becomes relevant. In this case, each datum belongs to a given class with some membership grade between 0 and 1. The most prominent fuzzy clustering algorithm is the fuzzy c-means introduced by Bezdek (Pattern recognition with fuzzy objective function algorithms, 1981), a fuzzification of the k-means or ISODATA algorithm. On the other hand, several research issues have been raised regarding both the objective function to be minimized and the optimization constraints, which help to identify proper cluster shape (Jain et al., ACM Computing Surveys 31(3):264–323, 1999). This paper addresses the issue of clustering by evaluating the distance of fuzzy sets in a feature space. In particular, the fuzzy clustering optimization problem is reformulated for the case where the distance is given in terms of a divergence distance, which builds a bridge to the notion of probabilistic distance. This leads to a modified fuzzy clustering, which implicitly involves the variance–covariance of input terms. The optimal solution of the underlying optimization problem is determined, and the existence and uniqueness of the solution are demonstrated. The performance of the algorithm is assessed through two numerical applications. The former involves clustering of Gaussian membership functions and the latter tackles the well-known Iris dataset. Comparisons with standard fuzzy c-means (FCM) are evaluated and discussed.
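The membership grades mentioned in this abstract can be illustrated with the membership update of standard fuzzy c-means. The sketch below computes the grades of a single datum from its distances to the cluster prototypes; it shows the standard FCM rule, not the divergence-based variant proposed in the paper. The function name `fcm_memberships`, the default fuzzifier m = 2, and the handling of zero distances are assumptions for illustration.

```python
from typing import List, Sequence


def fcm_memberships(distances: Sequence[float], m: float = 2.0) -> List[float]:
    """Standard FCM membership grades of one datum, given its distances
    to each cluster prototype and a fuzzifier m > 1."""
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    # A datum that coincides with one or more prototypes belongs fully
    # (and equally) to those clusters.
    zeros = [d == 0.0 for d in distances]
    if any(zeros):
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); the grades sum to one.
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]
```

For a datum at distances 1 and 3 from two prototypes, with m = 2, the grades are 0.9 and 0.1: the closer prototype receives the larger grade, and the grades lie between 0 and 1.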
