Similar literature
 20 similar articles found; search time 15 ms
1.
In this paper we present a comparison of several nonhierarchical and hierarchical clustering algorithms, including the SOM (Self-Organizing Map) neural network and fuzzy c-means. Data were simulated considering correlated and uncorrelated variables, and nonoverlapping and overlapping clusters, with and without outliers. A total of 2530 data sets were simulated. The results showed that fuzzy c-means performed very well in all cases and remained stable even in the presence of outliers and overlapping clusters. All other clustering algorithms were strongly affected by the amount of overlap and the number of outliers. The SOM neural network performed poorly in almost all cases and was strongly affected by the number of variables and clusters. The traditional hierarchical clustering and K-means methods showed similar performance.
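
The fuzzy c-means method compared in this abstract can be illustrated with a minimal sketch. It follows the standard textbook update rules (memberships proportional to distance raised to -2/(m-1), centers as membership-weighted means); the function name, the fuzzifier default m = 2, the random initialisation and the stopping tolerance are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a basic fuzzy c-means run.

    points: list of equal-length tuples; c: number of clusters;
    m > 1: fuzzifier (m=2 is an assumed, commonly used default).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, c)]  # random initial centers
    u = []
    for _ in range(iters):
        # Membership update: u[k][i] is proportional to d(x_k, c_i)^(-2/(m-1)).
        u = []
        for p in points:
            d = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            w = [dist ** (-2.0 / (m - 1.0)) for dist in d]
            total = sum(w)
            u.append([wi / total for wi in w])
        # Center update: mean of the points weighted by u^m.
        new_centers = []
        for i in range(c):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            new_centers.append([
                sum(wk * p[j] for wk, p in zip(weights, points)) / total
                for j in range(len(points[0]))
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u
```

Each row of the returned membership matrix sums to 1, so a point lying between two clusters is shared between them rather than forced into one, which is one plausible reason for the stability under overlap that the abstract reports.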

2.
Factor clustering methods have been developed in recent years thanks to improvements in computational power. These methods perform a linear transformation of data and a clustering of the transformed data, optimizing a common criterion. Probabilistic distance (PD)-clustering is an iterative, distribution-free, probabilistic clustering method. Factor PD-clustering (FPDC) is based on PD-clustering and involves a linear transformation of the original variables into a reduced number of orthogonal ones using a common criterion with PD-clustering. This paper demonstrates that Tucker3 decomposition can be used to accomplish this transformation. Factor PD-clustering alternates between Tucker3 decomposition and PD-clustering on the transformed data until convergence is achieved. This method can significantly improve the performance of the PD-clustering algorithm; large data sets can thus be partitioned into clusters with increasing stability and robustness of the results. Real and simulated data sets are used to compare FPDC with its main competitors: it performs equally well when clusters are elliptically shaped but outperforms its competitors on non-Gaussian shaped clusters or noisy data.

3.
The analysis of large-scale data sets using clustering techniques arises in many different disciplines and has important applications. Most traditional clustering techniques require heuristic methods for finding good solutions and produce suboptimal clusters as a result. In this article, we present a rigorous biclustering approach, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix. The physical permutations of the rows and columns are accomplished via a network flow model according to a given objective function. This optimal re-ordering model is used in an iterative framework where cluster boundaries in one dimension are used to partition and re-order the other dimensions of the corresponding submatrices. The performance of OREO is demonstrated on metabolite concentration data to validate the ability of the proposed method and compare it to existing clustering methods.

4.
Clustering has been widely used to partition data into groups so that the degree of association is high among members of the same group and low among members of different groups. Though many effective and efficient clustering algorithms have been developed and deployed, most of them still lack an automatic or online way to decide the optimal number of clusters. In this paper, we define clustering gain as a measure of clustering optimality, which is based on the squared error sum as a clustering algorithm proceeds. When the measure is applied to a hierarchical clustering algorithm, an optimal number of clusters can be found. According to the experimental results, our clustering measure performs well and produces intuitively reasonable clustering configurations in Euclidean space. Furthermore, the measure can be used to estimate the desired number of clusters for partitional clustering methods as well. Therefore, the clustering gain measure provides a promising technique for achieving a higher level of quality for a wide range of clustering methods.

5.
Space-Time Point-Process Models for Earthquake Occurrences   (cited by: 5 total; 0 self-citations, 5 by others)
Several space-time statistical models are constructed based on both classical empirical studies of clustering and some more speculative hypotheses. Then we discuss the discrimination between models incorporating contrasting assumptions concerning the form of the space-time clusters. We also examine further practical extensions of the model to situations where the background seismicity is spatially non-homogeneous and the clusters are non-isotropic. The goodness-of-fit of the models, as measured by AIC values, is discussed for two high-quality data sets from different tectonic regions. AIC also allows the details of the clustering structure in space to be clarified. A simulation algorithm for the models is provided and used to confirm the numerical accuracy of the likelihood calculations. The simulated data sets show spatial distributions similar to the real ones, but differ from them in some features of space-time clustering. These differences may provide useful indicators of directions for further study.

6.
Data clustering, also called unsupervised learning, is a fundamental task in data mining that is used to understand and partition an unlabeled collection of data into separate groups based on similarity. Recent studies have shown that clustering techniques that optimize a single objective may not give satisfactory results, because no single validity measure works well on all kinds of data sets. Moreover, the performance of clustering algorithms degrades as the overlap among clusters in a data set increases. These facts motivated us to develop a fuzzy multi-objective particle swarm optimization framework for data clustering, termed FMOPSO, which is able to deliver more effective results than state-of-the-art clustering algorithms. The key challenge in designing the FMOPSO framework is how to resolve ambiguous cluster assignments for points that belong significantly to more than one cluster. The proposed framework addresses this problem by identifying points with significant membership in multiple classes, excluding them, and re-classifying them into single class assignments. To ascertain the superiority of the proposed algorithm, statistical tests were performed on a variety of numerical and categorical real-life data sets. Our empirical study shows that the proposed framework significantly outperforms state-of-the-art data clustering algorithms in terms of both efficiency and effectiveness.

7.
In this paper, we investigate the problem of determining the number of clusters in the k-modes based categorical data clustering process. We propose a new categorical data clustering algorithm with automatic selection of k. The new algorithm extends the k-modes clustering algorithm by introducing a penalty term to the objective function to make more clusters compete for objects. In the new objective function, we employ a regularization parameter to control the number of clusters in a clustering process. Instead of finding k directly, we choose a value of the regularization parameter such that the corresponding clustering result is the most stable among all the generated clustering results. Experimental results on synthetic and real data sets demonstrate the effectiveness of the proposed algorithm.

8.
Hierarchical hesitant fuzzy K-means clustering algorithm   (cited by: 1 total; 0 self-citations, 1 by others)
Because of limited and hesitant knowledge, the membership degree of an element in a given set often takes several different values, a situation that conventional fuzzy sets cannot handle. Hesitant fuzzy sets are a powerful tool for treating this case. The present paper investigates a clustering technique for hesitant fuzzy sets based on the K-means clustering algorithm, which takes the results of hierarchical clustering as the initial clusters. Finally, two examples demonstrate the validity of our algorithm.

9.
Hierarchical and empirical Bayes approaches to inference are attractive for data arising from microarray gene expression studies because of their ability to borrow strength across genes in making inferences. Here we focus on the simplest case where we have data from replicated two colour arrays which compare two samples and where we wish to decide which genes are differentially expressed and obtain estimates of operating characteristics such as false discovery rates. The purpose of this paper is to examine the frequentist performance of Bayesian variable selection approaches to this problem for different prior specifications and to examine the effect on inference of commonly used empirical Bayes approximations to hierarchical Bayes procedures. The paper makes three main contributions. First, we describe how the log odds of differential expression can usually be computed analytically in the case where a double tailed exponential prior is used for gene effects rather than a normal prior, which gives an alternative to the commonly used B-statistic for ranking genes in simple comparative experiments. The second contribution of the paper is to compare empirical Bayes procedures for detecting differential expression with hierarchical Bayes methods which account for uncertainty in prior hyperparameters to examine how much is lost in using the commonly employed empirical Bayes approximations. Third, we describe an efficient MCMC scheme for carrying out the computations required for the hierarchical Bayes procedures. Comparisons are made via simulation studies where the simulated data are obtained by fitting models to some real microarray data sets. The results have implications for analysis of microarray data using parametric hierarchical and empirical Bayes methods for more complex experimental designs: generally we find that the empirical Bayes methods work well, which supports their use in the analysis of more complex experiments when a full hierarchical Bayes analysis would impose heavy computational demands.

10.
In recent years, hierarchical model-based clustering has provided promising results in a variety of applications. However, its use with large datasets has been hindered by time and memory complexity that is at least quadratic in the number of observations. To overcome this difficulty, this article proposes to start the hierarchical agglomeration from an efficient classification of the data into many classes rather than from the usual set of singleton clusters. This initial partition is derived from a subgraph of the minimum spanning tree associated with the data. To this end, we develop graphical tools that assess the presence of clusters in the data and uncover observations that are difficult to classify. We use this approach to analyze two large, real datasets: a multiband MRI image of the human brain and data on global precipitation climatology. We use the real datasets to discuss ways of integrating the spatial information in the clustering analysis. We focus on two-stage methods, in which a second stage of processing using established methods is applied to the output from the algorithm presented in this article, viewed as a first stage.
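
One way to picture an initial partition taken from a subgraph of the minimum spanning tree is sketched below. The abstract does not say how the subgraph is chosen, so the rule used here (drop every tree edge longer than a fixed `cutoff` and keep the connected components) is purely an assumption for illustration, as are the function names.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # shortest known link from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def initial_partition(points, cutoff):
    """Drop MST edges longer than `cutoff`; label the remaining connected components."""
    label = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    for length, i, j in mst_edges(points):
        if length <= cutoff:
            label[find(i)] = find(j)
    return [find(i) for i in range(len(points))]
```

This version of Prim's algorithm recomputes distances on the fly, so it needs memory linear in the number of points even though its running time is still quadratic. The resulting groups could then be handed to an agglomerative method in place of singleton clusters.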

11.
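A new method is proposed for building a runoff forecasting model on the basis of fuzzy cluster analysis of the predictor set. Historical runoff data are first classified by fuzzy C-means (FCM) clustering; wavelet neural networks are then used to build local prediction models relating the characteristic values of each class of the predictor set to the observed values; and a characteristic-value classifier is designed to find the matching local network model automatically for prediction. A simple wavelet neural network prediction model and the proposed prediction model based on FCM and wavelet neural networks are compared on a worked example using the 2011 daily mean inflow of a reservoir in southwestern China, and the results are fairly satisfactory.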

12.
In this article, we present a randomized dynamic cluster algorithm for large data sets. It is based on the restricted random walk cluster algorithm by Schöll and Schöll-Paschinger that has given good results in past studies. We discuss different approaches for the clustering of dynamic data sets. In contrast to most of these methods, dynamic restricted random walk clustering is also efficient for a small percentage of changes in the data set and has the additional advantage that the updates asymptotically produce the same clusters as a reclustering with the static variant; there is thus no need for any reclustering ever. In addition, the method has a relatively low computational complexity which enables it to cluster large data sets.

13.
This paper compares two forms of experimental design methods that may be used for the development of regression and neural network simulation metamodels. The experimental designs considered are full factorial designs and random designs. The paper shows that, for two example problems, neural network metamodels using a randomised experimental design produce more accurate and efficient metamodels than those produced by similar sized factorial designs with either regression or neural networks. The metamodelling techniques are compared by their ability to predict the results from two manufacturing systems that have different levels of complexity. The results of the comparison suggest that neural network metamodels outperform conventional regression metamodels, especially when data sets based on randomised simulation experimental designs are used to produce the metamodels rather than data sets from similar sized full factorial experimental designs.

14.
This paper describes an analysis of IP-network traffic in terms of the time variations in multi-fractal scaling properties. To obtain a comprehensive view in assessing IP-network traffic conditions, we used a hierarchical clustering scheme, which provides a way to classify high-dimensional data into a tree-like structure. Based on sequential measurements of IP-network traffic at two locations, we checked time variations in multi-fractal-related properties of measured data sets. In performing the hierarchical clustering-based analysis, we used four parameters: the highest value and the range of generalized fractal dimensions, the network throughput, and the standard deviation of average throughput for each measured data set. The results confirmed that the traffic data could be classified in accordance with the network traffic properties, demonstrating that the combined depiction of the multi-fractal-related properties and other factors can give us an effective assessment of network conditions at different times.

15.
For hierarchical clustering, dendrograms are a convenient and powerful visualization technique. Although many visualization methods have been suggested for partitional clustering, their usefulness deteriorates quickly with increasing dimensionality of the data and/or they fail to represent structure between and within clusters simultaneously. In this article we extend (dissimilarity) matrix shading with several reordering steps based on seriation techniques. Both ideas, matrix shading and reordering, have been well known for a long time. However, only recent algorithmic improvements allow us to solve or approximately solve the seriation problem efficiently for larger problems. Furthermore, seriation techniques are used in a novel stepwise process (within each cluster and between clusters) which leads to a visualization technique that is able to present the structure between clusters and the micro-structure within clusters in one concise plot. This not only allows us to judge cluster quality but also makes misspecification of the number of clusters apparent. We give a detailed discussion of the construction of dissimilarity plots and demonstrate their usefulness with several examples. Experiments show that dissimilarity plots scale very well with increasing data dimensionality.

Supplemental materials with additional experiments for this article are available online.

16.
Currently, prenatal screening for Down Syndrome (DS) uses the mother's age as well as three biochemical markers for risk prediction. Risk calculations for the biochemical markers use a quadratic discriminant function. In this paper we compare several classification procedures to quadratic discrimination methods for biochemical-based DS risk prediction, based on data from a prospective multicentre prenatal screening study. We investigate alternative methods including linear discriminant methods, logistic regression methods, neural network methods, and classification and regression-tree methods. Several experiments are performed, and in each experiment resampling methods are used to create training and testing data sets. The procedures on the test data set are summarized by the area under their receiver operating characteristic curves. In each experiment this process is repeated 500 times and then the classification procedures are compared. We find that several methods are superior to the currently used quadratic discriminant method for risk estimation for these data. The implications of these results for prenatal screening programs are discussed.
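
The summary statistic named in this abstract, the area under the receiver operating characteristic curve, can be computed directly from classifier scores. The sketch below uses the standard pairwise interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half); the function name and the 0/1 label coding are assumptions, and the study's resampling scheme is not reproduced here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count the positive/negative pairs that the scores rank correctly.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is what makes the statistic convenient for comparing procedures across repeated test splits.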

17.
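To address the difficulty of forecasting pork prices, whose fluctuations are nonlinear and driven by complex factors, a PCA-GM-BP neural network model is proposed for forecasting pork prices effectively. Monthly price data from January 2010 to December 2018, 108 observations in total, are used as the sample. PCA is applied to reduce the dimensionality of 12 factors that influence pork price changes, and the 5 principal components whose cumulative contribution rate exceeds 96% are selected to build the PCA-GM-BP neural network pork price forecasting model. The results show that, compared with the traditional BP neural network and GM-BP neural network forecasting models, the PCA-GM-BP model improves the clustering effect while increasing the accuracy of the forecasts, and offers greater applicability and reference value for forecasting pork prices in China.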
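
The component-selection step described here (keep the leading principal components until their cumulative contribution rate passes a threshold) reduces to a short calculation on the eigenvalues of the covariance or correlation matrix. In the sketch below the function name and the eigenvalues in the usage line are hypothetical; they are not the values from the pork price study.

```python
def components_needed(eigenvalues, threshold=0.96):
    """Smallest number of leading components whose cumulative share of
    total variance reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues: cumulative shares are 0.50, 0.80, 0.95, 0.98, 1.00.
print(components_needed([5.0, 3.0, 1.5, 0.3, 0.2]))  # 4
```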

18.
Clustering analysis plays an important role in the field of data mining. Hierarchical clustering has become one of the most widely used clustering techniques. However, most hierarchical clustering algorithms cannot meet the requirements of high execution efficiency and high clustering accuracy at the same time. After analyzing the advantages and disadvantages of hierarchical algorithms, the paper puts forward a two-stage clustering algorithm, named Chameleon Based on Clustering Feature Tree (CBCFT), which hybridizes the clustering feature tree of the BIRCH algorithm with the CHAMELEON algorithm. By calculating the time complexity of CBCFT, the paper argues that it increases linearly with the number of data points. Experiments on a sample data set demonstrate that CBCFT is able to identify clusters with large variance in size and shape and is robust to outliers. Moreover, the result of CBCFT is similar to that of CHAMELEON, but CBCFT overcomes CHAMELEON's low execution efficiency. Although the execution time of CBCFT is longer than that of BIRCH, its clustering result is much more satisfactory. Finally, through a case of customer segmentation at the Chinese Petroleum Corp. HUBEI branch, the paper demonstrates that the clustering result is meaningful and useful. The research is partially supported by National Natural Science Foundation of China (grants #70372049 and #70121001).

19.
A characteristic feature of many relevant real-life networks, such as the WWW, the Internet, transportation and communication networks, or even biological and social networks, is their clustering structure. In this paper we discuss a novel algorithm to identify cluster sets of densely interconnected nodes in a network. The algorithm is based on local information and is therefore very fast compared with other proposed methods, while it achieves similar performance in detecting the clusters.

20.
Automatic clustering using genetic algorithms   (cited by: 2 total; 0 self-citations, 2 by others)
Many clustering methods require the designer to provide the number of clusters as input. Unfortunately, the designer generally does not know this number beforehand. In this article, we develop a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK). In the AGCUK algorithm, noising selection and division-absorption mutation are designed to keep a balance between selection pressure and population diversity. In addition, the Davies-Bouldin index is employed to measure the validity of clusters. Experimental results on artificial and real-life data sets are given to illustrate the effectiveness of the AGCUK algorithm in automatically evolving the number of clusters and providing the clustering partition.
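
The Davies-Bouldin index mentioned in this abstract has a standard definition: for each cluster, take the worst ratio of summed within-cluster scatter to between-centroid distance against any other cluster, then average over clusters. The sketch below uses Euclidean distance and mean distance to the centroid as the scatter; it illustrates only the index, not the AGCUK algorithm, and the function name is an assumption.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a hard partition; lower values mean better separation."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")
    centroids, scatters = [], []
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        centroids.append(centroid)
        # Scatter: mean distance from the members to their centroid.
        scatters.append(sum(math.dist(p, centroid) for p in members) / len(members))
    k = len(centroids)
    worst = []
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster
        # (assumes no two centroids coincide).
        worst.append(max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        ))
    return sum(worst) / k
```

Because the index can be evaluated for partitions with different numbers of clusters, it can serve as a fitness measure when the number of clusters is itself being searched.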
