Cluster Analysis for Large Datasets: An Effective Algorithm for Maximizing the Mixture Likelihood

Authors:	Daniel A. Coleman; David L. Woodruff

Institution:	1. Cytokinetics, South San Francisco, CA 94080, USA; 2. Graduate School of Management, University of California, Davis, CA 95616, USA

Abstract:	The primary model for cluster analysis is the latent class model. This model yields the mixture likelihood. Due to numerous local maxima, the success of the EM algorithm in maximizing the mixture likelihood depends on the initial starting point of the algorithm. In this article, good starting points for the EM algorithm are obtained by applying classification methods to randomly selected subsamples of the data. The performance of the resulting two-step algorithm, classification followed by EM, is compared to, and found superior to, the baseline algorithm of EM started from a random partition of the data. Though the algorithm is not complicated, comparing it to the baseline algorithm and assessing its performance with several classification methods is nontrivial. The strategy employed for comparing the algorithms is to identify canonical forms for the easiest and most difficult datasets to cluster within a large collection of cluster datasets and then to compare the performance of the two algorithms on these datasets. This has led to the discovery that, in the case of three homogeneous clusters, the most difficult datasets to cluster are those in which the clusters are arranged on a line and the easiest are those in which the clusters are arranged on an equilateral triangle. The performance of the two-step algorithm is assessed using several classification methods and is shown to be able to cluster large, difficult datasets consisting of three highly overlapping clusters arranged on a line with 10,000 observations and 8 variables.
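
The two-step idea in the abstract (classify a random subsample, then start EM from that classification) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes univariate Gaussian components, uses k-means as a stand-in classification method (the abstract does not name the methods tested), and keeps the start with the highest log-likelihood. All function names, the number of starts, and the subsample size are hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Mixture log-likelihood: sum_i log sum_j w_j * N(x_i | mean_j, var_j)."""
    total = 0.0
    for x in data:
        s = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(max(s, 1e-300))  # guard against underflow
    return total


def params_from_partition(data, labels, k):
    """Turn a hard partition into mixture parameters (weights, means, variances)."""
    n = len(data)
    weights, means, variances = [], [], []
    for j in range(k):
        members = [x for x, lab in zip(data, labels) if lab == j]
        count = len(members)
        if not members:  # empty cluster: fall back to pooled statistics
            members = data
        m = sum(members) / len(members)
        v = sum((x - m) ** 2 for x in members) / len(members)
        weights.append((count + 1) / (n + k))  # smoothed so no weight is zero
        means.append(m)
        variances.append(max(v, 1e-6))  # assumed floor keeps the density finite
    return weights, means, variances


def em(data, weights, means, variances, iters=50, tol=1e-8):
    """Plain EM for a univariate Gaussian mixture; returns (loglik, w, mu, var)."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, k = len(data), len(weights)
    prev = -math.inf
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = max(sum(dens), 1e-300)
            resp.append([d / s for d in dens])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-300)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data))
            variances[j] = max(spread / nj, 1e-6)
        ll = log_likelihood(data, weights, means, variances)
        if ll - prev < tol:  # stop once the likelihood no longer improves
            break
        prev = ll
    return ll, weights, means, variances


def kmeans_labels(sample, k, rng, iters=20):
    """Stand-in classification method (an assumption): 1-D k-means labels."""
    centers = rng.sample(sample, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in sample]
        for j in range(k):
            members = [x for x, lab in zip(sample, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


def two_step(data, k, n_starts=3, sample_size=100, seed=0):
    """Classification on random subsamples, then EM on the full data."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_starts):
        sample = rng.sample(data, min(sample_size, len(data)))
        init = params_from_partition(sample, kmeans_labels(sample, k, rng), k)
        fits.append(em(data, *init))
    return max(fits, key=lambda fit: fit[0])


def baseline(data, k, n_starts=3, seed=0):
    """EM started from random partitions of the full data."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_starts):
        labels = [rng.randrange(k) for _ in data]
        fits.append(em(data, *params_from_partition(data, labels, k)))
    return max(fits, key=lambda fit: fit[0])
```

Calling `two_step(data, 3)` and `baseline(data, 3)` on the same list of floats returns a tuple of log-likelihood, weights, means, and variances for each strategy, so the two log-likelihoods can be compared directly. The sketch makes no claim about which one wins on a given dataset; the comparison reported in the abstract concerns the authors' own experiments on multivariate data.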

Keywords:	Classification; EM algorithm; Local search
