

Innovation in the cluster validating techniques
Authors:Ravi Jain  Andy Koronios
Institution:(1) School of Computer and Information Sciences, University of South Australia, Adelaide, Australia
Abstract:Detecting database records that are approximate or exact duplicates, whether caused by data entry errors, by differences in the detailed schemas of records from multiple databases, or by other factors, is an important line of research. Yet no comprehensive comparative study has evaluated the effectiveness of the Silhouette width, the Calinski & Harabasz index (pseudo F-statistic) and the Baker & Hubert index (γ index) in the presence of exact and approximate duplicates. This paper presents a comparative study of the effectiveness of these three cluster validation techniques, which measure the stability of a partition of a data set in the presence of noise, in particular approximate and exact duplicates. The Silhouette width, Calinski & Harabasz index and Baker & Hubert index are calculated before and after exact and approximate duplicates are deliberately inserted into the data set. Comprehensive experiments on the glass, wine, iris and ruspini data sets confirm that the Baker & Hubert index is not stable in the presence of approximate duplicates. Moreover, in the presence of approximate duplicates, the Silhouette width, Calinski & Harabasz index and Baker & Hubert index do not exceed the values obtained on the original data.
Keywords:Clustering algorithms  Silhouette width  Calinski & Harabasz index  Baker & Hubert index
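
The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy data set, the helper `add_duplicates`, its `noise` parameter, and the choice to give each duplicate the cluster label of the record it copies are all assumptions made for illustration. The three indices follow their textbook definitions and use only the Python standard library.

```python
import math
import random
from itertools import combinations


def silhouette_width(points, labels):
    """Mean silhouette width over all points (assumes at least two clusters)."""
    clusters = set(labels)
    total = 0.0
    for i, p in enumerate(points):
        mean_to = {}
        for c in clusters:
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            mean_to[c] = sum(ds) / len(ds) if ds else None
        a = mean_to[labels[i]]          # mean distance within own cluster
        if a is None:                   # singleton cluster contributes 0
            continue
        b = min(v for c, v in mean_to.items() if c != labels[i])
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def calinski_harabasz(points, labels):
    """Pseudo F-statistic: between-cluster over within-cluster dispersion."""
    n, k, dim = len(points), len(set(labels)), len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * math.dist(centroid, overall) ** 2
        within += sum(math.dist(p, centroid) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))


def baker_hubert_gamma(points, labels):
    """Gamma = (s+ - s-) / (s+ + s-), comparing within- and between-cluster distances."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    s_plus = sum(1 for w in within for b in between if w < b)    # concordant
    s_minus = sum(1 for w in within for b in between if w > b)   # discordant
    return (s_plus - s_minus) / (s_plus + s_minus)


def add_duplicates(points, labels, n, noise=0.0, seed=0):
    """Hypothetical helper: append n copies of randomly chosen records.

    noise=0.0 gives exact duplicates; noise>0 perturbs each coordinate with
    Gaussian noise to mimic approximate duplicates. Each copy keeps the label
    of its source record (an assumption; the paper may re-cluster instead).
    """
    rng = random.Random(seed)
    pts, labs = list(points), list(labels)
    for _ in range(n):
        k = rng.randrange(len(points))
        pts.append(tuple(x + rng.gauss(0, noise) for x in points[k]))
        labs.append(labels[k])
    return pts, labs


# Toy data: two well-separated square clusters (not one of the paper's data sets).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

for name, noise, count in [("original", 0.0, 0), ("exact", 0.0, 3), ("approximate", 0.1, 3)]:
    pts, labs = add_duplicates(points, labels, count, noise=noise)
    print(name, silhouette_width(pts, labs), calinski_harabasz(pts, labs),
          baker_hubert_gamma(pts, labs))
```

Comparing the three printed rows shows how each index moves once duplicates are present. A faithful replication would use the glass, wine, iris and ruspini data sets and the clustering procedure of the paper, neither of which is specified on this page.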
This article is indexed in SpringerLink and other databases.