Mining categorical sequences from data using a hybrid clustering method
Authors: Luca De Angelis, José G. Dias
Affiliation: 1. Department of Statistical Sciences, Alma Mater Studiorum, University of Bologna, Via Belle Arti, 41, 40126 Bologna, Italy; 2. Instituto Universitário de Lisboa (ISCTE-IUL), Business Research Unit, Portugal
Abstract:
The identification of different dynamics in sequential data has become an everyday need in scientific fields such as marketing, bioinformatics, finance, and the social sciences. In contrast to cross-sectional or static data, observations of this type (also known as stream data, temporal data, longitudinal data, or repeated measures) are more challenging, as one has to incorporate data dependency into the clustering process. In this research, we focus on clustering categorical sequences. The proposed method combines model-based and heuristic clustering. In the first step, an extension of the hidden Markov model transforms the categorical sequences into a probabilistic space in which a symmetric Kullback–Leibler distance can operate. In the second step, the sequences are clustered by applying hierarchical clustering to the resulting distance matrix. This paper illustrates the enormous potential of this type of hybrid approach using a synthetic data set, the well-known Microsoft data set of website users' search patterns, and a survey on job career dynamics.
Keywords: Data mining; Sequential data; Hidden Markov models; Clustering; Categorical data
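
The two-step idea in the abstract (map each sequence into a probabilistic space, then cluster hierarchically on symmetric Kullback–Leibler distances) can be sketched roughly as follows. This is a simplified illustration, not the authors' method: it swaps the paper's hidden Markov model extension for a plain first-order Markov transition matrix estimated per sequence, and it uses a naive average-linkage routine. The function names, the smoothing constant `alpha`, and the toy sequences are all assumptions made for this example.

```python
import math
from itertools import combinations


def transition_probs(seq, states, alpha=1.0):
    """Step 1 (simplified): represent a sequence by its smoothed
    first-order transition matrix. ASSUMPTION: a plain Markov chain
    stands in for the paper's hidden Markov model extension."""
    idx = {s: i for i, s in enumerate(states)}
    k = len(states)
    # Additive smoothing keeps every probability strictly positive,
    # so the logarithms in the divergence are always defined.
    counts = [[alpha] * k for _ in range(k)]
    for a, b in zip(seq, seq[1:]):
        counts[idx[a]][idx[b]] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler divergence KL(p||q) + KL(q||p),
    summed over the rows of two transition matrices."""
    total = 0.0
    for prow, qrow in zip(p, q):
        for pi, qi in zip(prow, qrow):
            total += (pi - qi) * math.log(pi / qi)
    return total


def average_linkage(dist, n_clusters):
    """Step 2: naive agglomerative clustering (average linkage) on a
    precomputed distance matrix, stopping at n_clusters groups."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best  # a < b, so deleting b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def cluster_sequences(seqs, states, n_clusters):
    """Run both steps: per-sequence models, distance matrix, clustering."""
    models = [transition_probs(s, states) for s in seqs]
    dist = [[symmetric_kl(p, q) for q in models] for p in models]
    return average_linkage(dist, n_clusters)


if __name__ == "__main__":
    # Toy data (hypothetical): two "sticky" and two "alternating" sequences.
    toy = ["AAAAABAAAA", "AAAABAAAAA", "ABABABABAB", "BABABABABA"]
    print(cluster_sequences(toy, "AB", 2))
```

Summing the term `(p - q) * log(p / q)` over all entries gives KL(p||q) + KL(q||p) directly, which makes the distance symmetric by construction. In practice, a library routine for hierarchical clustering would likely replace the quadratic-time loop shown here.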
This article is indexed in ScienceDirect and other databases.