A Context Tree Kernel Based on Latent Semantic Topics
Cite this article: Xu Chao, Zhou Yi-min, Shen Lei. A Context Tree Kernel Based on Latent Semantic Topic[J]. Journal of Electronics & Information Technology, 2010, 32(11): 2695-2700.
Authors: Xu Chao, Zhou Yi-min, Shen Lei
Institution: School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Abstract: To address the context tree kernel's lack of semantic information when used for text representation, this paper proposes a method for constructing a context tree kernel oriented to latent topics. First, Latent Dirichlet Allocation maps the words of a text into a latent topic space; next, a context tree model is built over the latent topics; finally, the context tree kernel is constructed from the mutual information between the models. By defining the generative model of a text over semantic word classes rather than individual words, the method resolves the data-sparsity problem encountered in word-based text modeling. Clustering experiments on text data sets show that the proposed context tree kernel better measures topic similarity between texts and improves text clustering performance.

Keywords: Text clustering; Context tree kernel; Statistical language models; Latent Dirichlet Allocation (LDA)
Received: 2009-11-20

A Context Tree Kernel Based on Latent Semantic Topic
Xu Chao, Zhou Yi-min, Shen Lei. A Context Tree Kernel Based on Latent Semantic Topic[J]. Journal of Electronics & Information Technology, 2010, 32(11): 2695-2700.
Authors:Xu Chao  Zhou Yi-min  Shen Lei
Institution: School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Abstract: The lack of semantic information is a critical problem when the context tree kernel is used for text representation. This paper proposes a context tree kernel method based on latent topics. First, words are mapped to a latent topic space through Latent Dirichlet Allocation (LDA). Then, context tree models are built over the latent topics. Finally, a context tree kernel for text is defined through the mutual information between the models. In this approach, document generative models are defined over semantic classes instead of individual words, which resolves the data-sparsity problem of word-based text modeling. Clustering experiments on text data sets show that the proposed context tree kernel is a better measure of topic similarity between documents and markedly improves text clustering performance.
Keywords:Text clustering  Context tree kernel  Statistical language models  Latent Dirichlet Allocation (LDA)
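The three-step construction described in the abstract (map words to latent topics with LDA, build a per-document context model over the topic sequence, compare models to obtain a kernel) can be sketched roughly as follows. This is a simplified illustration, not the authors' construction: it substitutes an order-1 Markov model for a full variable-order context tree and a symmetrized KL divergence for the paper's mutual-information kernel, and the toy corpus and all function names are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks rose as markets rallied",
    "investors bought stocks and bonds",
]

# Step 1: map each vocabulary word to its most likely latent topic via LDA.
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
word_topic = lda.components_.argmax(axis=0)  # topic index per vocabulary word
vocab = vec.vocabulary_

def topic_sequence(doc):
    """Replace each word in the document by its latent topic label."""
    return [word_topic[vocab[w]] for w in doc.lower().split() if w in vocab]

# Step 2: per-document generative model over topics. A real context tree
# conditions on variable-length histories; here an order-1 conditional
# distribution P(topic_t | topic_{t-1}) with add-one smoothing stands in.
def context_model(seq, k):
    counts = np.ones((k, k))  # Laplace smoothing keeps all entries positive
    for prev, cur in zip(seq, seq[1:]):
        counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Step 3: kernel between two documents from their models. The paper uses
# mutual information between context tree models; an exponentiated
# negative symmetrized KL divergence is used here as a simple stand-in.
def kernel(a, b, k=2):
    pa = context_model(topic_sequence(a), k)
    pb = context_model(topic_sequence(b), k)
    sym_kl = np.sum(pa * np.log(pa / pb)) + np.sum(pb * np.log(pb / pa))
    return float(np.exp(-sym_kl))  # 1.0 for identical models, -> 0 as they diverge

print(kernel(docs[0], docs[1]))  # similarity of the two animal sentences
print(kernel(docs[0], docs[2]))  # similarity across the two themes
```

Because the kernel compares topic-transition models rather than word counts, two documents can score as similar even when they share no vocabulary, which is the data-sparsity benefit the abstract claims for topic-level modeling.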
This article is indexed in Wanfang Data and other databases.