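A method of domain compound words extraction
Cite this article: LIU Jian. A method of domain compound words extraction[J]. Journal of Terahertz Science and Electronic Information Technology, 2014, 12(6): 870-873.
Author: LIU Jian
Affiliation: 1.The Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2.PLA University of Foreign Languages, Luoyang Henan 471003, China
Funding: National 973 Program of China (2012CB316303); National Natural Science Foundation of China (60933005)
Abstract: The first task in building a domain ontology is to acquire domain-related concepts. Many of these concepts consist of domain compound words that common dictionaries do not include, so extracting domain compound words is essential to domain ontology construction. Based on linguistic rules and statistical techniques, this paper proposes a domain compound word extraction method that combines improved mutual information with language templates. An improved mutual information algorithm first extracts high-frequency candidate domain compound words formed from multi-word units. Language template matching then extracts low-frequency candidates, and experts finally check the candidates to produce the set of domain compound words. Experimental results show that the algorithm extracts domain compound words with a precision of 88.22% and is suitable for automatic, efficient extraction of domain compound words from large-scale web page text.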

Keywords: domain ontology; mutual information; language templates; domain compound words
Received: 2013/12/11
Revised: 2014/3/17

A method of domain compound words extraction
LIU Jian.A method of domain compound words extraction[J].Journal of Terahertz Science and Electronic Information Technology,2014,12(6):870-873.
Authors:LIU Jian
Institution:LIU Jian(1.The Institute of Computing Technology, Chinese Academy of Sciences,Beijing 100190,China;2.PLA University of Foreign Languages,Luoyang Henan 471003,China)
Abstract:The primary task in constructing a domain ontology is to obtain the relevant domain concepts. Many of these concepts are expressed by domain compound words that are not included in common dictionaries, so extracting domain compound words is essential for building a domain ontology. Based on linguistic rules and statistical techniques, a hybrid extraction method combining improved mutual information with language templates is proposed. First, high-frequency candidate domain compound words formed from multi-word units are extracted with an improved mutual information algorithm. On this basis, low-frequency candidate domain compound words are extracted by language template matching. Finally, the candidates are checked by experts to obtain the set of domain compound words. Experimental results show that the algorithm achieves a precision of 88.22%, which indicates that the technique is suitable for automatically and efficiently extracting domain compound words from large corpora.
Keywords:domain ontology  mutual information  language templates  domain compound words
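
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the "improved" mutual information formula or the actual language templates, so standard pointwise mutual information (PMI) over adjacent word pairs stands in for the first stage, and a single hypothetical "any word + known head word" template stands in for the second. The function names, thresholds, and toy corpus are all assumptions made for illustration, and the expert check is left out.

```python
import math
from collections import Counter


def pmi_candidates(sentences, min_count=2, min_pmi=1.0):
    """Stage 1 (stand-in): score adjacent word pairs with standard PMI.

    Keeps pairs that occur at least `min_count` times and whose PMI
    is at least `min_pmi`. Returns a dict mapping pair -> PMI score.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # low-frequency pairs are left to the template stage
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            scored[(a, b)] = pmi
    return scored


def template_candidates(sentences, high_freq):
    """Stage 2 (hypothetical template): "<any word> + <known head word>".

    A head word is the last word of a pair accepted in stage 1. Any other
    adjacent pair ending in a head word becomes a low-frequency candidate.
    """
    heads = {pair[-1] for pair in high_freq}
    found = set()
    for words in sentences:
        for pair in zip(words, words[1:]):
            if pair[1] in heads and pair not in high_freq:
                found.add(pair)
    return found


# Toy, pre-segmented corpus (an assumption; real input would be web text
# that has already been word-segmented).
sentences = [
    ["domain", "ontology", "needs", "domain", "ontology", "concepts"],
    ["mutual", "information", "ranks", "domain", "ontology", "terms"],
    ["task", "ontology", "is", "rare"],
]

high = pmi_candidates(sentences)
low = template_candidates(sentences, high)
print("high-frequency candidates:", sorted(high))
print("low-frequency candidates:", sorted(low))
```

On this toy corpus the frequent pair ("domain", "ontology") passes the PMI stage, and the pair ("task", "ontology"), which occurs only once, is recovered by the template stage. In the pipeline the abstract describes, both candidate sets would then go to domain experts for verification.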
This article is indexed in VIP (维普) and other databases.