
Dialogue Speech Emotion Recognition Based on Wav2vec2.0 and Contextual Emotional Information Compensation
Cite this article: CAO Ronghe, WU Xiaolong, FENG Chang, ZHENG Fang, XU Mingxing, HANKIZ Yilahun, ASKAR Hamdulla. Dialogue Speech Emotion Recognition Based on Wav2vec2.0 and Contextual Emotional Information Compensation[J]. Journal of Signal Processing, 2023, 39(4): 698-707.
Authors: CAO Ronghe, WU Xiaolong, FENG Chang, ZHENG Fang, XU Mingxing, HANKIZ Yilahun, ASKAR Hamdulla
Affiliations: 1. College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China; 2. Center for Speech and Language Technologies, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
Abstract: Emotion plays an important role in human interaction. In everyday dialogue, utterances often carry weak emotional coloring, complex emotional categories, and high ambiguity, which makes dialogue speech emotion recognition a challenging task. To address this problem, many existing works retrieve emotional information from the entire dialogue and use this global information for prediction. However, when the emotion shifts sharply between consecutive utterances, indiscriminately introducing preceding emotional information can interfere with the current prediction. This paper proposes a method based on Wav2vec2.0 and contextual emotional information compensation, which selects from the dialogue history the emotional information most relevant to the current utterance and uses it as compensation. First, a contextual information compensation module selects, from the preceding dialogue, the prosodic information of the utterance most likely to influence the emotion of the current one, and a long short-term memory (LSTM) network builds this prosodic information into a contextual emotional compensation representation. Then, the pretrained Wav2vec2.0 model extracts an embedding representation of the current utterance, which is fused with the contextual representation for emotion recognition. The method achieves 69.0% weighted accuracy (WA) on the IEMOCAP dataset, significantly outperforming the baseline models.
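The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the fusion it outlines, under stated assumptions: the class name ContextCompensatedSER, the prosody_dim and ctx_hidden sizes, the facebook/wav2vec2-base checkpoint, and mean pooling over time are illustrative choices, not the paper's published implementation, and the module that selects the most relevant context utterances is omitted.

```python
# A sketch of the pipeline from the abstract, with hypothetical names:
# an LSTM summarizes prosodic features of selected preceding utterances
# into a compensation representation, which is fused (here, concatenated)
# with a Wav2vec2.0 embedding of the current utterance for classification.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class ContextCompensatedSER(nn.Module):
    def __init__(self, num_classes=4, prosody_dim=32, ctx_hidden=128):
        super().__init__()
        # Pretrained Wav2vec2.0 encoder for the current utterance.
        self.wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        w2v_dim = self.wav2vec.config.hidden_size  # 768 for the base model
        # LSTM that turns the prosody sequence of the selected context
        # utterances into a single compensation representation.
        self.ctx_lstm = nn.LSTM(prosody_dim, ctx_hidden, batch_first=True)
        self.classifier = nn.Linear(w2v_dim + ctx_hidden, num_classes)

    def forward(self, waveform, ctx_prosody):
        # waveform: (batch, samples) raw audio of the current utterance
        # ctx_prosody: (batch, n_ctx, prosody_dim) prosodic features of
        # the context utterances judged most relevant to the current one
        w2v_out = self.wav2vec(waveform).last_hidden_state  # (B, T, 768)
        utt_emb = w2v_out.mean(dim=1)                       # pool over time
        _, (h_n, _) = self.ctx_lstm(ctx_prosody)
        ctx_emb = h_n[-1]                                   # (B, ctx_hidden)
        fused = torch.cat([utt_emb, ctx_emb], dim=-1)       # fusion
        return self.classifier(fused)
```

Concatenation is the simplest fusion consistent with the abstract's wording; the paper itself may use a different fusion or pooling scheme.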

Keywords: emotion recognition; dyadic dialogue; emotion compensation; Wav2vec2.0
Received: 2022-11-03
