Multimodal Feature Distillation Emotion Recognition Method with Global Convolution and Affinity Fusion
Citation: ZHAO Ziping, GAO Tian, WANG Huan. Multimodal Feature Distillation Emotion Recognition Method with Global Convolution and Affinity Fusion[J]. Journal of Signal Processing, 2023, 39(4): 667-677.
Authors: ZHAO Ziping, GAO Tian, WANG Huan
Institution: College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
Funding: National Natural Science Foundation of China (General Program, Grant No. 62071330)
Abstract: To enhance the user experience in human-computer interaction and to meet the needs of diverse applications, interactive devices are gradually incorporating emotional intelligence technologies. Effective integration of industry and technology presupposes the ability to recognize human emotional states correctly, yet this remains a challenging problem. With the rapid development of the multimedia era, more and more available modal information is being applied in emotion recognition systems. This paper therefore proposes a multimodal emotion recognition model based on feature distillation. Since emotional expression is often closely related to the global information of the audio signal, Adaptive Global Convolution (AGC) is proposed to enlarge the effective receptive field, and a Feature Map Importance Analysis (FMIA) module further strengthens emotion-critical features. The Audio Affinity Fusion (AAF) module models affinity fusion weights from the intrinsic correlation between the audio and text modalities, so that the emotional information of the two modalities is fused effectively. In addition, to improve generalization, Feature Distillation (FD) is applied to the multimodal model, exploiting the hidden knowledge carried by the probability distribution of the teacher model to help the student model acquire higher-level semantic features. Finally, the method was evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) emotion dataset and achieved a weighted accuracy (WA) of 75.2% and an unweighted accuracy (UA) of 75.8%, demonstrating its effectiveness in improving emotion recognition performance.
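The abstract names the AGC and FMIA modules but this page gives no implementation details. Below is a minimal sketch of one plausible reading, assuming a depthwise large-kernel 1D convolution over audio frame features for AGC and a squeeze-and-excitation style channel gate for FMIA; the kernel size, reduction ratio, and layer choices are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGC(nn.Module):
    """Adaptive Global Convolution (sketch): a depthwise 1D convolution whose
    kernel spans many audio frames, enlarging the effective receptive field.
    Kernel size 31 is an illustrative assumption."""
    def __init__(self, channels: int, kernel_size: int = 31):
        super().__init__()
        # Depthwise large-kernel conv; 'same' padding keeps the frame count.
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        return self.pointwise(F.gelu(self.conv(x)))

class FMIA(nn.Module):
    """Feature Map Importance Analysis (sketch): a squeeze-and-excitation
    style gate that reweights feature maps by their global importance."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=-1))   # pool over frames -> (batch, channels)
        return x * w.unsqueeze(-1)    # emphasize emotion-relevant feature maps
```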

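Similarly, a sketch of how affinity-based audio-text fusion along the lines of AAF might look, assuming both modalities are already projected to a common dimension and that the affinity is a scaled dot product used to align text tokens to the audio time base; the gating form is likewise an assumption.

```python
import torch
import torch.nn as nn

class AAF(nn.Module):
    """Audio Affinity Fusion (sketch): build an audio-text affinity matrix,
    align text tokens to the audio frames, then gate the two streams."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)         # audio -> query
        self.k = nn.Linear(dim, dim)         # text  -> key
        self.gate = nn.Linear(2 * dim, dim)  # learned fusion gate

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (batch, frames, dim); text: (batch, tokens, dim)
        scale = audio.size(-1) ** 0.5
        affinity = (self.q(audio) @ self.k(text).transpose(1, 2)) / scale
        text_aligned = affinity.softmax(dim=-1) @ text   # (batch, frames, dim)
        g = torch.sigmoid(self.gate(torch.cat([audio, text_aligned], dim=-1)))
        return g * audio + (1 - g) * text_aligned        # fused representation
```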
Keywords: multimodal emotion recognition; receptive field; feature distillation; feature fusion
Received: 2022-12-01
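For the feature distillation the abstract describes, a common formulation combines a hard-label loss, a temperature-softened KL term on the teacher's probability distribution, and a feature-matching term. The sketch below shows that standard combination; the temperature and loss weights are illustrative, not taken from the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, labels,
                      T: float = 4.0, alpha: float = 0.5, beta: float = 0.5):
    """Sketch of a combined distillation objective: cross-entropy on hard
    labels, softened KL against the teacher's probability distribution
    (scaled by T^2, as is standard), and MSE between intermediate features."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    feat = F.mse_loss(student_feat, teacher_feat)
    return ce + alpha * kl + beta * feat
```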
