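Speech emotion recognition using the attention mechanism to fuse the intermediate layers of front-end networks
Cite this article: 朱应俊, 周文君, 朱川, 马建敏. Speech emotion recognition using the attention mechanism to fuse the intermediate layers of front-end networks[J]. 应用声学 (Applied Acoustics), 2023, 42(5): 1090-1098
Authors: 朱应俊  周文君  朱川  马建敏
Affiliation: Department of Aeronautics and Astronautics, Fudan University; Department of Aeronautics and Astronautics, Fudan University; Department of Aeronautics and Astronautics, Fudan University; Department of Aeronautics and Astronautics, Fudan University
Abstract: To help machines better understand human emotions and improve the human-computer interaction experience, speech features and classification networks can be fused to improve emotion recognition performance. From the perspective of network fusion, this paper uses two-dimensional convolutional neural networks based on Mel-frequency cepstral coefficients and on inverted Mel-frequency cepstral coefficients, together with a long short-term memory network based on scattering convolution network coefficients, as front-end networks. The intermediate layers of the front-end networks are extracted as utterance-level feature representations, the squeeze-and-excitation (SE) channel attention mechanism adjusts the weights of these intermediate layers and fuses them, and a deep neural network back-end classifier then outputs the emotion classification result. Comparative experiments with five-fold cross-validation on a Chinese emotion dataset show that network fusion based on the SE channel attention mechanism effectively exploits the strengths of the different front-end networks in speech emotion recognition and improves recognition accuracy.

Keywords: attention mechanism; speech features; network fusion
Received: 2022-06-04
Revised: 2023-08-29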

Speech emotion recognition using the attention mechanism to fuse the intermediate layer of front-end networks
ZHU YINGJUN, ZHOU WENJUN, ZHU CHUAN and MA JIANMIN. Speech emotion recognition using the attention mechanism to fuse the intermediate layer of front-end networks[J]. Applied Acoustics (China), 2023, 42(5): 1090-1098
Authors:ZHU YINGJUN  ZHOU WENJUN  ZHU CHUAN  MA JIANMIN
Affiliation: Department of Aeronautics and Astronautics, Fudan University; Department of Aeronautics and Astronautics, Fudan University; Department of Aeronautics and Astronautics, Fudan University; Department of Aeronautics and Astronautics, Fudan University
Abstract: To enable machines to better understand human emotions and improve the human-computer interaction experience, speech features and classification networks can be fused to improve emotion recognition performance. From the perspective of network fusion, this paper builds three front-end networks: a two-dimensional convolutional neural network (2D-CNN) based on Mel-frequency cepstral coefficients, a 2D-CNN based on inverted Mel-frequency cepstral coefficients, and a long short-term memory network based on scattering convolution network coefficients. The intermediate layers of the front-end networks are then extracted as utterance-level feature representations, and the squeeze-and-excitation (SE) channel attention mechanism is introduced to adjust the weights of the intermediate layers and fuse them. Finally, a back-end classifier based on a deep neural network outputs the emotion classification results. A comparison experiment with five-fold cross-validation was carried out on a Chinese speech emotion data set. The experimental results show that network fusion based on the SE channel attention mechanism can effectively exploit the advantages of the different front-end networks in speech emotion recognition tasks and improve the accuracy of speech emotion recognition.
Keywords: Attention mechanism; Speech feature; Network fusion
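
The fusion step described in the abstract can be illustrated with a minimal sketch of squeeze-and-excitation gating over front-end features. This is not the authors' implementation: the abstract does not say how channels are formed, so the sketch assumes each front-end network contributes one equal-length utterance-level vector that is treated as one channel. The function name `se_fuse`, the dimensions `C`, `D`, `H`, and the random weights are all hypothetical; in a trained system the two weight matrices would be learned jointly with the back-end classifier.

```python
import math
import random


def se_fuse(features, w1, w2):
    """SE-style gating over C front-end feature vectors (illustrative only).

    features: list of C equal-length vectors, one per front-end network.
    w1: H x C bottleneck weights; w2: C x H expansion weights (both assumed).
    Returns the per-channel gates and the gated, concatenated feature vector.
    """
    # Squeeze: reduce each channel to one scalar by global averaging.
    z = [sum(f) / len(f) for f in features]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid.
    h = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h))))
        for row in w2
    ]
    # Scale each channel by its gate and concatenate for the back-end classifier.
    fused = [g * x for g, f in zip(gates, features) for x in f]
    return gates, fused


random.seed(0)
C, D, H = 3, 8, 2  # 3 front-ends, 8-dim features, bottleneck of 2 (all assumed)
features = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(C)]
w1 = [[random.gauss(0.0, 0.5) for _ in range(C)] for _ in range(H)]
w2 = [[random.gauss(0.0, 0.5) for _ in range(H)] for _ in range(C)]
gates, fused = se_fuse(features, w1, w2)
```

Each gate lies strictly between 0 and 1, so a front-end whose features are less useful for a given utterance can be attenuated before the concatenated vector reaches the classifier, rather than being weighted equally with the others.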