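Speech emotion recognition based on the STA-CRNN model*

Cite this article:	Zhang Zhihao, Wang Kunxia. Speech emotion recognition based on the STA-CRNN model*[J]. 应用声学 (Journal of Applied Acoustics), 2022, 41(5): 843-850.

Authors:	Zhang Zhihao (张志浩), Wang Kunxia (王坤侠)

Affiliations:	Anhui Jianzhu University, Anhui Jianzhu University

Funding:	National Natural Science Foundation of China (62001004); Academic Funding Project for Top Talents in Disciplines (Majors) of Anhui Provincial Universities (gxbjZD2021067); Anhui Jianzhu University Research Development Fund (JZ202118); Key Natural Science Research Project of Anhui Provincial Universities (KJ2020A0470); Open Project of the Anhui Provincial Key Laboratory of Building Acoustic Environment, Anhui Jianzhu University (AAE2021ZR02)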

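Abstract:	Speech emotion recognition plays an important role in human-computer interaction and affective computing research, and a wide range of methods has been proposed. Recently, researchers have applied convolutional neural networks and long short-term memory networks to extract spatial and temporal features from log-Mel spectrograms, with some success. However, both convolutional neural networks and long short-term memory networks produce redundant features during extraction, which degrades speech emotion recognition performance. To address this problem, the paper proposes a convolutional recurrent neural network model based on a spatiotemporal attention mechanism. The model takes the log-Mel spectrogram together with its first-order and second-order differences as the feature input. Spatial attention is added where the convolutional neural network extracts spatial features, and temporal attention is added where the long short-term memory network extracts temporal features, so that the networks can better extract the spatial and temporal features in the log-Mel spectrogram that effectively represent emotion. On the Emo-DB and IEMOCAP speech datasets, the model reaches weighted accuracies of 86.8% and 69.4% and unweighted accuracies of 84.7% and 65.5%, respectively, outperforming most current advanced methods.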
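
The feature input described in the abstract (log-Mel spectrogram plus its first-order and second-order differences) can be illustrated with a minimal NumPy sketch. The abstract does not give the difference formula, the number of Mel bands, or the frame count, so the central-difference estimate (`np.gradient`), the 40 x 300 shape, and the function names below are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np


def delta(features: np.ndarray) -> np.ndarray:
    """Difference along the time axis (axis=1).

    Assumption: central differences via np.gradient; the paper may use a
    different (e.g. regression-based) delta formula.
    """
    return np.gradient(features, axis=1)


def stack_log_mel_with_deltas(log_mel: np.ndarray) -> np.ndarray:
    """Stack static, first-order, and second-order features as 3 channels.

    log_mel: array of shape (n_mels, n_frames)
    returns: array of shape (3, n_mels, n_frames)
    """
    d1 = delta(log_mel)   # first-order difference
    d2 = delta(d1)        # second-order difference
    return np.stack([log_mel, d1, d2], axis=0)


# Placeholder input: random values standing in for a real log-Mel spectrogram.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 300))  # hypothetical 40 Mel bands, 300 frames
features = stack_log_mel_with_deltas(log_mel)
print(features.shape)  # (3, 40, 300)
```

The three-channel layout mirrors an RGB image, which is one common way to feed such features to a convolutional front end.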
Keywords:	Speech emotion recognition; log-Mel spectrogram; spatiotemporal attention; temporal features; spatial features
Received:	2022/3/15
Revised:	2022/9/2
Speech emotion recognition based on STA-CRNN model

Institution:	Anhui Jianzhu University,Anhui Jianzhu University

Abstract:	Speech emotion recognition (SER) plays an important role in human-computer interaction and affective computing, and many new methods have emerged. Recently, researchers have applied convolutional neural networks (CNN) and long short-term memory (LSTM) networks to extract spatial and temporal features from Log-Mel spectrograms, with promising results. However, the features extracted by CNN and LSTM networks contain redundancy, which reduces speech emotion recognition performance. In this paper, we propose a convolutional recurrent neural network model based on a spatiotemporal attention mechanism (STA-CRNN). The Log-Mel spectrogram and its first-order and second-order differences are used as the feature input. We extract spatial features with a CNN and temporal features with an LSTM, and adopt spatial and temporal attention mechanisms to further decrease feature redundancy. Experimental results show that the weighted accuracy (WA) of the model on the Emo-DB and IEMOCAP speech databases is 86.8% and 69.4%, respectively, and the unweighted accuracy (UA) is 84.7% and 65.5%, respectively. The proposed STA-CRNN model outperforms most current advanced methods for SER.
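
The abstract reports both WA and UA without defining them. A common convention in the SER literature, assumed here rather than taken from the paper, is that WA is the overall fraction of correctly classified utterances and UA is the mean of the per-class recalls. The sketch below follows that convention; the labels are made-up toy data, not results from Emo-DB or IEMOCAP.

```python
import numpy as np


def weighted_accuracy(y_true, y_pred) -> float:
    """Overall accuracy: correct predictions divided by total utterances."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def unweighted_accuracy(y_true, y_pred) -> float:
    """Mean per-class recall, so every emotion class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Toy, imbalanced example with three hypothetical emotion classes (0, 1, 2).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 1]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")  # WA = 0.800, UA = 0.667
```

On imbalanced data the two numbers diverge, as in this toy case, which is one reason SER papers typically report both.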

Keywords:	Speech emotion recognition; Log-Mel; Spatiotemporal attention; Time features; Spatial features
