
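A non-intrusive speech quality evaluation algorithm combining auxiliary-target learning and a convolutional recurrent network
Cite this article: Tang Guichen, Liang Ruiyu, Kong Fanliu, Xie Yue, Ju Mengjie. A non-intrusive speech quality evaluation algorithm combining auxiliary-target learning and a convolutional recurrent network [J]. Acta Acustica (声学学报), 2022, 47(5): 692-702.
Authors: Tang Guichen  Liang Ruiyu  Kong Fanliu  Xie Yue  Ju Mengjie
Affiliation: 1. School of Information and Communication Engineering, Nanjing Institute of Technology, Nanjing 211167;
Funding: Supported by the National Key Research and Development Program of China (2020YFC2004002), the National Natural Science Foundation of China (62001215), and the Nanjing Institute of Technology University-level Scientific Research Fund (CKJC202001)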
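Abstract: Objective evaluation of speech quality can replace costly human scoring, but current objective metrics usually require a clean reference signal, which is hard to obtain in many practical acoustic systems. This paper therefore proposes a non-intrusive speech quality evaluation algorithm that combines auxiliary-target learning with a convolutional recurrent network (CRN). To reduce complexity, the algorithm uses Bark-frequency cepstral coefficients (BFCCs), computed with filters that mimic human auditory characteristics, as the CRN input. The algorithm first builds a convolutional neural network (CNN) to extract frame-level features from the BFCCs. It then builds a bidirectional long short-term memory network to model long-term temporal dependencies and sequence characteristics in the frame-level features. Finally, a self-attention mechanism adaptively selects useful information from the frame-level features, integrates it into utterance-level features, and maps those utterance-level features to an objective score. To improve the effectiveness of the quality assessment, the algorithm adopts a multi-task training strategy and introduces voice activity detection (VAD) as an auxiliary learning target. Experiments on open-source databases show that, compared with other non-intrusive algorithms, the proposed algorithm correlates better with the mean opinion score (MOS). In addition, the algorithm has a small parameter count and generalizes well to the distorted-speech database with subjective MOS released under ITU-T P.808, approaching the accuracy of the Perceptual Evaluation of Speech Quality (PESQ) metric.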

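Keywords: speech quality    non-intrusive speech evaluation    speech enhancement    auxiliary-target learning    convolutional recurrent network
Received: 2021-12-13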
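
The abstract's last pipeline stage, which weights frame-level features and merges them into a single utterance-level vector before mapping it to a score, can be illustrated with a small attention-pooling sketch. This is not the authors' implementation: it uses a single hypothetical learned `query` vector with dot-product scoring rather than a full self-attention layer, and the toy `frames`, `query`, `out_weights`, and `out_bias` values are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors (each a list of D floats), standing in
            for the recurrent-network outputs (assumption).
    query:  hypothetical learned vector of length D used to score each frame.
    """
    # Dot-product relevance score for every frame.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of the frames, one dimension at a time.
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def predict_score(pooled, out_weights, out_bias):
    """Linear map from the utterance-level vector to a scalar quality score."""
    return sum(p * w for p, w in zip(pooled, out_weights)) + out_bias


# Toy example: three frames with two features each.
frames = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]
query = [1.0, 0.5]
pooled, weights = attention_pool(frames, query)
score = predict_score(pooled, out_weights=[1.0, 1.0], out_bias=3.0)
```

Because the weights sum to one, the pooled vector is a convex combination of the frames, so frames the scoring step considers uninformative contribute little to the final score.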

A non-invasive speech quality evaluation algorithm combining auxiliary target learning and convolutional recurrent network
Institution: 1. School of Information and Communication Engineering, Nanjing Institute of Technology, Nanjing 211167; 2. School of Information Science and Engineering, Southeast University, Nanjing 210096
Abstract: Objective evaluation of speech quality can replace costly human scoring, but current objective metrics usually require a clean reference signal, which is difficult to obtain in many practical acoustic systems. A non-intrusive speech quality evaluation algorithm combining auxiliary-target learning and a Convolutional Recurrent Network (CRN) is proposed. Bark Frequency Cepstral Coefficients (BFCCs), which are based on filters that mimic human hearing, are used as the input of the CRN to reduce network complexity. First, frame-level features are extracted from the BFCCs by a Convolutional Neural Network (CNN). Then, long-term temporal dependencies and sequence characteristics in the frame-level features are modeled by Bidirectional Long Short-Term Memory (BiLSTM) networks. Finally, a self-attention mechanism is introduced into the CRN to adaptively extract useful information from the frame-level features; this information is integrated into utterance-level features and mapped to the final objective score. In addition, a multi-task training strategy is adopted, and Voice Activity Detection (VAD) is introduced as an auxiliary learning target to improve the performance of the algorithm. Experiments on public databases show that, compared with other non-intrusive algorithms, the proposed algorithm correlates better with the mean opinion score (MOS). Moreover, it has a small parameter count and generalizes well to the distorted-speech database with MOS labels released under ITU-T P.808, with accuracy close to that of the Perceptual Evaluation of Speech Quality (PESQ).
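
The multi-task strategy mentioned in the abstract, with MOS prediction as the main target and VAD as the auxiliary target, can be sketched as a weighted sum of two losses. The abstract does not give the actual loss functions or weighting, so the choices below (squared error for the MOS term, mean binary cross-entropy for the frame-level VAD term, and the `aux_weight` parameter) are assumptions for illustration only.

```python
import math


def squared_error(pred, target):
    """Squared error between predicted and reference MOS for one utterance."""
    return (pred - target) ** 2


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over frames.

    probs:  predicted speech-presence probability per frame.
    labels: 1 for speech frames, 0 for non-speech frames.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def multitask_loss(mos_pred, mos_true, vad_probs, vad_labels, aux_weight=0.5):
    """Main MOS loss plus a weighted auxiliary VAD loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    quality_term = squared_error(mos_pred, mos_true)
    vad_term = binary_cross_entropy(vad_probs, vad_labels)
    return quality_term + aux_weight * vad_term


# Toy example: one utterance with three frames.
loss = multitask_loss(3.4, 3.0, [0.9, 0.8, 0.1], [1, 1, 0])
perfect = multitask_loss(3.0, 3.0, [1.0, 1.0, 0.0], [1, 1, 0])
```

In a setup like this, the auxiliary term only shapes the shared frame-level representation during training; at inference time only the MOS output would be needed.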