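A single-channel speech enhancement method combining accurate ratio masking and a deep neural network
Cite this article: BO Haojun, ZHANG Tianqi, LIU Jianxing, YE Shaopeng. A single-channel speech enhancement method combining accurate ratio masking and a deep neural network[J]. Acta Acustica (声学学报), 2022, 47(3): 394-404. DOI: 10.15949/j.cnki.0371-0025.2022.03.009
Authors: BO Haojun (柏浩钧)  ZHANG Tianqi (张天骐)  LIU Jianxing (刘鉴兴)  YE Shaopeng (叶绍鹏)
Affiliation: 1. Chongqing Key Laboratory of Signal and Information Processing, School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065
Funding: Supported by the National Natural Science Foundation of China (61671095, 61702065, 61701067, 61771085); the Chongqing Municipal Key Laboratory of Signal and Information Processing Construction Project (CSTC2009CA2003); the Natural Science Foundation of Chongqing (cstc2021jcyj-msxmX0836); and the Scientific Research Project of the Chongqing Municipal Education Commission (KJ1600427, KJ1600429).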
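Abstract: Current supervised speech enhancement methods overlook how the similarity between the magnitude spectra of clean speech, noise, and noisy speech affects enhancement performance. To address this, a speech enhancement method combining accurate ratio masking (ARM) with a deep neural network (DNN) is proposed. The method uses the normalized cross-correlation coefficients between the magnitude spectra of clean speech and noisy speech, and of noise and noisy speech, to design an accurate ratio mask, based on the time-frequency-domain ideal ratio mask, as the target mask. A DNN trained to predict the clean speech and noise magnitude spectra then serves as the baseline: its outputs are used to estimate the target mask, the baseline DNN and the target mask are jointly optimized, and the enhanced speech is estimated from the noisy speech with the target mask. In addition, to exploit the discriminative information between clean speech and noise, a discriminative training function replaces the mean square error (MSE) as the objective function of the baseline DNN, so that the network output is more accurate. Experiments show that the discriminative training function improves the enhancement performance of both the baseline DNN and the whole jointly optimized network. Under both matched and mismatched noise, the proposed method achieves higher average Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) scores than other common DNN methods; the enhanced speech retains more speech components, and noise is suppressed more effectively.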

Keywords: speech enhancement  deep neural network  accurate ratio masking  discriminative training
Received: 2020-11-16
Revised: 2021-07-19

Speech enhancement combining accurate ratio masking and deep neural network
BO Haojun, ZHANG Tianqi, LIU Jianxing, YE Shaopeng. Speech enhancement combining accurate ratio masking and deep neural network[J]. ACTA ACUSTICA, 2022, 47(3): 394-404. DOI: 10.15949/j.cnki.0371-0025.2022.03.009
Authors:BO Haojun  ZHANG Tianqi  LIU Jianxing  YE Shaopeng
Affiliation: 1. School of Communication and Information Engineering, Chongqing Key Laboratory of Signal and Information Processing (CQKLS&IP), Chongqing University of Posts and Telecommunications (CQUPT), Chongqing 400065
Abstract: Recent supervised speech enhancement methods neglect the effect that the similarity between the amplitude spectra of clean speech, noise, and noisy speech has on enhancement performance. To address this problem, a method combining Accurate Ratio Masking (ARM) and a Deep Neural Network (DNN) is proposed for monaural speech enhancement. First, an accurate ratio mask based on the ideal ratio mask in the time-frequency domain is designed, using the normalized cross-correlation coefficients of the amplitude spectra between clean speech and noisy speech, and between noise and noisy speech. Then, the target mask is estimated from the output of a baseline DNN that takes the amplitude spectra of clean speech and noise as its training targets; the baseline DNN and the target mask are jointly optimized, and the enhanced speech is obtained by applying the target mask to the noisy speech. Moreover, to exploit the discriminative information between clean speech and noise, a discriminative training function replaces the Mean Square Error (MSE) as the objective function of the baseline DNN, making the network output more accurate. The experimental results show that the discriminative training function improves the enhancement performance of both the baseline DNN and the overall jointly optimized network. Under matched and mismatched noise, the proposed method achieves higher average Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) scores than other common DNN methods, and the enhanced speech retains more speech components while suppressing noise more effectively.
Keywords:Speech enhancement  Deep neural network  Accurate ratio masking  Discriminative training
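
The abstract names three building blocks without giving formulas: the ideal ratio mask, a normalized cross-correlation coefficient between amplitude spectra, and a discriminative training objective. The sketch below illustrates one plausible form of each in Python with NumPy. It is not the authors' ARM definition, which the abstract does not state. The function names, the per-frame correlation, the square-root form of the ideal ratio mask, the loss form (own-target error minus a weighted cross-target error), and the weight `gamma` are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero in silent bins


def ideal_ratio_mask(speech_mag, noise_mag):
    """Ideal ratio mask per time-frequency bin (assumed square-root form).

    Inputs are magnitude spectra shaped (freq_bins, frames).
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + EPS))


def normalized_cross_correlation(a_mag, b_mag):
    """Per-frame normalized cross-correlation of two magnitude spectra.

    Returns one coefficient per frame (column); for non-negative
    inputs the value lies in [0, 1].
    """
    num = np.sum(a_mag * b_mag, axis=0)
    den = np.sqrt(np.sum(a_mag ** 2, axis=0) * np.sum(b_mag ** 2, axis=0)) + EPS
    return num / den


def discriminative_loss(speech_est, noise_est, speech_ref, noise_ref, gamma=0.05):
    """Assumed discriminative objective: error against the matching
    target, minus gamma times the error against the competing target.
    With gamma = 0 this reduces to a plain MSE objective.
    """
    match = np.mean((speech_est - speech_ref) ** 2) + np.mean((noise_est - noise_ref) ** 2)
    cross = np.mean((speech_est - noise_ref) ** 2) + np.mean((noise_est - speech_ref) ** 2)
    return match - gamma * cross


# Toy usage with random magnitude spectra (5 frequency bins, 4 frames).
rng = np.random.default_rng(0)
speech = rng.random((5, 4)) + 0.1
noise = rng.random((5, 4)) + 0.1
noisy = speech + noise  # crude stand-in; real mixtures add in the waveform domain

mask = ideal_ratio_mask(speech, noise)
enhanced = mask * noisy  # a mask is applied bin by bin to the noisy magnitude
rho_speech = normalized_cross_correlation(speech, noisy)
rho_noise = normalized_cross_correlation(noise, noisy)
```

In this sketch `rho_speech` and `rho_noise` are the kind of similarity measures the abstract says the ARM builds on; how the paper combines them with the ideal ratio mask would have to be taken from the full text.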