Funding:Supported by the National Natural Science Foundation of China (11590773, 11590770)
Received:2019-12-20
Revised:2020-06-01

Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder
HAO Xiaoyang, ZHANG Pengyuan. Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder[J]. ACTA ACUSTICA, 2022, 47(3): 405-416. DOI: 10.15949/j.cnki.0371-0025.2022.03.004
Authors:HAO Xiaoyang  ZHANG Pengyuan
Affiliation:1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049
Abstract:Speaker adaptation and speaker labels are two common approaches to multi-speaker speech synthesis. A model obtained by speaker adaptation can only synthesize speech for the speakers it was adapted to, and is not robust enough. Conventional speaker labels require the speaker information to be obtained with supervision, and cannot be learned unsupervised from the speech signal itself. To solve these problems, an autoregressive multi-speaker framework based on a variational autoencoder is proposed. First, speaker information is learned unsupervised by a variational autoencoder and implicitly encoded into speaker labels; these labels, together with linguistic features, are then fed into an autoregressive acoustic model. In addition, the acoustic model adopts multi-task learning of the fundamental frequency to suppress the F0 over-fitting caused by multi-speaker data. Pre-experiments show that the autoregressive structure reduces the cepstral distortion by 1.018 dB, and that F0 multi-task learning reduces the F0 root mean square error by 6.861 Hz. In the subsequent multi-speaker comparison experiments, the proposed method achieves Mean Opinion Score (MOS) ratings of 3.71, 3.55 and 3.15, and Pinyin error rates of 6.71%, 7.54% and 9.87%, on the three multi-speaker sub-tasks, notably improving the quality of synthesized multi-speaker speech.
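The pipeline the abstract describes can be sketched in a few lines of NumPy: a VAE encoder maps reference acoustic frames to a latent speaker embedding via the reparameterization trick, the embedding is concatenated with linguistic features, and an acoustic model with two output heads (spectrum and F0) is trained with a multi-task loss. This is a minimal illustration, not the paper's implementation: all layer shapes, the single linear layers, and the 0.1 F0-loss weight are hypothetical, and the autoregressive feedback of the decoder is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_speaker_embedding(ref_frames, w_mu, w_logvar):
    """Encode reference frames into a latent speaker label (unsupervised)."""
    h = ref_frames.mean(axis=0)            # time-pooled utterance summary
    mu, logvar = h @ w_mu, h @ w_logvar    # posterior parameters
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps    # reparameterization trick
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))  # KL to N(0, I)
    return z, kl

def acoustic_model(linguistic, z, w_spec, w_f0):
    """Predict spectrum and F0 from linguistic features + speaker latent."""
    x = np.concatenate([linguistic, np.tile(z, (len(linguistic), 1))], axis=1)
    return x @ w_spec, x @ w_f0            # two heads: multi-task learning

# Hypothetical dimensions: 80-dim reference frames, 16-dim speaker latent,
# 32-dim linguistic features, 80-dim spectrum, 1-dim F0.
w_mu, w_logvar = rng.standard_normal((80, 16)), rng.standard_normal((80, 16))
w_spec, w_f0 = rng.standard_normal((48, 80)), rng.standard_normal((48, 1))

z, kl = vae_speaker_embedding(rng.standard_normal((50, 80)), w_mu, w_logvar)
spec, f0 = acoustic_model(rng.standard_normal((20, 32)), z, w_spec, w_f0)

# Multi-task objective: spectral term + weighted F0 term + KL regularizer
# (illustrative squared-magnitude losses; targets are omitted in this sketch).
loss = np.mean(spec**2) + 0.1 * np.mean(f0**2) + kl
```

Training the encoder jointly with this loss is what lets the speaker label emerge from the speech signal itself, with no speaker identity supervision.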
Keywords:Speech synthesis  Variational auto-encoder  Autoregressive model  Multi-task learning