Multilingual text-to-waveform generation with cross-speaker prosody transfer
Cite this article: SHANG Zengqiang, ZHANG Pengyuan, WANG Li. Multilingual text-to-waveform generation with cross-speaker prosody transfer[J]. ACTA ACUSTICA, 2024, 49(1): 171-180. DOI: 10.12395/0371-0025.2022146
Authors: SHANG Zengqiang, ZHANG Pengyuan, WANG Li
Affiliation: 1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049
Funding: Supported by the National Key R&D Program of China (2021YFC3320102, 2021YFC3320103)
Abstract:

In multilingual speech synthesis, single-speaker multilingual data are scarce, which makes it difficult for one voice to support synthesis in multiple languages. Unlike previous methods that decouple timbre and pronunciation only within the acoustic model, this paper proposes an end-to-end multilingual speech synthesis method incorporating cross-speaker prosody transfer: a two-level hierarchical conditional variational autoencoder directly models the generation process from text to waveform while disentangling timbre, pronunciation, and prosody. The method improves the prosody of cross-lingual synthesis by transferring the prosody style of existing speakers in the target language. Experiments show that the proposed model achieves mean opinion scores of 3.91 for naturalness and 4.01 for similarity in cross-lingual speech generation, and reduces the word error rate of cross-lingual synthesis to 5.85% compared with the baselines. Prosody-transfer and ablation experiments further confirm the effectiveness of the method.



Keywords: Multilingual speech synthesis; Prosody transfer; Variational autoencoder; Prosody decoupling
Received: 2022-11-21
Revised: 2023-04-20

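The two-level hierarchical conditional variational autoencoder described in the abstract can be illustrated with a minimal sketch. All dimensions, layer shapes, and the use of plain linear maps below are hypothetical stand-ins for the trained networks, not the authors' implementation; the sketch only shows the structural idea: an utterance-level latent conditioned on text, speaker (timbre), and a prosody reference, feeding a second, finer latent that decodes to the waveform, with timbre re-injected only at the decoder so it stays separate from the prosody path.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps (the standard VAE reparameterization trick)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical dimensions for text, speaker, prosody, the two latents, waveform
TXT, SPK, PRO, Z1, Z2, WAV = 8, 4, 4, 6, 6, 16

# Random weights standing in for trained encoder/decoder networks
W1 = rng.standard_normal((TXT + SPK + PRO, 2 * Z1)); b1 = np.zeros(2 * Z1)
W2 = rng.standard_normal((Z1 + TXT, 2 * Z2));        b2 = np.zeros(2 * Z2)
Wd = rng.standard_normal((Z2 + SPK, WAV));           bd = np.zeros(WAV)

def synthesize(text, speaker, prosody_ref):
    """Two-level hierarchical conditional sampling:
    level 1 (prosody-level latent) -> level 2 (acoustic latent) -> waveform."""
    h1 = linear(np.concatenate([text, speaker, prosody_ref]), W1, b1)
    mu1, logvar1 = h1[:Z1], h1[Z1:]
    z1 = reparameterize(mu1, logvar1, rng)      # utterance/prosody-level latent
    h2 = linear(np.concatenate([z1, text]), W2, b2)
    mu2, logvar2 = h2[:Z2], h2[Z2:]
    z2 = reparameterize(mu2, logvar2, rng)      # finer acoustic-level latent
    # Timbre is injected again only at the decoder, keeping it
    # decoupled from the prosody conditioning path.
    return linear(np.concatenate([z2, speaker]), Wd, bd)

text = rng.standard_normal(TXT)
voice_a = rng.standard_normal(SPK)          # target timbre
prosody_native = rng.standard_normal(PRO)   # prosody reference from a native speaker
wave = synthesize(text, voice_a, prosody_native)
print(wave.shape)  # (16,)
```

Under this reading, cross-speaker prosody transfer amounts to pairing `voice_a` with a `prosody_ref` drawn from a different speaker who is native in the target language; the disentanglement the paper reports is what makes such a pairing meaningful.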
