Paper Title
Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis
Paper Authors
Paper Abstract
Although prosody is related to linguistic information up to the discourse structure, most text-to-speech (TTS) systems only take into account the information within each sentence, which makes it challenging to convert a paragraph of text into natural and expressive speech. In this paper, we propose using the text embeddings of neighbouring sentences to improve prosody generation for each utterance of a paragraph in an end-to-end fashion, without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, produced by an additional CU encoder from the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which lead to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the LJ-Speech English audiobook dataset demonstrate that the use of CU information can improve the naturalness and expressiveness of the synthesized speech. Subjective listening tests show that most participants prefer the speech generated using the CU encoder over that generated using standard Tacotron2. It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.
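The data flow described in the abstract can be sketched as follows. This is a minimal, toy illustration in pure Python, not the paper's actual model: `bert_sentence_embedding`, `cu_encoder`, the mean-then-project structure, and all dimensions are hypothetical stand-ins chosen only to show how CU context vectors from neighbouring sentences could augment a decoder input.

```python
import random

EMB_DIM = 4   # toy stand-in for the BERT sentence-embedding size
CTX_DIM = 3   # toy stand-in for the CU context-vector size

def bert_sentence_embedding(sentence):
    """Stand-in for a pre-trained BERT sentence embedding
    (deterministic pseudo-random vector, NOT a real BERT model)."""
    rng = random.Random(sum(ord(c) for c in sentence))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

def cu_encoder(neighbour_embs, weights):
    """Toy CU encoder: average the neighbouring sentence embeddings,
    then apply a linear projection to get a CU context vector."""
    n = len(neighbour_embs)
    mean = [sum(e[d] for e in neighbour_embs) / n for d in range(EMB_DIM)]
    return [sum(weights[o][d] * mean[d] for d in range(EMB_DIM))
            for o in range(CTX_DIM)]

def augment_decoder_input(decoder_frame, cu_context):
    """Concatenate the CU context vector onto a decoder input frame,
    as the abstract describes for the Tacotron2 decoder input."""
    return decoder_frame + cu_context

# Embed the sentences surrounding the target utterance of a paragraph.
paragraph = ["First sentence.", "Target sentence.", "Third sentence."]
target_idx = 1
neighbours = [bert_sentence_embedding(s)
              for i, s in enumerate(paragraph) if i != target_idx]

# Fixed toy projection weights (a trained CU encoder would learn these).
weights = [[0.1 * (o + d + 1) for d in range(EMB_DIM)] for o in range(CTX_DIM)]
context = cu_encoder(neighbours, weights)

frame = [0.0, 0.0]                      # one toy decoder input frame
augmented = augment_decoder_input(frame, context)
```

Changing the neighbouring sentences changes `context`, and therefore the augmented decoder input, which mirrors the abstract's observation that prosody can be controlled indirectly through the surrounding text.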