Paper Title
Expressive TTS Training with Frame and Style Reconstruction Loss
Paper Authors
Paper Abstract
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of reference, which makes explicit modeling difficult. The proposed technique does not require prosody annotations in the training data. Nor does it attempt to model prosody explicitly; instead, it encodes the association between input text and its prosody style using a Tacotron-based TTS framework. Our proposed idea marks a departure from the style token paradigm, where prosody is explicitly modeled by a bank of prosody embeddings. The proposed training strategy adopts a combination of two objective functions: 1) a frame-level reconstruction loss, which is calculated between the synthesized and target spectral features; and 2) an utterance-level style reconstruction loss, which is calculated between the deep style features of the synthesized and target speech. The proposed style reconstruction loss is formulated as a perceptual loss to ensure that utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms a state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness.
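The abstract describes the training objective as a combination of a frame-level spectral reconstruction loss and an utterance-level style reconstruction (perceptual) loss computed on deep style features. The minimal PyTorch sketch below illustrates one way such a combined objective could be wired up; it is not the authors' implementation. The `StyleEncoder` stands in for any pre-trained network that extracts utterance-level deep style features, and the L1/MSE choices and the `style_weight` hyperparameter are illustrative assumptions.

```python
# Minimal sketch of a combined frame + style reconstruction loss (assumptions:
# an external pre-trained style feature extractor and a simple weighted sum).
import torch
import torch.nn as nn


class CombinedTTSLoss(nn.Module):
    def __init__(self, style_encoder: nn.Module, style_weight: float = 1.0):
        super().__init__()
        self.style_encoder = style_encoder   # frozen, pre-trained deep style feature extractor
        self.style_weight = style_weight     # relative weight of the perceptual style term
        self.frame_loss = nn.L1Loss()        # frame-level spectral reconstruction loss
        self.style_loss = nn.MSELoss()       # perceptual loss on deep style features
        for p in self.style_encoder.parameters():
            p.requires_grad = False          # only the TTS model is trained

    def forward(self, mel_pred: torch.Tensor, mel_target: torch.Tensor) -> torch.Tensor:
        # 1) frame-level reconstruction loss between synthesized and target spectra
        l_frame = self.frame_loss(mel_pred, mel_target)
        # 2) utterance-level style reconstruction loss between deep style features
        #    of synthesized and target speech
        with torch.no_grad():
            style_target = self.style_encoder(mel_target)
        style_pred = self.style_encoder(mel_pred)
        l_style = self.style_loss(style_pred, style_target)
        return l_frame + self.style_weight * l_style
```

In this sketch the style encoder is kept frozen so that it acts purely as a perceptual reference, and gradients flow back to the TTS model through both the spectral term and the style term.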