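Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis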

Paper Title

Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis

Authors

Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh, Ji-Hoon Kim, Seong-Whan Lee

Abstract

While generative adversarial network (GAN)-based neural text-to-speech (TTS) systems have shown significant improvement in neural speech synthesis, no TTS system learns to synthesize speech from text sequences with only adversarial feedback. Because adversarial feedback alone is not sufficient to train the generator, current models still require a reconstruction loss that directly compares the ground-truth and generated mel-spectrograms. In this paper, we present Multi-SpectroGAN (MSG), which can train a multi-speaker model with only adversarial feedback by conditioning a conditional discriminator on a self-supervised hidden representation of the generator. This leads to better guidance for generator training. Moreover, we propose adversarial style combination (ASC) for better generalization to unseen speaking styles and transcripts, which can learn latent representations of the combined style embedding from multiple mel-spectrograms. Trained with ASC and feature matching, MSG synthesizes high-diversity mel-spectrograms by controlling and mixing individual speaking styles (e.g., duration, pitch, and energy). The results show that MSG synthesizes high-fidelity mel-spectrograms with almost the same naturalness MOS as the ground-truth mel-spectrograms.
