Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Paper title

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Authors

Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung, Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas

Abstract

In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
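The abstract names the discretization steps (F0 normalization, clustering of phoneme-level F0, balanced clustering for duration) without giving their details. The following is a minimal sketch of how such steps could fit together, assuming per-speaker z-score normalization, plain 1-D k-means for F0, and equal-frequency bins for duration. These choices, the function names, the cluster count of 5, and the synthetic data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score each phoneme-level F0 value with its own speaker's statistics (assumed scheme)."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out


def kmeans_1d_labels(values, n_clusters, n_iter=50):
    """Plain 1-D k-means; labels are ordered so that 0 is the lowest centroid."""
    values = np.asarray(values, dtype=float)
    # Initialise centroids at evenly spaced quantiles so the result is deterministic.
    centroids = np.quantile(values, np.linspace(0, 1, n_clusters + 2)[1:-1])
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = values[labels == k].mean()
    # Map raw cluster indices to ranks by centroid value.
    rank = np.empty(n_clusters, dtype=int)
    rank[np.argsort(centroids)] = np.arange(n_clusters)
    labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return rank[labels]


def balanced_duration_labels(durations, n_clusters):
    """Equal-frequency binning: each label covers roughly the same number of phonemes."""
    durations = np.asarray(durations, dtype=float)
    edges = np.quantile(durations, np.linspace(0, 1, n_clusters + 1)[1:-1])
    return np.searchsorted(edges, durations, side="right")


# Synthetic phoneme-level features for two hypothetical speakers.
rng = np.random.default_rng(0)
speaker_ids = np.repeat(["spk_a", "spk_b"], 500)
f0 = np.concatenate([rng.normal(120, 15, 500), rng.normal(220, 30, 500)])
durations = rng.gamma(shape=2.0, scale=0.04, size=1000)

f0_norm = normalize_f0_per_speaker(f0, speaker_ids)
f0_labels = kmeans_1d_labels(f0_norm, n_clusters=5)
dur_labels = balanced_duration_labels(durations, n_clusters=5)
```

Because the F0 values are normalized per speaker before clustering, the same label vocabulary is shared across speakers, which is one plausible reading of "speaker-independent clustering". The resulting integer sequences would play the role of the prosodic labels fed to the prosody encoder.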
