Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Paper title

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Authors

Shun Lei, Yixuan Zhou, Liyang Chen, Zhiyong Wu, Shiyin Kang, Helen Meng

Abstract

Previous work on expressive speech synthesis mainly focuses on the current sentence. The context in adjacent sentences is neglected, resulting in an inflexible speaking style for the same text that lacks speech variation. In this paper, we propose a hierarchical framework to model speaking style from context. A hierarchical context encoder is proposed to explore a wider range of contextual information by considering the structural relationships in context, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn a better style representation, we introduce a novel training strategy with knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

Paper title