SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech

Paper title

SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech

Authors

Choi, Byoung Jin; Jeong, Myeonghun; Lee, Joun Yeop; Kim, Nam Soo

Abstract

Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristic of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to utilize the functions which predict the scale and bias parameters of the affine coupling layers according to the given speaker embedding vector. In this letter, we improve on the previous speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer which allows for unseen speaker speech synthesis in a zero-shot manner leveraging a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes the input by the parameters predicted from a speaker embedding vector while training, enabling an inverse process of denormalizing for a new speaker embedding at inference. The proposed conditioning scheme yields the state-of-the-art performance in terms of the speech quality and speaker similarity in a ZSM-TTS setting.
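The normalize-then-couple idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions `D` and `E`, the random linear maps standing in for learned networks, and the helper names `speaker_stats`, `coupling_params`, `forward`, and `inverse` are all assumptions made for illustration. The forward pass normalizes the input with a mean and scale predicted from the speaker embedding and then applies an ordinary affine coupling; the inverse pass undoes the coupling and denormalizes with the statistics of whichever speaker embedding is supplied.

```python
import math
import random

random.seed(0)
D, E = 3, 4  # half-width of the feature vector, speaker-embedding size (hypothetical)

def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(v, W):
    """Row vector v times matrix W (len(v) rows)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

# Assumption: fixed random linear maps stand in for the learned predictors.
W_MU, W_LOGSIG = rand_matrix(E, 2 * D), rand_matrix(E, 2 * D)
W_S, W_B = rand_matrix(D, D), rand_matrix(D, D)

def speaker_stats(g):
    """Per-channel mean and positive scale predicted from speaker embedding g."""
    mu = matvec(g, W_MU)
    sigma = [math.exp(v) for v in matvec(g, W_LOGSIG)]
    return mu, sigma

def coupling_params(xa_n):
    """Log-scale and bias for the second half, computed from the first half."""
    log_s = [math.tanh(v) for v in matvec(xa_n, W_S)]
    b = matvec(xa_n, W_B)
    return log_s, b

def forward(x, g):
    """Training direction: speaker-normalize, then apply the affine coupling."""
    mu, sigma = speaker_stats(g)
    x_n = [(xi - m) / s for xi, m, s in zip(x, mu, sigma)]
    xa_n, xb_n = x_n[:D], x_n[D:]
    log_s, b = coupling_params(xa_n)
    zb = [v * math.exp(ls) + bi for v, ls, bi in zip(xb_n, log_s, b)]
    return xa_n + zb

def inverse(z, g):
    """Inference direction: invert the coupling, then denormalize with g's stats."""
    xa_n, zb = z[:D], z[D:]
    log_s, b = coupling_params(xa_n)
    xb_n = [(v - bi) * math.exp(-ls) for v, ls, bi in zip(zb, log_s, b)]
    mu, sigma = speaker_stats(g)
    return [v * s + m for v, m, s in zip(xa_n + xb_n, mu, sigma)]

x = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1]
g_train = [0.2, -0.4, 1.0, 0.3]   # embedding used in the forward pass
g_new = [-0.9, 0.6, 0.1, 1.5]     # a different, "unseen" embedding

z = forward(x, g_train)
x_same = inverse(z, g_train)      # recovers x up to floating-point error
x_other = inverse(z, g_new)       # same latent, denormalized for another speaker
```

In this sketch the speaker embedding enters only through the normalization statistics, so the coupling itself operates on speaker-normalized values and a new embedding can be swapped in at inference without changing the coupling parameters.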
