Paper Title
Fine-grained Noise Control for Multispeaker Speech Synthesis
Paper Authors
Paper Abstract
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre, from any residual factors, such as recording conditions and background noise. This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling. We incorporate adversarial training, a representation bottleneck and utterance-to-frame modeling in order to learn frame-level noise representations. To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE), which additionally results in more expressive speech synthesis.