Paper Title

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

Authors

Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana

Abstract
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed. Code and audio samples are available from https://github.com/MasayaKawamura/MB-iSTFT-VITS.
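To illustrate the core idea of replacing the most expensive upsampling layers with an inverse short-time Fourier transform, here is a minimal NumPy sketch of STFT analysis and overlap-add iSTFT synthesis. This is not the authors' implementation (which predicts the spectrogram with a neural decoder and uses multi-band generation); the window length, hop size, and function names are illustrative assumptions, chosen so that the squared-window overlap-add normalization reconstructs the signal exactly in the interior.

```python
import numpy as np

def stft(x, n_fft=16, hop=4):
    """Analysis: window each frame and take the real FFT."""
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(win * x[i:i + n_fft])
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def istft(S, n_fft=16, hop=4):
    """Synthesis: inverse-FFT each frame, window again,
    overlap-add, and normalize by the summed squared window."""
    win = np.hanning(n_fft)
    n = hop * (len(S) - 1) + n_fft
    y = np.zeros(n)
    norm = np.zeros(n)
    for k, frame in enumerate(S):
        seg = np.fft.irfft(frame, n_fft)       # recovers win * x_frame
        y[k * hop:k * hop + n_fft] += win * seg
        norm[k * hop:k * hop + n_fft] += win ** 2
    return y / np.maximum(norm, 1e-8)

# Round-trip check on a short sine wave: interior samples are
# reconstructed exactly (edges lack full window overlap).
x = np.sin(2 * np.pi * np.arange(64) * 440 / 16000)
y = istft(stft(x))
```

In the paper's setting, the decoder predicts the complex spectrogram (magnitude and phase) directly, so only the cheap `istft` step above remains at waveform resolution; multi-band generation further splits this work across sub-band signals combined by fixed or trainable synthesis filters.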
