Paper Title

TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis

Authors

Qiao Tian, Yi Chen, Zewang Zhang, Heng Lu, Linghui Chen, Lei Xie, Shan Liu

Abstract

Recently, GAN-based speech synthesis methods, such as MelGAN, have become very popular. Compared to conventional autoregressive methods, generators built on parallel structures make the waveform generation process fast and stable. However, the quality of speech generated by autoregressive neural vocoders, such as WaveRNN, is still higher than that of GANs. To address this issue, we propose a novel vocoder model, TFGAN, which is adversarially learned in both the time and frequency domains. On one hand, we propose to discriminate the ground-truth waveform from the synthetic one in the frequency domain rather than only in the time domain, offering more consistency guarantees. On the other hand, in contrast to the conventional frequency-domain STFT loss or the discriminator feature-map loss used to learn the waveform, we propose a set of time-domain losses that encourage the generator to capture the waveform directly. TFGAN has nearly the same synthesis speed as MelGAN, but its fidelity is significantly improved by our novel learning method. In our experiments, TFGAN achieves a mean opinion score (MOS) comparable to an autoregressive vocoder in a speech synthesis context.
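The abstract does not spell out the exact time-domain loss terms or the frequency-domain discriminator design, so the PyTorch sketch below is only a minimal illustration of the two ideas it mentions: a set of simple waveform-level loss terms, and a magnitude spectrogram that a frequency-domain discriminator could take as input. All function names, frame sizes, and weightings here are hypothetical, not the paper's actual formulation.

```python
# Illustrative sketch only; the paper's actual losses and discriminator are not
# defined in this abstract. Shapes assume waveforms of shape (batch, samples).
import torch
import torch.nn.functional as F

def time_domain_losses(fake_wav: torch.Tensor, real_wav: torch.Tensor) -> torch.Tensor:
    """Combine simple waveform-level terms (all choices here are hypothetical)."""
    # Sample-level L1 between generated and ground-truth waveforms.
    sample_loss = F.l1_loss(fake_wav, real_wav)

    # Frame-level energy match over non-overlapping frames (240 samples, illustrative).
    frame = 240
    n = (fake_wav.shape[-1] // frame) * frame
    fake_e = fake_wav[..., :n].reshape(*fake_wav.shape[:-1], -1, frame).pow(2).mean(-1)
    real_e = real_wav[..., :n].reshape(*real_wav.shape[:-1], -1, frame).pow(2).mean(-1)
    energy_loss = F.l1_loss(fake_e, real_e)

    # First-order sample difference, a rough proxy for local waveform shape.
    diff_loss = F.l1_loss(fake_wav[..., 1:] - fake_wav[..., :-1],
                          real_wav[..., 1:] - real_wav[..., :-1])

    return sample_loss + energy_loss + diff_loss

def spectrogram_for_discriminator(wav: torch.Tensor, n_fft: int = 1024) -> torch.Tensor:
    """Magnitude spectrogram that a frequency-domain discriminator could consume."""
    spec = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                      window=torch.hann_window(n_fft, device=wav.device),
                      return_complex=True)
    return spec.abs()
```

In a training loop of this kind, the time-domain terms would be added to the adversarial losses from both the waveform-level and spectrogram-level discriminators; the weighting between them is a design choice the abstract leaves open.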
