HooliGAN: Robust, High Quality Neural Vocoding

Paper Title

HooliGAN: Robust, High Quality Neural Vocoding

Authors

Ollie McCarthy, Zohaib Ahmed

Abstract

Recent developments in generative models have shown that deep learning combined with traditional digital signal processing (DSP) techniques can generate convincing violin samples [1], that source-excitation modeling combined with WaveNet yields high-quality vocoders [2, 3], and that generative adversarial network (GAN) training can improve naturalness [4, 5]. By combining the ideas in these models we introduce HooliGAN, a robust vocoder that achieves state-of-the-art results, fine-tunes very well to smaller datasets (<30 minutes of speech data), and generates audio at 2.2 MHz on GPU and 35 kHz on CPU (that is, 2.2 million and 35 thousand samples per second, respectively). We also show a simple modification to Tacotron-based models that allows seamless integration with HooliGAN. Results from our listening tests show the proposed model's ability to consistently output high-quality audio across a variety of datasets, big and small. We provide samples at the following demo page: https://resemble-ai.github.io/hooligan_demo/
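
The generation speeds quoted in the abstract are throughputs in samples per second, so they can be read as a real-time factor once an output sample rate is fixed. The sketch below is illustrative arithmetic only: the abstract does not state the output sample rate, so the 22.05 kHz value and the helper name `real_time_factor` are assumptions, not details from the paper.

```python
def real_time_factor(samples_per_second: float, sample_rate_hz: float) -> float:
    """Return seconds of audio produced per second of wall-clock time."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return samples_per_second / sample_rate_hz


# Throughput figures quoted in the abstract (samples generated per second).
GPU_SAMPLES_PER_SECOND = 2.2e6
CPU_SAMPLES_PER_SECOND = 35e3

# ASSUMPTION: 22.05 kHz output audio. The abstract does not give the sample
# rate, so substitute the real value before drawing any conclusions.
ASSUMED_SAMPLE_RATE_HZ = 22_050

gpu_rtf = real_time_factor(GPU_SAMPLES_PER_SECOND, ASSUMED_SAMPLE_RATE_HZ)
cpu_rtf = real_time_factor(CPU_SAMPLES_PER_SECOND, ASSUMED_SAMPLE_RATE_HZ)

print(f"GPU: {gpu_rtf:.1f}x real time")
print(f"CPU: {cpu_rtf:.2f}x real time")
```

A factor above 1 means audio is generated faster than it plays back. Under the assumed sample rate both figures clear that bar, but the exact factors scale inversely with whatever rate the model actually uses.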
