Paper Title


Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

Paper Authors

Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

Paper Abstract


Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network-based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable to or better than that of conventional codecs operating at three to four times the rate.
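As a side note on the quoted $600\,\mathrm{bps}$ operating point: for a fixed-rate frame-based codec, the bitrate is simply the frame rate times the bits spent per frame. The sketch below illustrates this accounting; the specific frame rate and bits-per-frame values (50 Hz, 12 bits) are hypothetical assumptions chosen to reproduce the overall figure, not values reported in the paper:

```python
# Illustrative bitrate accounting for a fixed-rate, frame-based neural codec.
# NOTE: the 50 Hz frame rate and 12 quantizer bits per frame are assumptions
# for illustration only; the abstract states just the overall 600 bps figure.

def codec_bitrate(frame_rate_hz: float, bits_per_frame: int) -> float:
    """Total bitrate in bits per second for a fixed-rate frame code."""
    return frame_rate_hz * bits_per_frame

# A 50 Hz frame rate with 12 bits per quantized frame gives the
# 600 bps operating point discussed in the abstract.
rate = codec_bitrate(50.0, 12)
print(rate)  # 600.0
```

The same identity lets one read the "three to four times the rate" comparison as conventional codecs spending roughly 1.8-2.4 kbps for comparable subjective quality.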
