Universal Speech Enhancement with Score-based Diffusion

Paper title

Universal Speech Enhancement with Score-based Diffusion

Authors

Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini

Abstract

Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
