DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

Paper title

DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

Authors

Guochen Yu, Yuansheng Guan, Weixin Meng, Chengshi Zheng, Hui Wang

Abstract

Because modeling more frequency bands is difficult and computationally expensive, full-band speech enhancement based on deep neural networks remains challenging. Previous studies usually adopt compressed full-band speech features on the Bark and ERB scales, whose relatively low frequency resolution degrades performance, especially in the high-frequency region. In this paper, we propose a decoupling-style multi-band fusion model to perform full-band speech denoising and dereverberation. Instead of optimizing the full-band speech with a single network structure, we decompose the full-band target into multiple sub-band speech features and then employ a multi-stage chain optimization strategy to estimate the clean spectrum stage by stage. Specifically, the low- (0-8 kHz), middle- (8-16 kHz), and high-frequency (16-24 kHz) regions are mapped by three separate sub-networks and are then fused to obtain the full-band clean target STFT spectrum. Comprehensive experiments on two public datasets demonstrate that the proposed method outperforms previous advanced systems and yields promising performance in terms of speech quality and intelligibility in real, complex scenarios.
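
The band decomposition and fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The 48 kHz sampling rate, the 960-point FFT, the `placeholder_subnet` function, and the way earlier estimates are handed to later stages as `context` are all assumptions made for illustration; only the three band edges come from the abstract.

```python
import numpy as np

SAMPLE_RATE = 48_000  # assumed: full-band audio with a 24 kHz Nyquist frequency
N_FFT = 960           # assumed: 20 ms window, i.e. 50 Hz per STFT bin
BAND_EDGES_HZ = (0, 8_000, 16_000, 24_000)  # low / middle / high bands from the abstract


def band_slices(n_fft=N_FFT, sample_rate=SAMPLE_RATE, edges_hz=BAND_EDGES_HZ):
    """Map band edges in Hz to slices over the STFT frequency axis."""
    n_bins = n_fft // 2 + 1
    hz_per_bin = sample_rate / n_fft
    idx = [int(round(edge / hz_per_bin)) for edge in edges_hz]
    idx[-1] = n_bins  # keep the Nyquist bin in the last band
    return [slice(lo, hi) for lo, hi in zip(idx[:-1], idx[1:])]


def placeholder_subnet(noisy_band, context=None):
    """Hypothetical stand-in for a trained sub-network; returns its input unchanged."""
    return noisy_band


def enhance_full_band(noisy_spec, subnets=None):
    """Process each band in turn (low, middle, high) and fuse the estimates.

    noisy_spec: complex STFT of shape (n_bins, n_frames).
    Each stage receives the previously estimated bands as `context`,
    which is one assumed reading of "multi-stage chain optimization".
    """
    slices = band_slices()
    subnets = subnets or [placeholder_subnet] * len(slices)
    estimates, context = [], None
    for band, net in zip(slices, subnets):
        estimate = net(noisy_spec[band], context)
        estimates.append(estimate)
        context = estimate if context is None else np.concatenate([context, estimate], axis=0)
    # Fusion here is plain concatenation along frequency back to the full band.
    return np.concatenate(estimates, axis=0)
```

With these assumed settings the spectrum has 481 frequency bins, split into 160, 160, and 161 bins for the three bands. Replacing `placeholder_subnet` with real models would leave the splitting and fusion logic unchanged.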
