Towards speech enhancement using a variational U-Net architecture

Paper title

Towards speech enhancement using a variational U-Net architecture

Authors

Nustede, Eike J., Anemüller, Jörn

Abstract

We investigate the viability of a variational U-Net architecture for denoising of single-channel audio data. Deep network speech enhancement systems commonly aim to estimate filter masks, or opt to work on the waveform signal, potentially neglecting relationships across higher dimensional spectro-temporal features. We study the adoption of a probabilistic bottleneck into the classic U-Net architecture for direct spectral reconstruction. Evaluation of several ablation network variants is carried out using signal-to-distortion ratio and perceptual measures, on audio data that includes known and unknown noise types as well as reverberation. Our experiments show that the residual (skip) connections in the proposed system are a prerequisite for successful spectral reconstruction, i.e., without filter mask estimation. Results show, on average, an advantage of the proposed variational U-Net architecture over its classic, non-variational version in signal enhancement performance under reverberant conditions of 0.31 and 6.98 in PESQ and STOI scores, respectively. Anecdotal evidence points to improved suppression of impulsive noise sources with the variational U-Net compared to the recurrent mask estimation network baseline.
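The abstract does not say how the probabilistic bottleneck is implemented. As a rough illustration only, the sketch below assumes the usual VAE-style Gaussian bottleneck: the encoder predicts a mean and log-variance, a latent is sampled with the reparameterization trick, and a KL term against a standard normal prior regularizes it. The function name `variational_bottleneck` and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def variational_bottleneck(mu, log_var, rng):
    """Sample a latent z and compute its KL divergence to N(0, I).

    Assumption: a diagonal Gaussian posterior, as in a standard VAE.
    The paper's actual bottleneck may differ.
    """
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims.
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)
    return z, kl


# Hypothetical bottleneck features: batch of 2, latent size 8.
rng = np.random.default_rng(0)
mu = np.zeros((2, 8))
log_var = np.zeros((2, 8))
z, kl = variational_bottleneck(mu, log_var, rng)
```

In a U-Net of this kind, `z` would be passed to the decoder alongside the skip connections, and `kl` would be added to the spectral reconstruction loss. How the two terms are weighted is a design choice the abstract does not specify.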
