Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings

Paper title

Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings

Authors

Bosca, Amélie; Guérin, Alexandre; Perotin, Lauréline; Kitić, Srđan

Abstract

We present a CNN architecture for speech enhancement from multichannel first-order Ambisonics mixtures. Data-dependent spatial filters, derived from a mask-based approach, help an automatic speech recognition engine cope with adverse conditions of reverberation and competing speakers. The masks are predicted by a neural network fed with rough estimates of the speech and noise amplitude spectra, under the assumption of known directions of arrival. This study evaluates replacing the previously investigated recurrent LSTM network with a convolutional U-net, under more demanding conditions that include an additional second competing speaker. We show that, thanks to more accurate short-term mask prediction, the U-net architecture brings some improvement in word error rate. Moreover, the results indicate that dilated convolutional layers are beneficial in difficult situations with two interfering speakers and/or where the target and interferers are close to each other in angular distance. These results also come with a two-fold reduction in the number of parameters.
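The abstract does not spell out why dilation helps, so the following is only a minimal illustrative sketch of a dilated convolution in general, not the authors' network. It shows how spacing out the kernel taps enlarges the context seen by each output without adding parameters. The function names, the kernel size of 3, and the dilation rates 1, 2, 4 are all assumptions chosen for illustration; the abstract does not state the values used in the paper.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1-D dilated convolution (cross-correlation form).

    Hypothetical helper for illustration; taps are spaced `dilation` samples apart.
    """
    span = (len(kernel) - 1) * dilation + 1  # input samples covered by one output
    n_out = len(x) - span + 1
    if n_out <= 0:
        raise ValueError("input is shorter than the dilated kernel span")
    return [
        sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(n_out)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolution layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Toy input: a linear ramp, so differences between taps are easy to check.
x = [float(i) for i in range(10)]
kernel = [1.0, 0.0, -1.0]  # assumed 3-tap kernel, same weights in both cases

y1 = dilated_conv1d(x, kernel, dilation=1)  # compares samples 2 apart
y2 = dilated_conv1d(x, kernel, dilation=2)  # compares samples 4 apart

# Three layers with the same parameter count, with and without dilation
# (dilation rates are assumed, not taken from the paper).
rf_plain = receptive_field(3, [1, 1, 1])
rf_dilated = receptive_field(3, [1, 2, 4])

print(y1, y2, rf_plain, rf_dilated)
```

With identical weights, the dilated stack covers roughly twice the context of the plain one, which is the general mechanism by which dilation widens the view of a convolutional mask estimator without increasing its size.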
