Paper Title

Mask scalar prediction for improving robust automatic speech recognition

Paper Authors

Arun Narayanan, James Walker, Sankaran Panchapagesan, Nathan Howard, Yuma Koizumi

Abstract

Using neural network based acoustic frontends for improving robustness of streaming automatic speech recognition (ASR) systems is challenging because of the causality constraints and the resulting distortion that the frontend processing introduces in speech. Time-frequency masking based approaches have been shown to work well, but they need additional hyper-parameters to scale the mask to limit speech distortion. Such mask scalars are typically hand-tuned and chosen conservatively. In this work, we present a technique to predict mask scalars using an ASR-based loss in an end-to-end fashion, with minimal increase in the overall model size and complexity. We evaluate the approach on two robust ASR tasks: multichannel enhancement in the presence of speech and non-speech noise, and acoustic echo cancellation (AEC). Results show that the presented algorithm consistently improves word error rate (WER) without the need for any additional tuning over strong baselines that use hand-tuned hyper-parameters: up to 16% for multichannel enhancement in noisy conditions, and up to 7% for AEC.
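The abstract refers to hand-tuned mask scalars that limit speech distortion when a time-frequency mask is applied to a noisy spectrogram. A common convention (an illustrative sketch, not necessarily this paper's exact formulation) is to raise the mask to a scalar exponent: an exponent below 1 makes the mask less aggressive, trading residual noise for less distortion of the target speech. The names `apply_scaled_mask`, `noisy_mag`, and `scalar` below are hypothetical, introduced only for this example:

```python
import numpy as np

def apply_scaled_mask(noisy_mag, mask, scalar=0.5):
    """Apply a time-frequency mask raised to a scalar exponent.

    scalar < 1 softens the mask (less suppression, less speech
    distortion); scalar = 1 applies the raw mask. The paper's
    contribution is to predict this scalar end-to-end with an
    ASR-based loss instead of hand-tuning it.
    """
    mask = np.clip(mask, 0.0, 1.0)  # keep the mask in [0, 1]
    return noisy_mag * np.power(mask, scalar)

# Toy 2x3 "spectrogram" magnitudes and an estimated mask.
noisy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
mask = np.array([[0.25, 1.0, 0.0],
                 [0.81, 0.5, 1.0]])

enhanced = apply_scaled_mask(noisy, mask, scalar=0.5)
# With scalar=0.5, mask values of 0.25 and 0.81 are softened
# to 0.5 and 0.9, so more of the noisy signal passes through.
```

Bins the mask fully trusts (mask = 1) or fully rejects (mask = 0) are unaffected by the exponent; only intermediate values are softened, which is why the scalar directly controls the suppression/distortion trade-off.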
