End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

Paper Title

End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

Authors

Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono

Abstract

Self-supervised learning representation (SSLR) has proven highly effective in automatic speech recognition (ASR), mainly on clean speech. Recent work has shown the benefit of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper advances that integration further by handling multi-channel input. We propose a novel end-to-end architecture that integrates dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track, with a word error rate (WER) of 1.77%. While the strong WavLM-based SSLR shows promising results by itself, end-to-end integration with the weighted power minimization distortionless response (WPD) beamformer, which performs dereverberation and denoising simultaneously, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.
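The headline result above is stated as a word error rate. As a rough illustration of how that metric is usually computed (not the authors' evaluation code), the sketch below counts word-level substitutions, deletions, and insertions with a Levenshtein alignment and divides by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; real evaluation pipelines typically apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 1.77% therefore corresponds to fewer than two word errors per hundred reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.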
