Paper Title
Residual Language Model for End-to-end Speech Recognition
Paper Authors
Paper Abstract
End-to-end automatic speech recognition struggles to adapt to speech from unknown target domains despite being trained on a large amount of paired audio--text data. Recent studies estimate the linguistic bias of the model as an internal language model (LM). To adapt effectively to the target domain, the internal LM is subtracted from the posterior during inference and fused with an external target-domain LM. However, this fusion complicates inference, and the estimation of the internal LM may not always be accurate. In this paper, we propose a simple external LM fusion method for domain adaptation that takes the internal LM estimate into account during training. We directly model the residual factor between the external and internal LMs, namely the residual LM. To train the residual LM stably, we propose smoothing the estimated internal LM and optimizing the residual LM with a combination of cross-entropy and mean-squared-error losses, which account for the statistical behavior of the internal LM on target-domain data. We experimentally confirmed that the proposed residual LM outperforms internal LM estimation in most cross-domain and intra-domain scenarios.