Paper Title

Streaming automatic speech recognition with the transformer model

Paper Authors

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Paper Abstract

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, the practical usage is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech, which to our knowledge is the best published streaming end-to-end ASR result for this task.
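The key streaming ingredient on the encoder side is time-restricted self-attention: each frame attends only to a fixed window of past and future frames, which bounds the look-ahead latency instead of requiring the full utterance. The sketch below illustrates that masking idea in PyTorch. It is a minimal single-head illustration, not the authors' implementation; the function names (`time_restricted_mask`, `time_restricted_self_attention`) and the window sizes are hypothetical, and the paper's triggered attention mechanism on the decoder side (which uses CTC spikes to align decoding with the audio stream) is not shown.

```python
import torch

def time_restricted_mask(seq_len: int, left_context: int, right_context: int) -> torch.Tensor:
    # Hypothetical helper: boolean mask whose entry [t, s] is True when frame s
    # lies outside the window [t - left_context, t + right_context] and must be
    # blocked from attention.
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]  # rel[t, s] = s - t
    return (rel < -left_context) | (rel > right_context)

def time_restricted_self_attention(q, k, v, left_context, right_context):
    # Single-head scaled dot-product attention restricted to a local window,
    # so each encoder frame needs only a bounded amount of future context.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    mask = time_restricted_mask(q.size(-2), left_context, right_context)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 10 frames of 64-dim features; each frame attends to at most
# 2 past frames and 1 future frame (window sizes chosen for illustration).
x = torch.randn(10, 64)
out = time_restricted_self_attention(x, x, x, left_context=2, right_context=1)
print(out.shape)  # torch.Size([10, 64])
```

In a full streaming encoder, the per-layer right context accumulates across stacked layers, so the total algorithmic latency is roughly the per-layer look-ahead times the number of layers; choosing small windows per layer is what keeps the system streamable.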
