Title

An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

Authors

Niko Moritz, Frank Seide, Duc Le, Jay Mahadeokar, Christian Fuegen

Abstract

The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are the RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types we can place the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with traditional FST-based ASR decoders. However, the MonoRNN-T has so far been found to have worse accuracy than RNN-T. It does not have to be that way: by regularizing the training via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as or better than RNN-T. This is demonstrated on LibriSpeech and on a large-scale in-house data set.
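To make the topology difference in the abstract concrete, below is a small NumPy sketch (not the paper's implementation) of the forward-algorithm recursions for the two lattices: in the standard RNN-T lattice a label emission stays on the same frame, while in the monotonic lattice every emission, blank or label, consumes exactly one frame. The joint-network scores `b` and `y` are random stand-ins for illustration.

```python
import numpy as np

# Toy forward-algorithm sketch contrasting the RNN-T and MonoRNN-T lattices.
# b[t, u] = log P(blank | frame t, u labels already emitted)
# y[t, u] = log P(next reference label | frame t, u labels already emitted)
# These scores are random stand-ins for a real joint network's output.

def rnnt_neg_log_like(b, y):
    """Standard RNN-T: a label emission does NOT advance time, so several
    labels may be emitted on one frame (the source of runaway hallucination)."""
    T, U = b.shape[0], b.shape[1] - 1
    alpha = np.full((T + 1, U + 1), -np.inf)  # alpha[t, u]: t frames, u labels
    alpha[0, 0] = 0.0
    for t in range(T + 1):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            s = -np.inf
            if t > 0:                 # blank consumes one frame
                s = np.logaddexp(s, alpha[t - 1, u] + b[t - 1, u])
            if u > 0 and t < T:       # label stays on the current frame
                s = np.logaddexp(s, alpha[t, u - 1] + y[t, u - 1])
            alpha[t, u] = s
    return -alpha[T, U]

def mono_rnnt_neg_log_like(b, y):
    """MonoRNN-T: every emission consumes one frame, so exactly one
    model score is consumed per time step."""
    T, U = b.shape[0], b.shape[1] - 1
    assert U <= T, "cannot emit more labels than frames"
    alpha = np.full((T + 1, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(U + 1):
            s = alpha[t - 1, u] + b[t - 1, u]                           # blank
            if u > 0:
                s = np.logaddexp(s, alpha[t - 1, u - 1] + y[t - 1, u - 1])  # label
            alpha[t, u] = s
    return -alpha[T, U]

# Random normalized scores for T=4 frames and U=2 reference labels.
rng = np.random.default_rng(0)
T, U = 4, 2
logits = rng.normal(size=(T, U + 1, 2))
logp = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
b, y = logp[..., 0], logp[..., 1]

loss_rnnt = rnnt_neg_log_like(b, y)
loss_mono = mono_rnnt_neg_log_like(b, y)
print(f"RNN-T loss: {loss_rnnt:.3f}  MonoRNN-T loss: {loss_mono:.3f}")
```

Note how only the monotonic recursion advances `t` on both branches; that is the "one model score per time step" property the abstract credits for FST-decoder compatibility. CTC-T shares this monotonic topology but differs in how its scores are defined.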
