Paper Title

Transformer-based Online Speech Recognition with Decoder-end Adaptive Computation Steps

Paper Authors

Mohan Li, Catalin Zorila, Rama Doddipatla

Paper Abstract

Transformer-based end-to-end (E2E) automatic speech recognition (ASR) systems have recently gained wide popularity and have been shown to outperform E2E models based on recurrent structures on a number of ASR tasks. However, like other E2E models, Transformer ASR requires the full input sequence to compute attention in both the encoder and the decoder, which increases latency and poses a challenge for online ASR. This paper proposes the Decoder-end Adaptive Computation Steps (DACS) algorithm to address the latency issue and facilitate online ASR. The proposed algorithm streams the decoding of Transformer ASR by triggering an output once the confidence accumulated from the encoder states reaches a certain threshold. Unlike other monotonic attention mechanisms, which risk visiting the entire sequence of encoder states at each output step, DACS introduces a maximum look-ahead step to prevent decoding from reaching the end of speech too quickly. A chunkwise encoder is adopted in our system to handle real-time speech input. The proposed online Transformer ASR system has been evaluated on the Wall Street Journal (WSJ) and AIShell-1 datasets, yielding a 5.5% word error rate (WER) and a 7.1% character error rate (CER) respectively, with only a minor decay in performance compared to the offline systems.
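
To make the halting mechanism concrete, the following is a minimal Python sketch of a DACS-style attention step, written under assumptions rather than taken from the paper: the function name dacs_attend, the use of precomputed per-frame sigmoid confidence scores, and the renormalisation of the truncated confidences into attention weights are illustrative simplifications of the algorithm described in the abstract.

```python
import numpy as np

def dacs_attend(enc_states, confidences, threshold=1.0, max_lookahead=16, start=0):
    """One DACS-style output step (illustrative sketch, not the paper's code).

    enc_states:    (T, D) array of encoder states streamed so far.
    confidences:   (T,) per-frame halting confidences in [0, 1], assumed to be
                   sigmoid scores computed against the current decoder query.
    threshold:     accumulated confidence required to trigger an output.
    max_lookahead: cap on frames inspected in this step, so a single output
                   cannot run ahead to the end of speech.
    start:         first frame to inspect; monotonic, never moves backwards.
    """
    acc, t = 0.0, start
    while t < len(enc_states) and (t - start) < max_lookahead:
        acc += confidences[t]
        t += 1
        if acc >= threshold:           # enough confidence: emit an output
            break
    w = confidences[start:t]
    w = w / max(w.sum(), 1e-8)         # renormalise truncated confidences
    context = w @ enc_states[start:t]  # context vector for this output step
    return context, t                  # t is the next step's starting frame

# Toy usage: 20 streamed frames of 4-dimensional encoder states.
enc = np.random.randn(20, 4)
conf = np.random.rand(20) * 0.4
ctx, nxt = dacs_attend(enc, conf, threshold=1.0, max_lookahead=8)
```

The two stopping conditions mirror the abstract: crossing the confidence threshold triggers an output, while max_lookahead bounds how many frames one step may consume, and the returned position keeps the attention monotonic across output steps.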
