E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Paper Title

E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

Authors

Huang, W. Ronny, Chang, Shuo-Yiin, Sainath, Tara N., He, Yanzhang, Rybach, David, David, Robert, Prabhavalkar, Rohit, Allauzen, Cyril, Peyser, Cal, Strohman, Trevor D.

Abstract

We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for simultaneous high quality 2nd pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.
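The abstract does not spell out how dummy frame injection works, so the following is only a toy sketch of one plausible reading: a non-causal pass that needs 900 ms of right context can be flushed at EOS by feeding it placeholder frames instead of waiting for real audio. The 60 ms frame duration, the `ToySecondPass` class, the `DUMMY` placeholder, and `run_segment` are all hypothetical names and values chosen for illustration; they are not taken from the paper.

```python
from collections import deque

FRAME_MS = 60                          # assumed frame duration (not given in the abstract)
LOOKAHEAD_MS = 900                     # 2nd-pass delay stated in the abstract
LOOKAHEAD = LOOKAHEAD_MS // FRAME_MS   # 15 frames of right context under that assumption

DUMMY = None                           # hypothetical placeholder for an injected dummy frame


class ToySecondPass:
    """Toy non-causal pass: a frame is emitted only after `lookahead` later frames arrive."""

    def __init__(self, lookahead):
        self.lookahead = lookahead
        self.buffer = deque()

    def push(self, frame):
        """Accept one frame and return the list of real frames that became ready."""
        self.buffer.append(frame)
        if len(self.buffer) > self.lookahead:
            ready = self.buffer.popleft()
            return [] if ready is DUMMY else [ready]
        return []

    def finalize_with_dummy_frames(self):
        """Flush every buffered real frame by pushing `lookahead` dummy frames."""
        out = []
        for _ in range(self.lookahead):
            out.extend(self.push(DUMMY))
        self.buffer.clear()            # only dummy frames remain; start the next segment clean
        return out


def run_segment(frames, eos_index):
    """Stream frames through the toy 2nd pass and finalize when EOS fires at `eos_index`."""
    second_pass = ToySecondPass(LOOKAHEAD)
    emitted = []
    for t, frame in enumerate(frames):
        emitted.extend(second_pass.push(frame))
        if t == eos_index:             # EOS signal from the (simulated) 1st-pass segmenter
            emitted.extend(second_pass.finalize_with_dummy_frames())
            break
    return emitted
```

In this toy setting, calling `run_segment(list(range(20)), eos_index=19)` returns all 20 frames at the moment EOS fires, whereas without the flush the last 15 frames would stay buffered until 900 ms of further audio arrived. A real system would pass the dummy frames through the non-causal encoder layers, which this sketch does not model.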
