Paper Title

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Authors

W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

Abstract

Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text with negligible extra computation. In experiments on real-world long-form audio (YouTube) with lengths of up to 30 minutes, we demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.
