Paper Title


Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer

Authors

Jingyu Sun, Guiping Zhong, Dinghao Zhou, Baoxiang Li

Abstract


Streaming automatic speech recognition models frequently underperform their non-streaming counterparts due to the absence of future context. To improve the performance of the streaming model and reduce computational complexity, this paper employs a frame-level model for streaming automatic speech recognition that uses an efficient augmented-memory Transformer (Emformer) block and a dynamic latency training method. Long-range history context is stored in the augmented memory bank as a complement to the limited history context used in the encoder. Keys and values are cached by a caching mechanism and reused for the next chunk to reduce computation. A dynamic latency training method is then proposed to obtain better performance while supporting both low- and high-latency inference. Our experiments are conducted on the benchmark 960h LibriSpeech dataset. With an average latency of 640 ms, our model achieves a relative WER reduction of 6.0% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.
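The key/value caching idea in the abstract can be illustrated with a minimal sketch: chunks are processed left to right, and each chunk's keys and values are cached so the next chunk can attend to a limited left context without recomputing it. This is a toy single-head illustration, not the paper's Emformer implementation; the function name `chunk_attention`, the weight shapes, and the `left_context` parameter are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunk_attention(chunks, d=8, left_context=1):
    """Toy chunk-wise self-attention with a key/value cache.

    Each chunk attends to itself plus the cached keys/values of up to
    `left_context` previous chunks; the cache is reused instead of
    recomputing past projections (illustrative, not the paper's model).
    """
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    k_cache, v_cache, outputs = [], [], []
    for x in chunks:                       # x: (chunk_len, d)
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # Attend over cached history (limited left context) + current chunk.
        ks = np.concatenate(k_cache[-left_context:] + [k], axis=0)
        vs = np.concatenate(v_cache[-left_context:] + [v], axis=0)
        att = softmax(q @ ks.T / np.sqrt(d))
        outputs.append(att @ vs)
        k_cache.append(k)                  # cache for reuse by the next chunk
        v_cache.append(v)
    return np.concatenate(outputs, axis=0)

out = chunk_attention([np.random.default_rng(i).standard_normal((4, 8))
                       for i in range(3)])
```

With three chunks of four frames each, the output stacks to a (12, 8) array; widening `left_context` trades more computation and latency for more history, which is the knob the paper's dynamic latency training varies.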
