CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition

Paper Title

CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition

Authors

Minglun Han, Linhao Dong, Shiyu Zhou, Bo Xu

Abstract

End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks and have shown the potential to become mainstream. However, their unified structure and E2E training make it difficult to inject contextual information into them for contextual biasing. Although contextual LAS (CLAS) offers an excellent all-neural solution, the degree of biasing toward the given context information is not explicitly controllable. In this paper, we focus on incorporating context information into the continuous integrate-and-fire (CIF) based model, which supports contextual biasing in a more controllable fashion. Specifically, an extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information, and decode the contextual output distribution, thus forming a collaborative decoding with the decoder of the CIF-based model. Evaluated on the named-entity-rich evaluation sets of HKUST/AISHELL-2, our method brings a relative character error rate (CER) reduction of 8.83%/21.13% and a relative named entity character error rate (NE-CER) reduction of 40.14%/51.50% compared with a strong baseline. In addition, it maintains performance on the original evaluation sets without degradation.
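
The figures above are relative reductions, that is, the drop in error rate expressed as a fraction of the baseline error rate rather than as an absolute difference. A minimal sketch of that arithmetic follows. The function name and the two error rates are hypothetical and chosen only to illustrate the calculation; they are not values reported in the paper.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative error-rate reduction of `system` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - system) / baseline


# Hypothetical error rates in percent (illustrative only, not from the paper).
baseline_cer = 20.0
system_cer = 18.0

# An absolute drop of 2.0 points on a 20.0 baseline is a 10% relative reduction.
print(f"{relative_reduction(baseline_cer, system_cer):.2f}%")  # prints "10.00%"
```

Because the measure is normalized by the baseline, the same absolute improvement yields a larger relative reduction when the baseline error rate is lower.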
