Conversation-oriented ASR with multi-look-ahead CBS architecture

Title

Conversation-oriented ASR with multi-look-ahead CBS architecture

Authors

Huaibo Zhao, Shinya Fujie, Tetsuji Ogawa, Jin Sakuma, Yusuke Kida, Tetsunori Kobayashi

Abstract

During conversations, humans can infer the speaker's intention at any point in the speech and promptly prepare their next action. This ability is also key for conversational systems to achieve rhythmic and natural conversation. To support it, the automatic speech recognition (ASR) used to transcribe speech in real time must achieve high accuracy without delay. In streaming ASR, high accuracy is obtained by attending to look-ahead frames, which increases delay. To tackle this trade-off, we propose a multiple-latency streaming ASR that achieves high accuracy with zero look-ahead. The proposed system contains two encoders that operate in parallel: a primary encoder generates accurate outputs using look-ahead frames, and an auxiliary encoder recognizes the look-ahead portion of the primary encoder without look-ahead. The proposed system is built on the contextual block streaming (CBS) architecture, which leverages block processing and has a high affinity for the multiple-latency architecture. Various methods for architecting the system are also studied, including shifting the network to perform as different encoders and generating both encoders' outputs in one encoding pass.
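
The two-encoder idea in the abstract can be illustrated with a toy scheduling sketch. It is not the authors' implementation: the function names (`split_blocks`, `primary_encode`, `auxiliary_encode`, `stream`), the block sizes, and the string labels are all hypothetical, and the "encoders" are stand-ins that only record which encoder covered which frame. The sketch also assumes that frames first covered by the auxiliary encoder are later re-encoded by the primary encoder once the next block arrives.

```python
from typing import List, Tuple

Label = Tuple[int, str]  # (frame index, name of the encoder that covered it)


def split_blocks(num_frames: int, center: int, look_ahead: int) -> List[Tuple[range, range]]:
    """Return (center_frames, look_ahead_frames) index ranges per block.

    Hypothetical layout: each block has `center` frames to encode plus
    up to `look_ahead` future frames used only as context.
    """
    blocks = []
    for start in range(0, num_frames, center):
        mid = min(start + center, num_frames)
        end = min(mid + look_ahead, num_frames)
        blocks.append((range(start, mid), range(mid, end)))
    return blocks


def primary_encode(center_frames: range) -> List[Label]:
    # Stand-in for the primary encoder: covers the center frames,
    # which have future context available (the look-ahead frames).
    return [(i, "primary") for i in center_frames]


def auxiliary_encode(look_ahead_frames: range) -> List[Label]:
    # Stand-in for the auxiliary encoder: covers the look-ahead frames
    # themselves, with no future context.
    return [(i, "auxiliary") for i in look_ahead_frames]


def stream(num_frames: int, center: int = 4, look_ahead: int = 2) -> List[List[Label]]:
    """After each block, return labels for every frame received so far."""
    finalized: List[Label] = []
    snapshots: List[List[Label]] = []
    for center_frames, look_ahead_frames in split_blocks(num_frames, center, look_ahead):
        finalized = finalized + primary_encode(center_frames)
        provisional = auxiliary_encode(look_ahead_frames)
        # The combined output reaches the newest received frame,
        # so no frame is left waiting for future context.
        snapshots.append(finalized + provisional)
    return snapshots


if __name__ == "__main__":
    for step, snapshot in enumerate(stream(10)):
        print(step, snapshot)
```

In this toy layout, every snapshot covers all frames received up to that point: the most recent frames carry the `"auxiliary"` label and switch to `"primary"` once the next block supplies their future context.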
