Paper Title
Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-based LVCSR
Paper Authors
Paper Abstract
The Transformer has shown impressive performance in automatic speech recognition. It uses an encoder-decoder structure with self-attention to learn the relationship between the high-level representation of the source inputs and the embedding of the target outputs. In this paper, we propose a novel decoder structure that features a self-and-mixed attention decoder (SMAD) with a deep acoustic structure (DAS) to improve the acoustic representation of Transformer-based LVCSR. Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that simultaneously learns the alignment between the different levels of acoustic abstraction and their corresponding linguistic information in a shared embedding space. ASR experiments on AISHELL-1 show that the proposed structure achieves CERs of 4.8% on the dev set and 5.1% on the test set, which are, to the best of our knowledge, the best results reported on this task.
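To make the idea in the abstract concrete, below is a minimal PyTorch sketch of how one such decoder layer might look. The class name `MixedAttentionDecoderLayer`, all dimensions, and the exact masking scheme are hypothetical illustrations of the described mechanism (masked self-attention over the target stream, then one attention pass over the concatenated acoustic and linguistic states so both are aligned in a shared embedding space), not the authors' implementation.

```python
import torch
import torch.nn as nn


class MixedAttentionDecoderLayer(nn.Module):
    """Hypothetical sketch of one SMAD decoder layer: masked self-attention
    over the text stream, then a "mixed" attention pass over the concatenated
    acoustic and text states so both live in a shared embedding space. The
    acoustic half is refined layer by layer (the deep acoustic structure)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        self.mixed_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                                batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, acoustic: torch.Tensor, text: torch.Tensor):
        # acoustic: (batch, T_a, d_model) acoustic states from the encoder
        # text:     (batch, T_t, d_model) target (linguistic) embeddings
        T_a, T_t = acoustic.size(1), text.size(1)
        causal = torch.triu(torch.ones(T_t, T_t, dtype=torch.bool,
                                       device=text.device), diagonal=1)

        # Masked self-attention over the target embeddings.
        t, _ = self.self_attn(text, text, text, attn_mask=causal)
        text = self.norm1(text + t)

        # Mixed attention over the concatenated streams: acoustic positions
        # may attend everywhere, while text positions attend to all acoustic
        # frames but only to past and present text tokens.
        mixed = torch.cat([acoustic, text], dim=1)
        mask = torch.zeros(T_a + T_t, T_a + T_t, dtype=torch.bool,
                           device=text.device)
        mask[T_a:, T_a:] = causal  # causal constraint on the text-to-text block
        m, _ = self.mixed_attn(mixed, mixed, mixed, attn_mask=mask)
        mixed = self.norm2(mixed + m)
        mixed = self.norm3(mixed + self.ffn(mixed))

        # Split back so the next layer receives refined versions of both streams.
        return mixed[:, :T_a], mixed[:, T_a:]
```

Stacking several such layers would give the decoder a progressively more abstract acoustic representation at each depth, matching the multiple levels of acoustic abstraction the abstract describes.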