Paper Title

LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

Paper Authors

Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

Paper Abstract

Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history with a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a genuine language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with the pre-trained contextual encoder RoBERTa to further boost performance. Moreover, we propose the LongFNT architecture, which extends long-form modeling to the original speech input and achieves the best performance. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reductions, respectively.
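
To make the sentence-level fusion concrete, below is a minimal PyTorch sketch of the idea described in the abstract, not the paper's exact implementation: the vocabulary predictor's hidden states attend over sentence-level embeddings of the transcription history (e.g., produced by a frozen RoBERTa encoder), and the attended context is fused with the predictor output before the final vocabulary projection. All module names, layer sizes, and the concat-then-project fusion operator here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LongFormFusion(nn.Module):
    """Hypothetical sketch of LongFNT-Text-style fusion: vocabulary-predictor
    hidden states query sentence-level history embeddings via cross-attention,
    and the attended context is fused into the language-model branch."""

    def __init__(self, hidden_dim: int = 512, context_dim: int = 768,
                 vocab_size: int = 5000):
        super().__init__()
        # Map RoBERTa-sized embeddings into the predictor's hidden space.
        self.context_proj = nn.Linear(context_dim, hidden_dim)
        # Predictor states attend over the projected history embeddings.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                          batch_first=True)
        # Assumed fusion operator: concatenate, then project back down.
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        # Vocabulary prediction head (the "real language model" output).
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, pred_hidden: torch.Tensor,
                history_emb: torch.Tensor) -> torch.Tensor:
        # pred_hidden:  (B, U, hidden_dim)  predictor states for U tokens
        # history_emb:  (B, S, context_dim) embeddings of S prior sentences
        ctx = self.context_proj(history_emb)            # (B, S, hidden_dim)
        attended, _ = self.attn(pred_hidden, ctx, ctx)  # query the history
        fused = self.fuse(torch.cat([pred_hidden, attended], dim=-1))
        return self.out(fused)                          # (B, U, vocab_size)

# Toy usage: batch of 2 utterances, 10 tokens each, 3 history sentences.
model = LongFormFusion()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 3, 768))
print(logits.shape)  # torch.Size([2, 10, 5000])
```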
