Paper Title

Streaming end-to-end multi-talker speech recognition

Paper Authors

Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

Paper Abstract

End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcription. To the best of our knowledge, all existing research works are constrained to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition. Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints. We study two different model architectures that are based on a speaker-differentiator encoder and a mask encoder, respectively. To train this model, we investigate the widely used Permutation Invariant Training (PIT) approach and the Heuristic Error Assignment Training (HEAT) approach. Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT can achieve better accuracy compared with PIT, and that the SURT model with a 150-millisecond algorithmic latency constraint compares favorably with the offline sequence-to-sequence baseline model in terms of accuracy.
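
For readers who want a concrete picture of the two training criteria, the following is a minimal sketch of the standard two-speaker formulation; the notation ($y^1$, $y^2$ for the two output channels, $r^1$, $r^2$ for the reference transcriptions, and $\mathcal{L}_{\text{rnnt}}$ for the RNN-T loss) is introduced here for illustration and is not taken from the abstract. PIT evaluates both possible assignments of output channels to references and keeps the cheaper one:

$$\mathcal{L}_{\text{PIT}} = \min\Big(\mathcal{L}_{\text{rnnt}}(y^1, r^1) + \mathcal{L}_{\text{rnnt}}(y^2, r^2),\ \ \mathcal{L}_{\text{rnnt}}(y^1, r^2) + \mathcal{L}_{\text{rnnt}}(y^2, r^1)\Big)$$

whereas HEAT fixes the assignment with a heuristic (for example, pairing the speaker who starts first with the first output channel) and evaluates only that single pairing:

$$\mathcal{L}_{\text{HEAT}} = \mathcal{L}_{\text{rnnt}}(y^1, r^1) + \mathcal{L}_{\text{rnnt}}(y^2, r^2)$$

This avoids the combinatorial growth of PIT as the number of speakers increases, at the cost of relying on the heuristic assignment being correct.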
