Paper Title

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

Paper Authors

Xuankai Chang, Niko Moritz, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux

Paper Abstract

Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and the ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modeling becomes very similar to single-speaker ASR modeling, in that tokens by multiple speakers are recognized as a single merged sequence in chronological order. For evaluation, we perform experiments on a simulated multi-speaker speech dataset derived from LibriSpeech, obtaining promising results with performance close to classical benchmarks for the task.
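The "single merged sequence in chronological order" view described in the abstract can be pictured with a small sketch. This is a hypothetical illustration, not the paper's implementation: the function name, toy data, and frame times are all invented. Tokens from each speaker carry an emission time, and the streams are interleaved into one time-ordered sequence in which each token is paired with its speaker identity (in GTC-e itself, speaker identity lives on the graph transitions rather than the nodes).

```python
# Hypothetical sketch of the merged-sequence view: per-speaker token
# streams with emission times are combined into a single sequence
# sorted chronologically, with the speaker id attached to each token.

def merge_speaker_tokens(*streams):
    """Merge per-speaker [(time, token), ...] streams into one list of
    (time, speaker_id, token) tuples, sorted by emission time."""
    merged = []
    for speaker_id, stream in enumerate(streams):
        for time, token in stream:
            merged.append((time, speaker_id, token))
    merged.sort(key=lambda item: item[0])  # chronological order
    return merged

# Two toy speakers with overlapping speech (times are arbitrary frames).
spk0 = [(1, "hello"), (5, "world")]
spk1 = [(3, "good"), (4, "morning")]

print(merge_speaker_tokens(spk0, spk1))
# [(1, 0, 'hello'), (3, 1, 'good'), (4, 1, 'morning'), (5, 0, 'world')]
```

The sketch only conveys the output-space idea: once tokens are merged this way, a multi-speaker utterance looks like a single label sequence, which is why the abstract says GTC-e makes multi-speaker modeling resemble single-speaker modeling.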
