Disentangled representation learning for multilingual speaker recognition

Paper title

Disentangled representation learning for multilingual speaker recognition

Authors

Nam, Kihyun; Kim, Youkyum; Huh, Jaesung; Heo, Hee Soo; Jung, Jee-weon; Chung, Joon Son

Abstract

The goal of this paper is to learn robust speaker representations for bilingual speaking scenarios. The majority of the world's population speaks at least two languages; however, most speaker recognition systems fail to recognise the same speaker when that speaker talks in different languages. Popular speaker recognition evaluation sets do not consider the bilingual scenario, making it difficult to analyse the effect of bilingual speakers on speaker recognition performance. In this paper, we publish a large-scale evaluation set named VoxCeleb1-B, derived from VoxCeleb, that considers bilingual scenarios. We introduce an effective disentanglement learning strategy that combines adversarial and metric learning-based methods. This approach addresses the bilingual situation by disentangling language-related information from the speaker representation while ensuring stable speaker representation learning. Our language-disentangled learning method uses only language pseudo-labels, without manual information.
