Paper Title
The NIST CTS Speaker Recognition Challenge
Paper Authors
Paper Abstract
The US National Institute of Standards and Technology (NIST) has been conducting a second iteration of the CTS Challenge since August 2020. The current iteration of the CTS Challenge is a leaderboard-style speaker recognition evaluation using telephony data extracted from the unexposed portions of the Call My Net 2 (CMN2) and Multi-Language Speech (MLS) corpora collected by the LDC. The CTS Challenge is currently organized in a similar manner to the SRE19 CTS Challenge, offering only an open training condition using two evaluation subsets, namely Progress and Test. Unlike in the SRE19 Challenge, no training or development set was initially released, and NIST has publicly released the leaderboards on both subsets for the CTS Challenge. Which subset (i.e., Progress or Test) a trial belongs to is unknown to challenge participants, and each system submission needs to contain outputs for all of the trials. The CTS Challenge has also served, and will continue to do so, as a prerequisite for entrance to the regular SREs (such as SRE21). Since August 2020, a total of 53 organizations (forming 33 teams) from academia and industry have participated in the CTS Challenge and submitted more than 4400 valid system outputs. This paper presents an overview of the evaluation and several analyses of system performance for some primary conditions in the CTS Challenge. The CTS Challenge results thus far indicate remarkable improvements in performance due to 1) speaker embeddings extracted using large-scale and complex neural network architectures, such as ResNets trained with angular margin losses, 2) extensive data augmentation, 3) the use of large amounts of in-house proprietary data from a large number of labeled speakers, and 4) long-duration fine-tuning.
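The abstract attributes part of the performance gains to angular margin losses used when training speaker-embedding extractors. As a rough illustration only, below is a minimal NumPy sketch of an additive angular margin (ArcFace-style) softmax loss; the function name, hyperparameter values (`scale`, `margin`), and array shapes are illustrative assumptions and do not describe any particular CTS Challenge submission.

```python
import numpy as np

def angular_margin_loss(embeddings, labels, weight, scale=30.0, margin=0.2):
    """Additive angular margin (ArcFace-style) softmax loss, a common choice
    for training speaker-embedding extractors. Hyperparameters are illustrative.

    embeddings : (N, D) speaker embeddings
    labels     : (N,)   integer speaker labels
    weight     : (C, D) class (speaker) weight vectors
    """
    # L2-normalize embeddings and class weights so dot products are cos(theta)
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weight / np.linalg.norm(weight, axis=1, keepdims=True)
    cos_theta = np.clip(emb @ w.T, -1.0, 1.0)          # (N, C)

    # Add the angular margin m only to the target-speaker angle
    theta = np.arccos(cos_theta)
    target = np.zeros_like(cos_theta, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    logits = np.where(target, np.cos(theta + margin), cos_theta) * scale

    # Cross-entropy over the margin-adjusted, scaled logits
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

The margin penalizes the target-class angle during training, forcing embeddings of the same speaker to cluster more tightly than a plain softmax would; at scoring time only the normalized embeddings are used (e.g., with cosine or PLDA scoring).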