Paper Title

The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

Authors

Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

Abstract

Automatic speaker verification is susceptible to various manipulations and spoofing, such as text-to-speech synthesis, voice conversion, replay, tampering, and adversarial attacks. We consider a new spoofing scenario called "Partial Spoof" (PS), in which synthesized or transformed speech segments are embedded into a bona fide utterance. While existing countermeasures (CMs) can detect fully spoofed utterances, they need to be adapted or extended for the PS scenario. We propose various improvements to construct a significantly more accurate CM that can detect and locate short generated spoofed speech segments at finer temporal resolutions. First, we introduce newly developed self-supervised pre-trained models as enhanced feature extractors. Second, we extend our PartialSpoof database by adding segment labels for various temporal resolutions. Since the short spoofed speech segments embedded by attackers are of variable length, six different temporal resolutions are considered, ranging from as short as 20 ms to as long as 640 ms. Third, we propose a new CM that uses the segment-level labels at different temporal resolutions together with utterance-level labels, so that utterance- and segment-level detection are performed at the same time. We also show that the proposed CM can detect spoofing at the utterance level with low error rates in the PS scenario as well as in a related logical access (LA) scenario. The equal error rates of utterance-level detection on the PartialSpoof database and the ASVspoof 2019 LA database were 0.77% and 0.90%, respectively.
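To make the idea of segment labels at several temporal resolutions concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the six resolutions double from 20 ms to 640 ms, that coarser labels are derived from the 20 ms labels by marking a segment as spoofed whenever any 20 ms frame inside it is spoofed, and that 1 means spoofed and 0 means bona fide. All function and constant names are hypothetical.

```python
from typing import Dict, List

BASE_MS = 20  # finest resolution stated in the abstract
# Assumption: the six resolutions double from 20 ms up to 640 ms.
RESOLUTIONS_MS = [20, 40, 80, 160, 320, 640]


def downsample_labels(frame_labels: List[int], factor: int) -> List[int]:
    """Group `factor` consecutive base frames into one segment.

    Assumed rule: a segment is spoofed (1) if any frame in it is spoofed.
    A trailing partial group is kept as its own segment.
    """
    return [
        int(any(frame_labels[i:i + factor]))
        for i in range(0, len(frame_labels), factor)
    ]


def multi_resolution_labels(frame_labels: List[int]) -> Dict[int, List[int]]:
    """Return segment labels keyed by resolution in milliseconds."""
    return {
        res: downsample_labels(frame_labels, res // BASE_MS)
        for res in RESOLUTIONS_MS
    }


def utterance_label(frame_labels: List[int]) -> int:
    """Assumed rule: an utterance is spoofed if any frame is spoofed."""
    return int(any(frame_labels))


# Toy utterance: 32 frames of 20 ms (640 ms in total) with a 60 ms
# spoofed segment covering frames 10 to 12.
toy_frames = [0] * 10 + [1] * 3 + [0] * 19
toy_labels = multi_resolution_labels(toy_frames)
toy_utterance = utterance_label(toy_frames)
```

In this toy case the 60 ms spoofed region occupies three of 32 segments at 20 ms but one of two at 320 ms, which illustrates why a short embedded segment is located more precisely at finer resolutions.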
