SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech

Paper title

SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech

Authors

Hanqing Guo, Qiben Yan, Nikolay Ivanov, Ying Zhu, Li Xiao, Eric J. Hunter

Abstract

Voice-activated systems are integrated into a variety of desktop, mobile, and Internet-of-Things (IoT) devices. However, voice spoofing attacks, such as impersonation and replay attacks, in which malicious attackers synthesize the voice of a victim or simply replay it, have raised growing security concerns. Existing speaker verification techniques distinguish individual speakers via spectrographic features extracted from the audible frequency range of voice commands. However, they often have high error rates and/or long delays. In this paper, we explore a new direction of human voice research by scrutinizing the unique characteristics of human speech in the ultrasound frequency band. Our research indicates that the high-frequency ultrasound components (e.g., speech fricatives) from 20 to 48 kHz can significantly enhance the security and accuracy of speaker verification. We propose a speaker verification system, SUPERVOICE, that uses a two-stream DNN architecture with a feature fusion mechanism to generate distinctive speaker models. To test the system, we create a speech dataset with 12 hours of audio (8,950 voice samples) from 127 participants. In addition, we create a second, spoofed voice dataset to evaluate its security. To balance controlled recordings and real-world applications, the audio recordings are collected in two quiet rooms by 8 different recording devices, including 7 smartphones and an ultrasound microphone. Our evaluation shows that SUPERVOICE achieves a 0.58% equal error rate in the speaker verification task and takes only 120 ms to test an incoming utterance, outperforming all existing speaker verification systems. Moreover, within 91 ms of processing time, SUPERVOICE achieves a 0% equal error rate in detecting replay attacks launched by 5 different loudspeakers.
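
The abstract reports its results as equal error rates (EER). For readers unfamiliar with the metric, the following is a minimal sketch of how an EER can be estimated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely the claimed speaker", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER from two lists of verification scores.

    Assumption: a trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the candidate threshold where the
    false-accept rate and false-reject rate are closest.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False-accept rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False-reject rate: genuine trials that fail the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical scores, not data from the paper.
    genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
    impostor = [0.7, 0.5, 0.3, 0.2, 0.1]
    eer, threshold = equal_error_rate(genuine, impostor)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With these toy scores, one genuine trial in five is rejected and one impostor trial in five is accepted at the chosen threshold, so both error rates equal 20%. A reported 0% EER, as in the replay-detection result, corresponds to score distributions that a single threshold separates completely.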
