Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

Paper title

Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

Authors

Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

Abstract

Speaker extraction requires a speech sample from the target speaker as the reference. However, enrolling a speaker with a long speech sample is not practical. We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample. The speech extracted in early stages is used as the reference speech for later stages. For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker. This is a departure from the traditional utterance-level speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the signals decoded at multiple scales using automatically learned weights. Experiments on WSJ0-2mix and its noisy versions (WHAM! and WHAMR!) show that SpEx++ consistently outperforms other state-of-the-art baselines.
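The two ideas in the abstract, feeding extracted speech back as the reference and fusing multi-scale outputs with learned weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax normalization of the fusion weights, the choice to replace the reference entirely at each stage, and the `toy_extract` placeholder are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn unconstrained parameters into positive weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_multiscale(decoded, logits):
    """Weighted sum of equal-length signals decoded at different scales.

    `logits` stands in for the learned fusion parameters; normalizing them
    with softmax is an assumption, not a detail taken from the paper.
    """
    weights = softmax(logits)
    length = len(decoded[0])
    return [sum(w * sig[t] for w, sig in zip(weights, decoded))
            for t in range(length)]

def multi_stage_extract(mixture, reference, extract_fn, num_stages=2):
    """Run `extract_fn` repeatedly, reusing each output as the next reference.

    `extract_fn(mixture, reference)` is a hypothetical stand-in for a
    trained extraction network. `num_stages` must be at least 1.
    """
    estimate = None
    for _ in range(num_stages):
        estimate = extract_fn(mixture, reference)
        reference = estimate  # later stages see the extracted speech
    return estimate

def toy_extract(mixture, reference):
    """Placeholder 'extractor': averages mixture and reference sample-wise."""
    return [0.5 * (m + r) for m, r in zip(mixture, reference)]

fused = fuse_multiscale([[1.0, 1.0], [3.0, 3.0]], [0.0, 0.0])
refined = multi_stage_extract([2.0, 2.0], [0.0, 0.0], toy_extract, num_stages=2)
```

With equal logits the fusion reduces to a plain average. In a trained system the logits would be optimized jointly with the rest of the network, and `extract_fn` would be a neural extractor rather than the toy averaging function used here.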
