Hierarchical speaker representation for target speaker extraction

Paper title

Hierarchical speaker representation for target speaker extraction

Authors

He, Shulin; Zhang, Huaiwen; Rao, Wei; Zhang, Kanghao; Ju, Yukai; Yang, Yang; Zhang, Xueliang

Abstract

Target speaker extraction aims to isolate a specific speaker's voice from a mixture of multiple sound sources, guided by an enrollment utterance, also called an anchor. Current methods predominantly derive a speaker embedding from the anchor and integrate it into the separation network to separate the target speaker's voice. However, this speaker embedding is an overly simple representation, often merely a 1×1024 vector, and information packed so densely is difficult for the separation network to exploit effectively. To address this limitation, we introduce a method called Hierarchical Representation (HR) that fuses anchor information at both fine-grained and global levels across 5 layers of the separation network, improving the precision of target extraction. HR thereby makes more effective use of the anchor to improve target speaker isolation. On the Libri-2talker dataset, HR substantially outperforms state-of-the-art time-frequency domain techniques. Further demonstrating HR's capabilities, we achieved first place in the ICASSP 2023 Deep Noise Suppression Challenge. The proposed HR method shows promise for advancing target speaker extraction through better anchor utilization.
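
The idea in the abstract of injecting anchor information at several layers, rather than once as a single embedding vector, can be illustrated with a toy sketch. This is not the authors' architecture: the function names (`mean_pool`, `attend`, `toy_layer`, `hierarchical_fusion`), the additive fusion rule, the tanh stand-in for a network layer, and the random per-layer anchor features are all assumptions made for illustration. The sketch also assumes that the two levels named in the abstract correspond to a per-frame cue and an utterance-level cue.

```python
import math
import random


def mean_pool(frames):
    """Global cue (assumed): average anchor frames over time into one vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def attend(query, frames):
    """Local cue (assumed): softmax-weighted sum of anchor frames,
    weighted by dot-product similarity to the query frame."""
    scores = [sum(q * a for q, a in zip(query, f)) for f in frames]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) / total
            for d in range(dim)]


def toy_layer(frames, scale):
    """Hypothetical stand-in for one separation-network layer."""
    return [[math.tanh(scale * v) for v in frame] for frame in frames]


def hierarchical_fusion(mixture, anchor_per_layer, scales):
    """Fuse anchor cues into the mixture features at every layer."""
    x = mixture
    for anchor_frames, scale in zip(anchor_per_layer, scales):
        global_cue = mean_pool(anchor_frames)
        fused = []
        for frame in x:
            local_cue = attend(frame, anchor_frames)
            fused.append([v + g + l
                          for v, g, l in zip(frame, global_cue, local_cue)])
        x = toy_layer(fused, scale)
    return x


# Toy data: 6 mixture frames, 3 anchor frames per layer, 4 features, 5 layers.
random.seed(0)
DIM, N_LAYERS = 4, 5
mixture = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]
anchor_per_layer = [
    [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
    for _ in range(N_LAYERS)
]
out = hierarchical_fusion(mixture, anchor_per_layer, scales=[1.0] * N_LAYERS)
print(len(out), len(out[0]))  # prints "6 4": the frame and feature counts are preserved
```

In a real system the per-layer anchor features would presumably come from a learned anchor encoder and the fusion would be learned as well. The sketch only shows the control flow of conditioning every layer on the anchor.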
