A Unified Deep Speaker Embedding Framework for Mixed-Bandwidth Speech Data

Paper Title

A Unified Deep Speaker Embedding Framework for Mixed-Bandwidth Speech Data

Authors

Weicheng Cai, Ming Li

Abstract

This paper proposes a unified deep speaker embedding framework for modeling speech data with different sampling rates. Considering the narrowband spectrogram as a sub-image of the wideband spectrogram, we tackle the joint modeling problem of the mixed-bandwidth data in an image classification manner. From this perspective, we elaborate several mixed-bandwidth joint training strategies under different training and test data scenarios. The proposed systems are able to flexibly handle the mixed-bandwidth speech data in a single speaker embedding model without any additional downsampling, upsampling, bandwidth extension, or padding operations. We conduct extensive experimental studies on the VoxCeleb1 dataset. Furthermore, the effectiveness of the proposed approach is validated by the SITW and NIST SRE 2016 datasets.
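
The sub-image view in the abstract can be illustrated with a small sketch. It assumes a short-time Fourier transform with a fixed 25 ms window and an FFT size equal to the window length; neither setting is taken from the paper, and the helper name `stft_freqs` is hypothetical. Under those assumptions, the frequency bins of an 8 kHz (narrowband) spectrogram line up with the lowest bins of a 16 kHz (wideband) spectrogram, so the narrowband spectrogram occupies the lower rows of the wideband one.

```python
import numpy as np

def stft_freqs(sample_rate, win_ms=25.0):
    """Center frequencies (Hz) of the STFT bins for a fixed window duration.

    Assumption: the FFT size equals the window length in samples.
    """
    n_fft = int(round(sample_rate * win_ms / 1000.0))
    return np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

wide = stft_freqs(16000)    # wideband grid, 0 Hz to 8000 Hz
narrow = stft_freqs(8000)   # narrowband grid, 0 Hz to 4000 Hz

# With the same window duration both grids share the same bin spacing,
# so the narrowband bins coincide with the lowest wideband bins.
n_shared = len(narrow)
aligned = bool(np.allclose(wide[:n_shared], narrow))

print(len(wide), len(narrow), aligned)
```

With a different window or FFT configuration the two grids may not align bin for bin, so this is only one way to realize the sub-image relationship described above.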
