Combination of Deep Speaker Embeddings for Diarisation

Paper Title

Combination of Deep Speaker Embeddings for Diarisation

Authors

Guangzhi Sun, Chao Zhang, Phil Woodland

Abstract

Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, which rely on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is also proposed in this paper, which uses NNs to achieve voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4 to 10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets respectively, and a relative reduction of 15% in SER is observed on RT05, which shows the robustness of the proposed methods. By incorporating VoxCeleb data into the training set, the best c-vector system achieved 7%, 17% and 16% relative SER reductions compared to the d-vector on the AMI dev, eval, and RT05 sets respectively.
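
To make the idea of a gated additive combination more concrete, here is a minimal sketch in plain Python. The abstract names the structure but does not give its equations, so everything in it is an assumption for illustration: the function name `gated_additive_combine`, the restriction to two d-vectors, and the form g = sigmoid(W [d1; d2] + b), c = g * d1 + (1 - g) * d2. It should not be read as the authors' implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_additive_combine(d1, d2, weights, bias):
    """Combine two equal-length d-vectors with an element-wise gate.

    Assumed (hypothetical) form, not taken from the paper:
        g = sigmoid(W [d1; d2] + b)
        c = g * d1 + (1 - g) * d2
    `weights` has one row per output dimension, each of length 2 * len(d1).
    """
    if len(d1) != len(d2):
        raise ValueError("d-vectors must have the same dimension")
    if len(weights) != len(d1) or len(bias) != len(d1):
        raise ValueError("need one weight row and one bias per dimension")

    concat = list(d1) + list(d2)
    combined = []
    for i in range(len(d1)):
        # Pre-activation of the gate for dimension i.
        pre = bias[i] + sum(w * x for w, x in zip(weights[i], concat))
        g = sigmoid(pre)
        # The gate interpolates between the two embeddings.
        combined.append(g * d1[i] + (1.0 - g) * d2[i])
    return combined


if __name__ == "__main__":
    d1 = [1.0, 2.0]
    d2 = [3.0, 4.0]
    zero_w = [[0.0] * 4, [0.0] * 4]
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(gated_additive_combine(d1, d2, zero_w, [0.0, 0.0]))
```

In a trained system the gate parameters would be learned jointly with the speaker classifier; here they are passed in by hand only to show how the gate trades off the two embeddings per dimension.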

