Novel Architectures for Unsupervised Information Bottleneck based Speaker Diarization of Meetings

Paper title

Novel Architectures for Unsupervised Information Bottleneck based Speaker Diarization of Meetings

Authors

Dawalatabad, Nauman; Madikeri, Srikanth; Sekhar, C. Chandra; Murthy, Hema A.

Abstract

Speaker diarization is an important and topical problem, and is especially useful as a preprocessing step for applications involving conversational speech. The objective of this paper is two-fold: (i) segment initialization that distributes speaker information uniformly across the initial segments, and (ii) incorporation of speaker discriminative features within the unsupervised diarization framework. In the first part of the work, a varying-length segment initialization technique is proposed for an Information Bottleneck (IB) based speaker diarization system, using phoneme rate as the side information. This initialization distributes speaker information uniformly across the segments and provides a better starting point for IB based clustering. In the second part of the work, we present a Two-Pass Information Bottleneck (TPIB) based speaker diarization system that incorporates speaker discriminative features during the diarization process. The TPIB based system improves on the baseline IB based system. During the first pass of the TPIB system, a coarse segmentation is performed using IB based clustering. The resulting alignments are used to generate speaker discriminative features with a shallow feed-forward neural network and linear discriminant analysis. These discriminative features are used in the second pass to obtain the final speaker boundaries. In the final part of the paper, variable segment initialization is combined with the TPIB framework. This combines the advantages of better segment initialization and speaker discriminative features, which results in an additional improvement in performance. An evaluation on standard meeting datasets shows significant absolute improvements of 3.9% and 4.7% on the NIST and AMI datasets, respectively.
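
The varying-length segment initialization mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes, purely for illustration, that phoneme start times are available from some phoneme recognizer and that each initial segment should hold the same number of phonemes, so that segment duration shrinks when the phoneme rate is high and grows when it is low. The function name `init_segments` and the parameter `phonemes_per_segment` are hypothetical.

```python
from typing import List, Tuple


def init_segments(phoneme_starts: List[float], end_time: float,
                  phonemes_per_segment: int = 3) -> List[Tuple[float, float]]:
    """Group consecutive phonemes into segments with an equal phoneme count.

    Hypothetical helper: the fixed count per segment is an assumption made
    for illustration, not the initialization rule used in the paper.
    """
    if phonemes_per_segment < 1:
        raise ValueError("phonemes_per_segment must be >= 1")
    segments = []
    for i in range(0, len(phoneme_starts), phonemes_per_segment):
        start = phoneme_starts[i]
        nxt = i + phonemes_per_segment
        # A segment ends where the next group starts, or at the end of the audio.
        end = phoneme_starts[nxt] if nxt < len(phoneme_starts) else end_time
        segments.append((start, end))
    return segments


# Made-up timings: fast speech up to 0.6 s, then slower speech up to 3.0 s.
starts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6]
segments = init_segments(starts, end_time=3.0, phonemes_per_segment=3)
for start, end in segments:
    print(f"{start:.1f} s to {end:.1f} s")
```

With these made-up timings, the fast-speech region yields short segments and the slow-speech region yields long ones, while every segment carries three phonemes. Segments of this kind would then serve as the starting units for clustering.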
