The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

Paper title

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

Authors

Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee

Abstract

We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component of the speaker diarization system that we submitted to the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge. These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and heavily reverberant, noisy conditions. First, for data preparation and augmentation in training TS-VAD models, we use speech data containing both real meetings and simulated indoor conversations. Second, to refine the results obtained after TS-VAD-based decoding, we perform a series of post-processing steps that improve the VAD results and thereby reduce diarization error rates (DERs). Tested on the AliMeeting corpus, the newly released Mandarin meeting dataset used in M2MeT, our proposed system reduces the DER by up to 66.55% and 60.59% relative on the Eval and Test sets, respectively, compared with classical clustering-based diarization.
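
As a reading aid for the relative figures quoted in the abstract, the minimal sketch below shows how a relative DER reduction is conventionally computed. The function name `relative_reduction` and the example DER values are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
def relative_reduction(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction in percent.

    Computed as 100 * (baseline - system) / baseline.
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical DER values (in percent), not results from the paper.
print(round(relative_reduction(20.0, 5.0), 2))  # prints 75.0
```

Under this convention, a 66.55% relative reduction means the system's DER is roughly one third of the clustering baseline's DER on the same set.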
