Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

Paper title

Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

Authors

Wilkinson, N., Biswas, A., Yılmaz, E., de Wet, F., van der Westhuizen, E., Niesler, T. R.

Abstract

This paper considers the impact of automatic segmentation on the fully-automatic, semi-supervised training of automatic speech recognition (ASR) systems for five-lingual code-switched (CS) speech. Four automatic segmentation techniques were evaluated in terms of the recognition performance of an ASR system trained on the resulting segments in a semi-supervised manner. The system's output was compared with the recognition rates achieved by a semi-supervised system trained on manually assigned segments. Three of the automatic techniques use a newly proposed convolutional neural network (CNN) model for framewise classification, and include a novel form of HMM smoothing of the CNN outputs. Automatic segmentation was applied in combination with automatic speaker diarization. The best-performing segmentation technique was also tested without speaker diarization. An evaluation based on 248 unsegmented soap opera episodes indicated that voice activity detection (VAD) based on a CNN followed by Gaussian mixture model-hidden Markov model smoothing (CNN-GMM-HMM) yields the best ASR performance. The semi-supervised system trained with the resulting segments achieved an overall WER improvement of 1.1% absolute over the system trained with manually created segments. Furthermore, we found that system performance improved even further when the automatic segmentation was used in conjunction with speaker diarization.
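The abstract mentions HMM smoothing of framewise CNN outputs but gives no detail. As a rough illustration of the general idea only, and not the authors' CNN-GMM-HMM method, the sketch below runs Viterbi decoding over a two-state (non-speech/speech) HMM whose emissions are taken directly from per-frame speech probabilities. The function name `viterbi_smooth`, the `stay_prob` parameter, and the use of raw posteriors as emission scores (with no GMM) are all assumptions made for this example.

```python
import math


def viterbi_smooth(speech_probs, stay_prob=0.9):
    """Smooth framewise speech probabilities with a two-state HMM.

    States: 0 = non-speech, 1 = speech. Returns one 0/1 label per frame.
    Illustrative assumption: each frame's probability is used directly
    as the emission score, and both states share the same self-transition
    probability `stay_prob`.
    """
    if not speech_probs:
        return []

    eps = 1e-10
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    def emission(p):
        # Clamp to avoid log(0), then return (log P(non-speech), log P(speech)).
        p = min(max(p, eps), 1.0 - eps)
        return (math.log(1.0 - p), math.log(p))

    # Uniform prior over the two states, so it is omitted from the scores.
    score = list(emission(speech_probs[0]))
    backptrs = []

    for p in speech_probs[1:]:
        emit = emission(p)
        new_score = [0.0, 0.0]
        ptr = [0, 0]
        for s in (0, 1):
            from_same = score[s] + log_stay
            from_other = score[1 - s] + log_switch
            if from_same >= from_other:
                new_score[s], ptr[s] = from_same + emit[s], s
            else:
                new_score[s], ptr[s] = from_other + emit[s], 1 - s
        score = new_score
        backptrs.append(ptr)

    # Backtrace from the best final state.
    state = 0 if score[0] >= score[1] else 1
    labels = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        labels.append(state)
    labels.reverse()
    return labels
```

With a high `stay_prob`, a single low-probability frame inside a run of speech is relabelled as speech, whereas simple thresholding would split the segment at that frame. A genuine change from non-speech to speech that persists over several frames is still kept.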
