Paper Title

Weakly Supervised Construction of ASR Systems with Massive Video Data

Authors

Mengli Cheng, Chengyu Wang, Xu Hu, Jun Huang, Xiaobo Wang

Abstract

Building Automatic Speech Recognition (ASR) systems from scratch is challenging, mostly because annotating a large amount of audio data with transcripts is time-consuming and expensive. Although several unsupervised pre-training models have been proposed, applying such models directly may still be sub-optimal if more labeled training data can be obtained at low cost. In this paper, we present a weakly supervised framework for constructing ASR systems with massive video data. As videos often contain human speech aligned with subtitles, we treat videos as an important knowledge source and propose an effective approach, based on Optical Character Recognition (OCR), to extract high-quality audio aligned with transcripts from videos. After weakly supervised pre-training, the underlying ASR model can be fine-tuned to fit any domain-specific target training dataset. Extensive experiments show that our framework can produce state-of-the-art results on six public datasets for Mandarin speech recognition.
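The abstract does not spell out how subtitles are turned into transcripts with time boundaries, so the following is only an illustrative sketch of one plausible step, not the authors' pipeline. It assumes that some OCR stage has already produced a subtitle string for each sampled video frame; the function name `merge_subtitle_frames`, the `(timestamp, text)` input format, and the `frame_step` parameter are all hypothetical. Consecutive frames showing the same subtitle are merged into one timed segment, which could then be used to cut the matching audio clip.

```python
from typing import List, Optional, Tuple

Frame = Tuple[float, str]            # (timestamp in seconds, OCR text); assumed input format
Segment = Tuple[float, float, str]   # (start, end, transcript)


def merge_subtitle_frames(frames: List[Frame], frame_step: float) -> List[Segment]:
    """Merge consecutive frames with identical subtitle text into timed segments.

    `frames` must be sorted by timestamp; `frame_step` is the sampling interval
    between frames (a hypothetical parameter for this sketch).
    """
    segments: List[Segment] = []
    current_text: Optional[str] = None
    start = end = 0.0

    for timestamp, raw_text in frames:
        text = raw_text.strip()
        if text and text == current_text:
            # Same subtitle still on screen: extend the open segment.
            end = timestamp + frame_step
            continue
        if current_text:
            # Subtitle changed or disappeared: close the open segment.
            segments.append((start, end, current_text))
        # Open a new segment, or none if this frame has no subtitle.
        current_text = text or None
        start, end = timestamp, timestamp + frame_step

    if current_text:
        segments.append((start, end, current_text))
    return segments


if __name__ == "__main__":
    demo = [(0.0, "你好"), (0.5, "你好"), (1.0, ""), (1.5, "谢谢"), (2.0, "谢谢")]
    for seg in merge_subtitle_frames(demo, frame_step=0.5):
        print(seg)
```

A real system would also need to tolerate OCR noise between frames (for example by comparing texts with an edit-distance threshold rather than exact equality) and to filter out segments where the speech does not actually match the subtitle; both are left out of this sketch.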
