Paper title
ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications
Authors
Abstract
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. To incorporate these novel technologies into ATC, a low-resource domain, large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to a lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.