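ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications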

Paper title

ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Authors

Zuluaga-Gomez, Juan; Veselý, Karel; Szöke, Igor; Blatt, Alexander; Motlicek, Petr; Kocour, Martin; Rigault, Mickael; Choukri, Khalid; Prasad, Amrutha; Sarfjoo, Seyyed Saeed; Nigmatulina, Iuliia; Cevenini, Claudia; Kolčárek, Pavel; Tart, Allan; Černocký, Jan; Klakow, Dietrich

Abstract

Personal assistants, automatic speech recognizers, and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very high frequency radio channels. To incorporate these novel technologies into ATC, a low-resource domain, large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research in the challenging ATC field, which has lagged behind due to a lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate, and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect that the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.

Paper title