Finnish Parliament ASR corpus - Analysis, benchmarks and statistics

Paper title

Finnish Parliament ASR corpus - Analysis, benchmarks and statistics

Authors

Anja Virkkunen, Aku Rouhe, Nhan Phan, Mikko Kurimo

Abstract

Public sources like parliament meeting recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish parliament ASR corpus, the largest publicly available collection of manually transcribed speech data for Finnish with over 3000 hours of speech and 449 speakers for which it provides rich demographic metadata. This corpus builds on earlier initial work, and as a result the corpus has a natural split into two training subsets from two periods of time. Similarly, there are two official, corrected test sets covering different times, setting an ASR task with longitudinal distribution-shift characteristics. An official development set is also provided. We develop a complete Kaldi-based data preparation pipeline, and hidden Markov model (HMM), hybrid deep neural network (HMM-DNN) and attention-based encoder-decoder (AED) ASR recipes. We set benchmarks on the official test sets, as well as multiple other recently used test sets. Both temporal corpus subsets are already large, and we observe that beyond their scale, ASR performance on the official test sets plateaus, whereas other domains benefit from added data. The HMM-DNN and AED approaches are compared in a carefully matched equal data setting, with the HMM-DNN system consistently performing better. Finally, the variation of the ASR accuracy is compared between the speaker categories available in the parliament metadata to detect potential biases based on factors such as gender, age, and education.
