Paper Title
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding
Paper Authors
Paper Abstract
Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá -- a real low-resource language spoken in Nigeria. We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yorùbá utterances. This enables cross-lingual keyword localisation: a written English query is detected and located in Yorùbá speech. To quantify the effect of the smaller dataset, we compare our model to English systems trained on similar amounts of data and on more data. We hope that this new dataset will stimulate research in the use of VGS models for real low-resource languages.
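To make the described setup concrete, below is a minimal sketch (not the authors' released implementation) of an attention-based cross-lingual keyword detection and localisation model. It assumes that each Yorùbá utterance is represented as log-Mel features, that the paired image has been passed through an off-the-shelf visual tagger to produce a multi-hot vector of English visual labels used as weak training targets, and that written English queries come from a fixed keyword vocabulary. The class name, layer sizes, and keyword count are illustrative, not taken from the paper.

```python
# Hedged sketch of an attention-based visually grounded keyword localiser.
# Assumptions (not from the paper): log-Mel inputs of shape (batch, time, n_mels),
# multi-hot English visual tags from an image tagger, illustrative hyperparameters.
import torch
import torch.nn as nn


class CrossLingualKeywordLocaliser(nn.Module):
    def __init__(self, n_mels=40, hidden=256, n_keywords=100):
        super().__init__()
        # Frame-level speech encoder (architecture and sizes are illustrative).
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # One learned query vector per written English keyword.
        self.keyword_emb = nn.Embedding(n_keywords, hidden)

    def forward(self, mels):
        # mels: (batch, time, n_mels) -> frame embeddings (batch, time, hidden).
        frames = self.encoder(mels.transpose(1, 2)).transpose(1, 2)
        # Score every keyword query against every speech frame:
        # (batch, time, hidden) @ (hidden, n_keywords) -> (batch, time, n_keywords).
        scores = frames @ self.keyword_emb.weight.T
        # Attention weights over time indicate where a keyword occurs (localisation) ...
        alpha = scores.softmax(dim=1)
        # ... and attention-pooled scores give utterance-level detection logits.
        logits = (alpha * scores).sum(dim=1)
        return logits, alpha


# Training uses the image tagger's English labels as weak multi-label targets,
# so no Yorùbá transcriptions are required. At test time, the detection logit
# answers whether the English query occurs in the Yorùbá utterance, and the
# attention weights alpha point to where it occurs.
model = CrossLingualKeywordLocaliser()
mels = torch.randn(8, 500, 40)                # 8 utterances, 500 frames each
tags = torch.randint(0, 2, (8, 100)).float()  # multi-hot visual tags (assumed)
logits, alpha = model(mels)
loss = nn.functional.binary_cross_entropy_with_logits(logits, tags)
```

The key design choice this sketch illustrates is that detection and localisation share the same attention mechanism: the pooled score is trained with only image-level supervision, while the per-frame attention weights fall out as a by-product and can be read off to locate the keyword in the Yorùbá speech.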