Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

Paper title

Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

Authors

Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud, Shantipriya Parida, Shamsuddeen Hassan Muhammad, Ibrahim Sa'id Ahmad, Subhadarshi Panda, Ondřej Bojar, Bashir Shehu Galadanci, Bello Shehu Bello

Abstract

Multi-modal Machine Translation (MMT) uses visual information to improve translation quality. The visual information serves as valuable context that reduces the ambiguity of input sentences. Despite the increasing popularity of this technique, good and sizeable datasets are scarce, which limits its full potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers. This is more than any of the other Chadic languages. Despite its large number of speakers, Hausa is considered a low-resource language in natural language processing (NLP), because sufficient resources to implement most NLP tasks are lacking. While some datasets exist, they are either scarce, machine-generated, or in the religious domain. Therefore, there is a need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image or a section within the image in Hausa and its equivalent in English. To prepare the dataset, we started by translating the English descriptions of the images in the Hindi Visual Genome (HVG) into Hausa automatically. Afterward, the synthetic Hausa data was carefully post-edited with reference to the respective images. The dataset comprises 32,923 images and their descriptions, divided into training, development, test, and challenge test sets. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.

Paper title