Paper Title

MaXM: Towards Multilingual Visual Question Answering

Paper Authors

Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish V. Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut

Paper Abstract

Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple, lightweight, and effective approach as well as benchmark state-of-the-art English and multilingual VQA models. We hope that our benchmark encourages further research on mVQA.
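
The abstract describes the translation-based data-generation framework only at a high level. The short Python sketch below is a rough illustration of that general idea, not the paper's actual pipeline: it expands English question-answer pairs into several target languages through a placeholder translate() helper. The VQAExample layout and the translate() function are assumptions made purely for illustration.

```python
# A minimal, hypothetical sketch of translation-based mVQA data generation:
# start from English question-answer pairs grounded in an image, machine-
# translate the questions (and answers) into target languages, and keep the
# image grounding aligned. The data layout and translate() helper below are
# assumptions for illustration, not the paper's actual pipeline.

from dataclasses import dataclass
from typing import List


@dataclass
class VQAExample:
    image_id: str
    question: str
    answer: str
    language: str


def translate(text: str, target_lang: str) -> str:
    """Placeholder for any machine-translation backend (assumed, not specified
    here). Returns a tagged copy of the input so the sketch stays runnable."""
    return f"[{target_lang}] {text}"


def generate_multilingual_vqa(
    english_examples: List[VQAExample], target_langs: List[str]
) -> List[VQAExample]:
    """Expand English VQA examples into multiple languages by translating the
    question and answer into each target language while keeping the image id."""
    out: List[VQAExample] = []
    for ex in english_examples:
        for lang in target_langs:
            out.append(
                VQAExample(
                    image_id=ex.image_id,
                    question=translate(ex.question, lang),
                    answer=translate(ex.answer, lang),
                    language=lang,
                )
            )
    return out


if __name__ == "__main__":
    seed = [VQAExample("img_001", "What color is the bus?", "red", "en")]
    for example in generate_multilingual_vqa(seed, ["fr", "hi", "th"]):
        print(example)
```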
