Paper Title

Teaching Structured Vision&Language Concepts to Vision&Language Models

Authors

Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

Abstract

Vision and Language (VL) models have demonstrated remarkable zero-shot performance on a variety of tasks. However, some aspects of complex language understanding remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding, with only a mild degradation in their zero-shot capabilities, both when training from scratch and when fine-tuning a pre-trained model.
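The core technique described in the abstract is manipulating the text side of existing image-caption pairs, guided by language structure, so the model is forced to attend to attributes, relations, and states. Below is a minimal illustrative sketch of one plausible such manipulation, an attribute swap that turns a matching caption into a hard negative. It is not the authors' pipeline: the color vocabulary and the token-level swap rule are assumptions made here for illustration, whereas the paper's manipulations rely on fuller language-structure analysis of the captions.

```python
# Minimal sketch (not the authors' released code) of a rule-based text
# manipulation: swapping a color attribute so the caption no longer
# matches its paired image. COLORS and the swap rule are illustrative
# assumptions for this sketch.
import random

COLORS = {"red", "blue", "green", "yellow", "black", "white", "brown"}

def swap_color_attribute(caption: str, rng: random.Random) -> str | None:
    """Return a copy of `caption` with its first color word replaced by a
    different color, yielding a hard-negative text for the paired image.
    Returns None if the caption contains no known color word."""
    tokens = caption.split()
    for i, tok in enumerate(tokens):
        word = tok.lower().strip(".,!?")
        if word in COLORS:
            new_color = rng.choice(sorted(COLORS - {word}))
            tokens[i] = new_color + tok[len(word):]  # keep trailing punctuation
            return " ".join(tokens)
    return None

rng = random.Random(0)
print(swap_color_attribute("A red bus parked next to a white building.", rng))
# e.g. -> "A yellow bus parked next to a white building."
```

In training, each manipulated caption would be paired with its original image as an extra negative in the contrastive objective, so that ignoring the attribute word becomes costly for the model.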
