Paper Title

Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?

Authors

Mitja Nikolaus, Emmanuelle Salin, Stephane Ayache, Abdellah Fourtassi, Benoit Favre

Abstract

Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet, the exact capabilities of these black-box models are still poorly understood. While much of previous work has focused on studying their ability to learn meaning at the word-level, their ability to track syntactic dependencies between words has received less attention. We take a first step in closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only sheer quantity) of pretraining data is essential. Additionally, the best performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives. This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
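To make the evaluation setup concrete, below is a minimal, hypothetical sketch of a controlled minimal-pair probe in the spirit of the paper: an image is scored against a matching sentence and a sentence that differs only in the predicate, and the model is credited when the matching sentence scores higher. CLIP is used here purely as a readily available image-text scorer, not as one of the Transformers evaluated in the paper; the file name and example sentences are invented for illustration.

```python
# Hypothetical minimal-pair probe for grounded predicate-noun dependencies.
# CLIP is a stand-in scorer, not one of the models evaluated in the paper.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat_sitting.jpg")  # hypothetical image of a sitting cat
candidates = [
    "A cat is sitting.",  # sentence with the matching predicate-noun pair
    "A cat is barking.",  # minimally different distractor predicate
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity scores

# Count a success when the matching sentence outscores the distractor;
# chance-level performance on such pairs is 50%.
correct = logits[0, 0] > logits[0, 1]
print("correct" if correct else "incorrect")
```

Because the two candidate sentences share the same noun and differ only in the predicate, a model can only pass consistently if it grounds the predicate-noun dependency in the image rather than relying on word-level cues alone.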
