Paper Title
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Paper Authors
Paper Abstract
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
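To make the core idea concrete, below is a minimal sketch of one training step that applies adversarial perturbations in the embedding space with a KL-divergence regularizer, in the spirit described by the abstract. It is not the authors' implementation: the model interface (`model(img_emb, txt_emb)` returning classification logits), the single-step perturbation update (the paper's "free" strategy amortizes this cost by reusing gradients), and all hyperparameters are hypothetical placeholders.

```python
# Illustrative sketch only: adversarial training on modality embeddings
# plus a KL term encouraging invariance between clean and perturbed predictions.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, img_emb, txt_emb, labels, optimizer,
                              adv_lr=1e-3, adv_eps=1e-2, kl_weight=1.0):
    """One step: clean loss + adversarial loss + KL regularizer (hypothetical interface)."""
    # Clean forward pass on unperturbed embeddings.
    clean_logits = model(img_emb, txt_emb)
    clean_loss = F.cross_entropy(clean_logits, labels)

    # Small random perturbation on the text embeddings; the same recipe
    # applies to the image-region embeddings.
    delta = torch.zeros_like(txt_emb).uniform_(-adv_eps, adv_eps)
    delta.requires_grad_()

    # Forward pass with perturbed embeddings and one ascent step on delta.
    adv_logits = model(img_emb, txt_emb + delta)
    adv_loss = F.cross_entropy(adv_logits, labels)
    grad = torch.autograd.grad(adv_loss, delta, retain_graph=True)[0]
    delta = (delta + adv_lr * grad.sign()).clamp(-adv_eps, adv_eps).detach()

    # Re-evaluate with the updated perturbation; the KL term pulls the
    # perturbed prediction toward the clean prediction.
    adv_logits = model(img_emb, txt_emb + delta)
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits.detach(), dim=-1),
                  reduction="batchmean")
    loss = clean_loss + F.cross_entropy(adv_logits, labels) + kl_weight * kl

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```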