Paper Title
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Paper Authors
Paper Abstract
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
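To make the core idea concrete, below is a minimal sketch of one training step that applies adversarial perturbations in the embedding space with a KL-divergence regularizer, in the spirit described by the abstract. It is not the authors' implementation: the model interface (`model(img_emb, txt_emb)` returning classification logits), the single-step perturbation update (the paper's "free" strategy amortizes this cost by reusing gradients), and all hyperparameters are hypothetical placeholders.

```python
# Illustrative sketch only: adversarial training on modality embeddings
# plus a KL term encouraging invariance between clean and perturbed predictions.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, img_emb, txt_emb, labels, optimizer,
                              adv_lr=1e-3, adv_eps=1e-2, kl_weight=1.0):
    """One step: clean loss + adversarial loss + KL regularizer (hypothetical interface)."""
    # Clean forward pass on unperturbed embeddings.
    clean_logits = model(img_emb, txt_emb)
    clean_loss = F.cross_entropy(clean_logits, labels)

    # Small random perturbation on the text embeddings; the same recipe
    # applies to the image-region embeddings.
    delta = torch.zeros_like(txt_emb).uniform_(-adv_eps, adv_eps)
    delta.requires_grad_()

    # Forward pass with perturbed embeddings and one ascent step on delta.
    adv_logits = model(img_emb, txt_emb + delta)
    adv_loss = F.cross_entropy(adv_logits, labels)
    grad = torch.autograd.grad(adv_loss, delta, retain_graph=True)[0]
    delta = (delta + adv_lr * grad.sign()).clamp(-adv_eps, adv_eps).detach()

    # Re-evaluate with the updated perturbation; the KL term pulls the
    # perturbed prediction toward the clean prediction.
    adv_logits = model(img_emb, txt_emb + delta)
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits.detach(), dim=-1),
                  reduction="batchmean")
    loss = clean_loss + F.cross_entropy(adv_logits, labels) + kl_weight * kl

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```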