Paper Title

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Authors

Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu

Abstract

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
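
The idea of perturbing embeddings rather than pixels or tokens can be illustrated with a toy sketch. This is not the VILLA implementation: it assumes a hypothetical linear classifier over a single embedding vector, takes one normalized gradient-ascent step rather than the "free" multi-step strategy, and uses a symmetric KL term as one possible form of the regularizer. All function names and the `eps` and `alpha` parameters are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_fn(W, x):
    # Toy stand-in for a V+L model: one weight row per class (assumption).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def kl(p, q):
    # KL(p || q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def adversarial_loss(W, x, label, eps=0.1, alpha=1.0):
    """Clean loss + adversarial loss + alpha * symmetric KL (illustrative)."""
    p_clean = softmax(logits_fn(W, x))

    # For a linear model, d(cross-entropy)/dx = W^T (p - onehot).
    err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(p_clean)]
    grad = [sum(W[k][j] * err[k] for k in range(len(W))) for j in range(len(x))]

    # One ascent step in embedding space, scaled to L2 norm eps.
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    delta = [eps * g / norm for g in grad]
    x_adv = [xi + di for xi, di in zip(x, delta)]
    p_adv = softmax(logits_fn(W, x_adv))

    ce_clean = -math.log(p_clean[label])
    ce_adv = -math.log(p_adv[label])
    # Symmetric KL between clean and perturbed predictions (assumed form).
    kl_sym = kl(p_clean, p_adv) + kl(p_adv, p_clean)
    total = ce_clean + ce_adv + alpha * kl_sym
    return total, ce_clean, ce_adv, kl_sym

# Hypothetical two-class example on a 2-d embedding.
W = [[1.0, 0.0], [0.0, 1.0]]
x = [0.5, -0.2]
total, ce_clean, ce_adv, kl_sym = adversarial_loss(W, x, label=0)
print(total, ce_clean, ce_adv, kl_sym)
```

In this sketch the perturbation is applied to the embedding `x` directly, so the adversarial cross-entropy is higher than the clean one, and the KL term penalizes how far the perturbed prediction drifts from the clean prediction.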
