Paper Title

Multimodal Knowledge Alignment with Reinforcement Learning

Paper Authors

Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, JaeSung Park, Ximing Lu, Prithviraj Ammanabrolu, Rowan Zellers, Ronan Le Bras, Gunhee Kim, Yejin Choi

Paper Abstract

Large language models readily adapt to novel settings, even without task-specific training data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we propose ESPER which extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, in the image case our reward optimization relies only on cosine similarity derived from CLIP, and thus requires no additional explicitly paired (image, caption) data. Because the parameters of the language model are left unchanged, the model maintains its capacity for zero-shot generalization. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks; these include a new benchmark we collect+release, ESP dataset, which tasks models with generating several diversely-styled captions for each image.
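To make the reward described above concrete, the sketch below computes a CLIP cosine-similarity score between an image and candidate captions, the kind of signal ESPER could use as an RL reward in place of paired (image, caption) supervision. This is a minimal illustration, not the authors' implementation: it assumes the Hugging Face transformers CLIP API, and the placeholder image and captions are made up for the example.

```python
# Minimal sketch of a CLIP cosine-similarity reward signal (not the authors' code).
# Assumes the Hugging Face `transformers` CLIP API; the gray image and the two
# candidate captions below are placeholders for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Return the cosine similarity between the image and each candidate caption.

    In an RL loop, this scalar would score the (frozen) language model's
    generations, so no explicitly paired (image, caption) data is required.
    """
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_embeds = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    return torch.nn.functional.cosine_similarity(image_embeds, text_embeds)

# Example: score two candidate captions against a placeholder image.
dummy_image = Image.new("RGB", (224, 224), color="gray")
print(clip_reward(dummy_image, ["a photo of a dog", "a plain gray square"]))
```

Because only this reward (and, in ESPER, lightweight adapter parameters) drives training while the language model's weights stay fixed, the zero-shot behavior of the underlying model is preserved.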
