Paper Title

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Authors

Hao Tan, Mohit Bansal

Abstract

Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models are publicly available at https://github.com/airsplay/vokenization
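The abstract describes two steps: contextually mapping each language token to its most relevant image (its "voken") via a token-image relevance score, and then training the language model with an extra voken-prediction signal alongside masked language modeling. Below is a minimal sketch of those two steps, assuming the relevance score is an inner product of normalized token and image features; the dimensions, the random features standing in for the text/image encoders, and the helper names are illustrative placeholders, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch: contextual token-to-image retrieval ("vokenization") and a
# voken-classification loss. All shapes and names below are illustrative assumptions.
import torch
import torch.nn.functional as F

def vokenize(token_embeds, image_embeds):
    """Assign each contextualized token the index of its most relevant image (its voken).

    token_embeds: (num_tokens, dim) contextual token features from a text encoder
    image_embeds: (num_images, dim) features for a candidate image set
    """
    tok = F.normalize(token_embeds, dim=-1)   # relevance as inner product of normalized features
    img = F.normalize(image_embeds, dim=-1)
    scores = tok @ img.t()                    # (num_tokens, num_images) relevance scores
    return scores.argmax(dim=-1)              # one voken id per token

def voken_classification_loss(hidden_states, voken_ids, voken_head):
    """Cross-entropy over the voken set, added to the usual masked-LM loss during pre-training."""
    logits = voken_head(hidden_states)        # (num_tokens, num_vokens)
    return F.cross_entropy(logits, voken_ids)

# Toy usage with random tensors standing in for encoder outputs.
torch.manual_seed(0)
tokens = torch.randn(12, 256)                 # 12 tokens of one sentence
images = torch.randn(5000, 256)               # candidate image set
vokens = vokenize(tokens, images)             # voken id per token

voken_head = torch.nn.Linear(256, 5000)       # classifier over the voken set
loss = voken_classification_loss(tokens, vokens, voken_head)
print(vokens.shape, loss.item())
```

In the paper's pipeline the vokenizer is fit on image-captioning data, then applied offline to a large text-only corpus, so the language model only ever sees the generated voken ids as supervision targets.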
