Paper Title

TVLT: Textless Vision-Language Transformer

Paper Authors

Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

Paper Abstract

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR). TVLT is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding) and contrastive modeling to align video and audio. TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering, image retrieval, video retrieval, and multimodal sentiment analysis, with 28x faster inference speed and only 1/3 of the parameters. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without assuming the prior existence of text. Our code and checkpoints are available at: https://github.com/zinengtang/TVLT
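
To make the two pretraining objectives the abstract describes more concrete, below is a minimal, hypothetical PyTorch sketch of masked autoencoding over video-frame and spectrogram patches combined with a video-audio contrastive loss. All class names, dimensions, the masking scheme, the pooling choice, and the temperature are illustrative assumptions, not the authors' implementation; the official code is available at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of TVLT-style pretraining, NOT the official implementation.
# A single ("homogeneous") transformer encodes both modalities; the only
# modality-specific components are the linear patch embeddings and decoders.

class TVLTSketch(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12,
                 patch_dim_v=3 * 16 * 16,   # flattened RGB frame patch
                 patch_dim_a=16 * 16):      # flattened spectrogram patch
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.embed_video = nn.Linear(patch_dim_v, dim)
        self.embed_audio = nn.Linear(patch_dim_a, dim)
        # Lightweight heads that reconstruct masked patches (masked autoencoding)
        self.decode_video = nn.Linear(dim, patch_dim_v)
        self.decode_audio = nn.Linear(dim, patch_dim_a)

    def forward(self, video_patches, audio_patches, mask_v, mask_a):
        # video_patches: (B, Nv, patch_dim_v); audio_patches: (B, Na, patch_dim_a)
        # mask_v / mask_a: boolean tensors marking the patches to reconstruct
        v = self.embed_video(video_patches)
        a = self.embed_audio(audio_patches)
        # Simplified masking: zero out masked tokens before joint encoding
        x = torch.cat([v * (~mask_v).unsqueeze(-1),
                       a * (~mask_a).unsqueeze(-1)], dim=1)
        h = self.encoder(x)
        hv, ha = h[:, :v.size(1)], h[:, v.size(1):]

        # Masked-autoencoding loss: reconstruct only the masked patches
        loss_mae = (
            F.mse_loss(self.decode_video(hv)[mask_v], video_patches[mask_v]) +
            F.mse_loss(self.decode_audio(ha)[mask_a], audio_patches[mask_a])
        )

        # Contrastive loss: align mean-pooled video and audio representations
        # within the batch (matching pairs on the diagonal of the logit matrix)
        zv = F.normalize(hv.mean(dim=1), dim=-1)
        za = F.normalize(ha.mean(dim=1), dim=-1)
        logits = zv @ za.t() / 0.07  # temperature is an illustrative choice
        targets = torch.arange(zv.size(0), device=zv.device)
        loss_con = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.t(), targets)) / 2
        return loss_mae + loss_con

# Illustrative usage: 2 clips, 196 video patches, 128 audio patches,
# with 75% of the patches masked (ratio is an assumption).
model = TVLTSketch()
vp = torch.randn(2, 196, 3 * 16 * 16)
ap = torch.randn(2, 128, 16 * 16)
mv = torch.rand(2, 196) < 0.75
ma = torch.rand(2, 128) < 0.75
loss = model(vp, ap, mv, ma)
loss.backward()
```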
