Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training

Paper Title

Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training

Authors

Chen, Zhihong; Du, Yuhao; Hu, Jinpeng; Liu, Yang; Li, Guanbin; Wan, Xiang; Chang, Tsung-Hui

Abstract

Medical vision-and-language pre-training provides a feasible way to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. Three key designs make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, with a considerably larger masking ratio for images. Second, we use visual and textual features from different layers to perform the reconstruction, to deal with the different levels of abstraction in vision and language. Third, we develop different designs for the vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark comprising three tasks. Experimental results demonstrate the effectiveness of our approach, which achieves state-of-the-art results on all downstream tasks. In addition, we conduct further analysis to verify the effectiveness of the different components of our approach and of various pre-training settings. The source code is available at https://github.com/zhjohnchan/M3AE.
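
The first design choice in the abstract, masking images more heavily than text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name random_mask, the two ratio constants, and the patch and token counts are all hypothetical values chosen for illustration, and the abstract itself does not state the actual ratios.

```python
import random


def random_mask(num_items, mask_ratio, rng):
    """Pick round(num_items * mask_ratio) distinct indices uniformly at random."""
    num_masked = int(round(num_items * mask_ratio))
    return sorted(rng.sample(range(num_items), num_masked))


# Illustrative values only: assumptions, not taken from the abstract.
IMAGE_MASK_RATIO = 0.75  # images are masked much more heavily than text
TEXT_MASK_RATIO = 0.15

rng = random.Random(0)   # fixed seed so the sketch is reproducible
num_patches = 196        # hypothetical: a 14 x 14 grid of image patches
num_tokens = 40          # hypothetical: length of a tokenized report

masked_patches = random_mask(num_patches, IMAGE_MASK_RATIO, rng)
masked_tokens = random_mask(num_tokens, TEXT_MASK_RATIO, rng)

print(len(masked_patches), "of", num_patches, "patches masked")
print(len(masked_tokens), "of", num_tokens, "tokens masked")
```

In a full pipeline, the masked patch and token positions would be hidden from the encoder and the decoders would be trained to reconstruct the missing pixels and tokens; that part is omitted here.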
