Paper Title


Visio-Linguistic Brain Encoding

Authors

Subba Reddy Oota, Jashn Arora, Vijay Rowtula, Manish Gupta, Raju S. Bapi

Abstract

Enabling effective brain-computer interfaces requires understanding how the human brain encodes stimuli across modalities such as visual, language (or text), etc. Brain encoding aims at constructing fMRI brain activity given a stimulus. There exists a plethora of neural encoding models which study brain encoding for single mode stimuli: visual (pretrained CNNs) or text (pretrained language models). Few recent papers have also obtained separate visual and text representation models and performed late-fusion using simple heuristics. However, previous work has failed to explore: (a) the effectiveness of image Transformer models for encoding visual stimuli, and (b) co-attentive multi-modal modeling for visual and text reasoning. In this paper, we systematically explore the efficacy of image Transformers (ViT, DEiT, and BEiT) and multi-modal Transformers (VisualBERT, LXMERT, and CLIP) for brain encoding. Extensive experiments on two popular datasets, BOLD5000 and Pereira, provide the following insights. (1) To the best of our knowledge, we are the first to investigate the effectiveness of image and multi-modal Transformers for brain encoding. (2) We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs, image Transformers as well as other previously proposed multi-modal models, thereby establishing new state-of-the-art. The supremacy of visio-linguistic models raises the question of whether the responses elicited in the visual regions are affected implicitly by linguistic processing even when passively viewing images. Future fMRI tasks can verify this computational insight in an appropriate experimental setting.
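Brain encoding models of the kind described in the abstract typically extract stimulus features from a pretrained network (e.g., ViT or VisualBERT) and fit a regularized linear map from those features to per-voxel fMRI responses, evaluated by per-voxel Pearson correlation. The abstract does not spell out the regression details, so the following is a minimal sketch under the common assumption of closed-form ridge regression, using synthetic stand-in data in place of real Transformer features and fMRI recordings:

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha*I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# Synthetic stand-ins: 100 stimuli, 768-dim features, 50 voxels.
# In a real pipeline, X would hold pretrained-model features per stimulus
# and Y the measured fMRI voxel responses for the same stimuli.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 768))
W_true = rng.standard_normal((768, 50))
Y = X @ W_true + 0.01 * rng.standard_normal((100, 50))

W = fit_ridge(X, Y, alpha=1.0)
Y_pred = X @ W

# Per-voxel Pearson correlation, a standard encoding-model metric.
corrs = np.array([np.corrcoef(Y[:, v], Y_pred[:, v])[0, 1]
                  for v in range(Y.shape[1])])
print(f"mean voxel correlation: {corrs.mean():.3f}")
```

In practice the regularization strength `alpha` is chosen per voxel by cross-validation, and correlations are reported on held-out stimuli rather than the training set as in this toy illustration.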
