Paper Title

MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering

Paper Authors

Aisha Urooj Khan, Amir Mazaheri, Niels da Vitoria Lobo, Mubarak Shah

Paper Abstract

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA), ensuring both individual and combined processing of multiple input modalities. Our approach benefits from processing multimodal data (video and text) by adopting BERT encodings individually and using a novel transformer-based fusion method to fuse them together. Our method decomposes the different sources of modalities into separate BERT instances with similar architectures but variable weights, and achieves SOTA results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA that strictly requires knowledge of the visual (V) modality based on a human annotator's judgment. This set of questions helps us study the model's behavior and the challenges TVQA poses to prevent superhuman performance. Extensive experiments show the effectiveness and superiority of our method.
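
For concreteness, the sketch below illustrates the architecture the abstract describes: one BERT encoder per modality stream (question+answer, visual concepts, subtitles) whose pooled outputs are fused by a small transformer encoder. This is a minimal illustration under stated assumptions, not the authors' released implementation: the module names, the use of Hugging Face's BertModel, the learnable fusion token, the fusion depth, and the per-answer scoring head are all assumptions for illustration.

```python
# Minimal sketch of the MMFT-BERT idea: separate BERT instances per
# modality, fused by a transformer encoder. Names/dimensions are
# assumptions for illustration, not the paper's actual code.
import torch
import torch.nn as nn
from transformers import BertModel


class MMFTSketch(nn.Module):
    def __init__(self, num_fusion_layers=2, hidden=768):
        super().__init__()
        # One BERT instance per modality stream: same architecture,
        # independently fine-tuned weights (per the abstract).
        self.q_bert = BertModel.from_pretrained("bert-base-uncased")
        self.v_bert = BertModel.from_pretrained("bert-base-uncased")
        self.s_bert = BertModel.from_pretrained("bert-base-uncased")
        # Transformer-based fusion over the per-modality encodings.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_fusion_layers)
        # Learnable token that aggregates the fused representation
        # (an assumption, analogous to a [CLS] token).
        self.fuse_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.classifier = nn.Linear(hidden, 1)  # score for one QA candidate

    def forward(self, q_ids, v_ids, s_ids):
        # Pooled ([CLS]) output of each modality-specific encoder.
        q = self.q_bert(q_ids).pooler_output  # (B, hidden)
        v = self.v_bert(v_ids).pooler_output  # (B, hidden)
        s = self.s_bert(s_ids).pooler_output  # (B, hidden)
        tokens = torch.stack([q, v, s], dim=1)            # (B, 3, hidden)
        fuse = self.fuse_token.expand(tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([fuse, tokens], 1))  # (B, 4, hidden)
        # Score from the fusion token; in multiple-choice VQA (e.g. TVQA),
        # one such score per candidate answer would be softmaxed.
        return self.classifier(fused[:, 0])
```

In a multiple-choice setting such as TVQA, the model would be run once per candidate answer (with the answer appended to the question tokens) and the resulting scores compared with a softmax over candidates.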
