Paper Title
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
Paper Authors
Paper Abstract
As an important and challenging problem in vision-language tasks, referring expression comprehension (REC) generally requires rich multi-grained information from both the visual and linguistic modalities to achieve accurate reasoning. In addition, due to the diversity of visual scenes and the variation of linguistic expressions, some hard examples carry much more abundant multi-grained information than others. How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task. To address the aforementioned challenges, in this paper we propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves language-to-vision localization ability through innovations in both the network structure and the learning mechanism. Concretely, we design a transformer-based multi-grained cross-modal attention that effectively exploits the multi-grained information inherent in the visual and linguistic encoders. Furthermore, considering the large variance among samples, we propose a self-paced sample informativeness learning scheme that adaptively strengthens network learning on samples containing abundant multi-grained information. The proposed framework significantly outperforms state-of-the-art methods on the widely used RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame datasets, demonstrating the effectiveness of our method.
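To make the two named mechanisms concrete, below is a minimal PyTorch sketch of (a) language tokens attending to visual features drawn from several encoder granularities and (b) a self-paced per-sample loss weight that emphasizes more informative (higher-loss) samples as training matures. This is not the authors' released implementation; all class, function, and parameter names (CrossModalAttention, self_paced_weights, age, the token counts) are illustrative assumptions, and the weighting rule is only one plausible instantiation of self-paced learning.

# Sketch of multi-grained cross-modal attention + self-paced sample weighting.
# Assumed names and shapes are for illustration only.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Language queries attend to visual features from several encoder stages."""
    def __init__(self, dim=256, num_heads=8, num_grains=3):
        super().__init__()
        # One projection per visual granularity (e.g., backbone stages).
        self.visual_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_grains)])
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lang_feats, visual_feats_per_grain):
        # lang_feats: (B, L, D); each visual tensor: (B, N_i, D)
        projected = [proj(v) for proj, v in zip(self.visual_proj, visual_feats_per_grain)]
        visual = torch.cat(projected, dim=1)   # concatenate grains along the token axis
        fused, _ = self.attn(lang_feats, visual, visual)
        return fused                            # (B, L, D)

def self_paced_weights(per_sample_loss, age=1.0):
    # Hypothetical self-paced rule: samples with above-average loss (treated as
    # more informative) receive larger weights; `age` grows over epochs so the
    # emphasis on hard samples increases as training progresses.
    with torch.no_grad():
        return torch.sigmoid(age * (per_sample_loss - per_sample_loss.mean()))

# Usage: weight the per-sample localization loss before reduction.
B, L, D = 4, 12, 256
model = CrossModalAttention()
lang = torch.randn(B, L, D)
grains = [torch.randn(B, n, D) for n in (49, 196, 784)]  # e.g., three feature maps
fused = model(lang, grains)
loss_per_sample = fused.pow(2).mean(dim=(1, 2))          # stand-in for a real REC loss
loss = (self_paced_weights(loss_per_sample) * loss_per_sample).mean()

Note that classic self-paced learning down-weights hard samples early and admits them gradually; the variant sketched here instead follows the abstract's stated goal of enhancing learning on information-rich samples, so the sign and schedule of the weighting are design choices, not details given in the paper.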