MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Paper Title

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Authors

Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu

Abstract

Grounding free-form textual queries requires understanding the textual phrases and their relation to visual cues in order to reason reliably about the described locations. Spatial attention networks are known to learn this relationship and to focus their gaze on salient objects in the image. We therefore propose to use spatial attention networks for image-level visual-textual fusion that preserves both local (word) and global (phrase) information, in order to refine region proposals with an in-network Region Proposal Network (RPN) and to detect single or multiple regions for a phrase query. We focus only on the pair of phrase query and ground truth (the referring expression), so that the model is independent of dataset-specific constraints such as additional attributes or context. On one such referring expression dataset, ReferIt Game, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state of the art. Without the context from image captions and the attribute information in Flickr30k Entities, we still achieve results competitive with the state of the art.
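To make the idea of spatial attention over an image concrete, the sketch below scores every cell of a small feature grid against a phrase vector and normalizes the scores with a softmax. It is only a generic dot-product attention written in plain Python; the abstract does not specify MAGNet's actual attention formulation, and the function name `spatial_attention`, the toy feature grid, and the query vector are all assumptions made for illustration.

```python
import math

def spatial_attention(feature_map, query):
    """Generic soft attention over an H x W grid of D-dim features.

    Hypothetical helper, not the paper's architecture. Returns
    (weights, attended): weights[i][j] is the softmax-normalized dot
    product between the feature at cell (i, j) and the query, and
    attended is the weight-averaged feature vector.
    """
    scores = [[sum(f * q for f, q in zip(cell, query)) for cell in row]
              for row in feature_map]
    # Subtract the max score before exponentiating for numerical stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]

    dim = len(query)
    attended = [0.0] * dim
    for row_w, row_f in zip(weights, feature_map):
        for w, cell in zip(row_w, row_f):
            for k in range(dim):
                attended[k] += w * cell[k]
    return weights, attended

# Toy 2 x 2 grid of 2-dim features; the query aligns best with cell (0, 1).
feature_map = [[[0.0, 0.0], [3.0, 0.0]],
               [[0.0, 1.0], [0.0, 0.0]]]
query = [1.0, 0.0]
weights, attended = spatial_attention(feature_map, query)
```

In this toy case the largest weight falls on the cell whose feature best matches the query, which is the sense in which an attention map "focuses its gaze" on a region. A grounding model would go on to use such a map to refine region proposals, a step this sketch does not attempt.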
