Extending Phrase Grounding with Pronouns in Visual Dialogues

Paper Title

Extending Phrase Grounding with Pronouns in Visual Dialogues

Authors

Panzhong Lu, Xin Zhang, Meishan Zhang, Min Zhang

Abstract

Conventional phrase grounding aims to localize the noun phrases mentioned in a given caption to their corresponding image regions, and it has achieved great success recently. However, grounding noun phrases alone is not enough for cross-modal visual language understanding. Here we extend the task by considering pronouns as well. First, we construct a phrase grounding dataset in which both noun phrases and pronouns are linked to image regions. On this dataset, we evaluate phrase grounding performance with a state-of-the-art model from this line of work. Then, we enhance the baseline grounding model with coreference information, which should potentially help our task, modeling the coreference structures with graph convolutional networks. Interestingly, experiments on our dataset show that pronouns are easier to ground than noun phrases, possibly because these pronouns are much less ambiguous. Additionally, our final model with coreference information significantly boosts the grounding performance of both noun phrases and pronouns.
