Paper Title

Look, Read and Ask: Learning to Ask Questions by Reading Text in Images

Paper Authors

Soumya Jahagirdar, Shankar Gangisetty, Anand Mishra

Paper Abstract

We present a novel problem of text-based visual question generation, or TextVQG for short. Given the recent growing interest of the document image analysis community in combining text understanding with conversational artificial intelligence, e.g., text-based visual question answering, TextVQG becomes an important task. TextVQG aims to generate a natural language question for a given input image and a piece of text automatically extracted from it (also known as an OCR token), such that the OCR token is the answer to the generated question. TextVQG is an essential ability for a conversational agent. However, it is challenging, as it requires an in-depth understanding of the scene and the ability to semantically bridge the visual content with the text present in the image. To address TextVQG, we present an OCR-consistent visual question generation model that Looks at the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question. We refer to our proposed model as OLRA. We perform an extensive evaluation of OLRA on two public benchmarks and compare it against baselines. OLRA automatically generates questions similar to those in manually curated public text-based visual question answering datasets. Moreover, it significantly outperforms baseline approaches on the performance measures popularly used in the text generation literature.
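The abstract fixes the task interface: the model receives an image together with one automatically extracted OCR token and must generate a question whose answer is that token. The sketch below is a minimal, hypothetical illustration of that interface, not the authors' OLRA implementation (whose architecture is not detailed in the abstract): it assumes pre-extracted pooled image features and a word embedding for the OCR token, fuses them ("look" and "read"), and decodes a question with an LSTM ("ask"). All layer choices, dimensions, and names are illustrative assumptions.

```python
# Illustrative TextVQG sketch (not the paper's OLRA model).
import torch
import torch.nn as nn

class TextVQGSketch(nn.Module):
    """Toy model: fuse image features with an OCR-token embedding,
    then decode a question whose intended answer is that OCR token."""
    def __init__(self, img_feat_dim=2048, token_emb_dim=300,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project image features (e.g. from a CNN) and the OCR-token
        # embedding (e.g. a word vector) into a shared space.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        self.tok_proj = nn.Linear(token_emb_dim, hidden_dim)
        # Question decoder: single-layer LSTM over word embeddings.
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, ocr_token_emb, question_in):
        # "Look" + "Read": fuse the two signals into the decoder's
        # initial hidden state, then "Ask" by decoding question words.
        fused = torch.tanh(self.img_proj(img_feats) + self.tok_proj(ocr_token_emb))
        h0 = fused.unsqueeze(0)            # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        emb = self.word_emb(question_in)   # (batch, seq, hidden)
        dec_out, _ = self.decoder(emb, (h0, c0))
        return self.out(dec_out)           # per-step vocabulary logits

# Shape check with random tensors (no trained weights involved).
model = TextVQGSketch()
img_feats = torch.randn(4, 2048)                 # pooled image features
ocr_token_emb = torch.randn(4, 300)              # embedding of the answer OCR token
question_in = torch.randint(0, 10000, (4, 12))   # teacher-forced question tokens
logits = model(img_feats, ocr_token_emb, question_in)
print(logits.shape)  # torch.Size([4, 12, 10000])
```

In this toy setup, training would maximize the likelihood of the ground-truth question given the image and OCR token; any additional consistency objectives used by OLRA are beyond what the abstract specifies.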
