Paper Title

Episodic Memory Question Answering

Paper Authors

Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, Devi Parikh

Paper Abstract

Egocentric augmented reality devices such as wearable glasses passively capture visual data as a human wearer tours a home environment. We envision a scenario wherein the human communicates with an AI agent powering such a device by asking questions (e.g., "Where did you last see my keys?"). In order to succeed at this task, the egocentric AI assistant must (1) construct semantically rich and efficient scene memories that encode spatio-temporal information about objects seen during the tour, and (2) possess the ability to understand the question and ground its answer into the semantic memory representation. Towards that end, we introduce (1) a new task, Episodic Memory Question Answering (EMQA), wherein an egocentric AI assistant is provided with a video sequence (the tour) and a question as input and is asked to localize its answer to the question within the tour; (2) a dataset of grounded questions designed to probe the agent's spatio-temporal understanding of the tour; and (3) a model for the task that encodes the scene as an allocentric, top-down semantic feature map and grounds the question into the map to localize the answer. We show that our choice of episodic scene memory outperforms naive, off-the-shelf solutions for the task as well as a host of very competitive baselines, and is robust to noise in depth and pose as well as camera jitter. The project page can be found at: https://samyak-268.github.io/emqa.
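The abstract describes a pipeline in which egocentric frame features are aggregated into an allocentric, top-down semantic feature map and a question is grounded into that map to localize the answer. Below is a minimal, hypothetical PyTorch sketch of that idea, assuming back-projected pixel features and top-down grid coordinates (from depth and pose) are precomputed; the module names, dimensions, and dot-product grounding are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of the EMQA idea from the abstract:
# scatter per-frame features onto an allocentric top-down map, encode the question,
# and ground it into the map to produce a localization heatmap.
import torch
import torch.nn as nn


class TopDownGrounder(nn.Module):
    def __init__(self, feat_dim=128, vocab_size=1000, q_dim=128, map_size=64):
        super().__init__()
        self.map_size = map_size
        self.feat_dim = feat_dim
        # Simple question encoder (assumed architecture): embedding + GRU.
        self.embed = nn.Embedding(vocab_size, q_dim)
        self.rnn = nn.GRU(q_dim, feat_dim, batch_first=True)

    def project_to_map(self, frame_feats, world_xy):
        """Scatter per-pixel frame features onto a top-down grid.

        frame_feats: (N, feat_dim) features for N back-projected pixels
        world_xy:    (N, 2) integer grid coordinates (from depth + pose)
        Returns a (feat_dim, map_size, map_size) allocentric feature map,
        average-pooled where multiple pixels land in the same cell.
        """
        grid = torch.zeros(self.feat_dim, self.map_size, self.map_size)
        counts = torch.zeros(self.map_size, self.map_size)
        for feat, (x, y) in zip(frame_feats, world_xy):
            grid[:, y, x] += feat
            counts[y, x] += 1
        return grid / counts.clamp(min=1)

    def forward(self, frame_feats, world_xy, question_tokens):
        # 1) Build the episodic scene memory (top-down feature map).
        scene_map = self.project_to_map(frame_feats, world_xy)
        # 2) Encode the question into a single vector.
        _, q_hidden = self.rnn(self.embed(question_tokens))
        q_vec = q_hidden.squeeze(0).squeeze(0)       # (feat_dim,)
        # 3) Ground the question: per-cell dot-product relevance scores.
        scores = torch.einsum("d,dhw->hw", q_vec, scene_map)
        return scores.sigmoid()                      # (map_size, map_size) heatmap


if __name__ == "__main__":
    model = TopDownGrounder()
    feats = torch.randn(500, 128)                    # fake back-projected pixel features
    coords = torch.randint(0, 64, (500, 2))          # fake top-down grid coordinates
    question = torch.randint(0, 1000, (1, 8))        # fake tokenized question
    heatmap = model(feats, coords, question)
    print(heatmap.shape)                             # torch.Size([64, 64])
```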
