Paper Title
Negative Frames Matter in Egocentric Visual Query 2D Localization
Paper Authors
Paper Abstract
The recently released Ego4D dataset and benchmark significantly scale and diversify first-person visual perception data. In Ego4D, the Visual Queries 2D Localization task aims to retrieve objects that appeared in the past from a recording in the first-person view. This task requires a system to spatially and temporally localize the most recent appearance of a given object query, where the query is registered by a single tight visual crop of the object in a different scene. Our study is based on the three-stage baseline introduced in the Episodic Memory benchmark. The baseline solves the problem by detection and tracking: detect similar objects in all frames, then run a tracker from the most confident detection result. In the VQ2D challenge, we identified two limitations of the current baseline. (1) The training configuration has redundant computation. Although the training set has millions of instances, most of them are repetitive, and the number of unique objects is only around 14.6k. Repeated gradient computation over the same objects leads to inefficient training. (2) The false positive rate is high on background frames. This is due to the distribution gap between training and evaluation: during training, the model only sees clean, stable, and labeled frames, but egocentric videos also contain noisy, blurry, or unlabeled background frames. To this end, we developed a more efficient and effective solution. Concretely, we cut the training loop from ~15 days to less than 24 hours, and we achieve 0.17% spatial-temporal AP, which is 31% higher than the baseline. Our solution ranked first on the public leaderboard. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr.
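To make the three-stage pipeline concrete, below is a minimal Python sketch of the detect-then-track procedure described in the abstract. The detector and tracker interfaces here (`score_frame`, `make_tracker`, `tracker.update`) are hypothetical stand-ins for illustration, not the API of the released vq2d_cvpr code.

```python
# Minimal sketch of the detect-then-track baseline. The detector/tracker
# interfaces are hypothetical stand-ins, not the released vq2d_cvpr API.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def localize_query(
    frames: List["Frame"],
    query_crop: "Image",
    detector: "SimilarityDetector",
    make_tracker,
) -> Dict[int, Box]:
    """Return per-frame boxes for the most recent appearance of the query."""
    # Stage 1: score candidate boxes in every frame against the query crop.
    detections: List[Tuple[int, Box, float]] = []
    for idx, frame in enumerate(frames):
        for box, score in detector.score_frame(frame, query_crop):
            detections.append((idx, box, score))
    if not detections:
        return {}

    # Stage 2: seed tracking at the most confident detection. False positives
    # on background frames corrupt this step, which is what motivates adding
    # negative frames to training.
    seed_idx, seed_box, _ = max(detections, key=lambda d: d[2])

    # Stage 3: track backward and forward from the seed until the tracker
    # loses the object, recovering the temporal extent of the appearance.
    boxes: Dict[int, Box] = {seed_idx: seed_box}
    for step in (-1, +1):
        tracker = make_tracker(frames[seed_idx], seed_box)
        idx = seed_idx + step
        while 0 <= idx < len(frames):
            box, found = tracker.update(frames[idx])
            if not found:
                break
            boxes[idx] = box
            idx += step
    return boxes
```

The sketch highlights why stage 2 is the failure point the paper targets: a single high-scoring false positive on a background frame seeds the tracker in the wrong place, so reducing background false positives directly improves the final spatio-temporal localization.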