Paper Title

Benchmarking Unsupervised Object Representations for Video Sequences

Authors

Marissa A. Weis, Kashyap Chitta, Yash Sharma, Wieland Brendel, Matthias Bethge, Andreas Geiger, Alexander S. Ecker

Abstract

Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models were evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of objects. To close this gap, we design a benchmark with four data sets of varying complexity and seven additional test sets featuring challenging tracking scenarios relevant for natural videos. Using this benchmark, we compare the perceptual abilities of four object-centric approaches: ViMON, a video-extension of MONet, based on recurrent spatial attention, OP3, which exploits clustering via spatial mixture models, as well as TBA and SCALOR, which use explicit factorization via spatial transformers. Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking than the spatial transformer based architectures. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios despite their synthetic nature, suggesting that our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
