Paper Title
Less than Few: Self-Shot Video Instance Segmentation
Paper Authors
Paper Abstract
The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. Rather than performing few-shot learning with a human oracle to provide a few densely labelled support videos, we propose to automatically learn to find appropriate support videos given a query. We call this self-shot learning and we outline a simple self-supervised learning method to generate an embedding space well suited for unsupervised retrieval of relevant samples. To showcase this novel setting, we tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting, where the goal is to segment instances at the pixel level across the spatial and temporal domains. We provide strong baseline performances that utilize a novel transformer-based model and show that self-shot learning can even surpass few-shot learning and can be positively combined with it for further performance gains. Experiments on new benchmarks show that our approach achieves strong performance, is competitive with oracle support in some settings, scales to large unlabelled video collections, and can be combined in a semi-supervised setting.
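As a rough illustration of the self-shot retrieval idea described in the abstract (not the authors' implementation), the sketch below assumes that a self-supervised encoder has already produced fixed-dimensional embeddings for the query video and for a large unlabelled gallery; the function name `retrieve_support` and the embedding dimensionality are hypothetical choices for this example. Given those embeddings, support selection reduces to nearest-neighbour search by cosine similarity.

```python
# Minimal sketch (hypothetical, not the paper's code): select "self-shot"
# support videos for a query by cosine similarity in an embedding space
# learned with self-supervision.
import numpy as np

def retrieve_support(query_emb: np.ndarray,
                     gallery_embs: np.ndarray,
                     k: int = 5) -> np.ndarray:
    """Return indices of the k gallery videos closest to the query.

    query_emb:    (d,)   embedding of the query video
    gallery_embs: (n, d) embeddings of the unlabelled video collection
    """
    # L2-normalise so that dot products equal cosine similarities.
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    g = gallery_embs / (np.linalg.norm(gallery_embs, axis=1, keepdims=True) + 1e-8)
    sims = g @ q                      # (n,) similarity of each gallery video to the query
    return np.argsort(-sims)[:k]      # indices of the top-k support candidates

# Usage with dummy embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 256))   # 1000 unlabelled videos, 256-d embeddings
query = rng.standard_normal(256)
support_idx = retrieve_support(query, gallery, k=5)
print(support_idx)
```

The retrieved videos would then play the role that human-labelled support videos play in the few-shot setting, feeding the downstream video instance segmentation model without any oracle annotation at run time.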