Paper title
A Generalist Framework for Panoptic Segmentation of Images and Videos
Paper authors
Paper abstract
Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. Because any permutation of the instance IDs is also a valid solution, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive biases of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and a generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively with state-of-the-art specialist methods in similar settings.
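The one-to-many nature of the task mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: the toy mask, the helper `same_partition`, and the list `relabelings` are hypothetical names introduced here only to show that every permutation of the instance IDs describes the same grouping of pixels, so a single image admits many equally valid label maps.

```python
from itertools import permutations


def same_partition(a, b):
    """Return True if two instance-ID maps group pixels identically,
    i.e. they differ only by a one-to-one renaming of the IDs."""
    if len(a) != len(b):
        return False
    fwd, bwd = {}, {}
    for x, y in zip(a, b):
        # Each ID in `a` must map to exactly one ID in `b`, and vice versa.
        if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
            return False
    return True


# Hypothetical 2x3 image flattened row by row; each value is an instance ID.
mask = [0, 0, 1,
        2, 2, 1]

# Build every relabeling obtained by permuting the instance IDs.
ids = sorted(set(mask))
relabelings = []
for perm in permutations(ids):
    mapping = dict(zip(ids, perm))
    relabelings.append([mapping[v] for v in mask])

# All relabelings are distinct as arrays, yet each encodes the same segmentation.
assert all(same_partition(mask, r) for r in relabelings)
print(len(relabelings), "valid label maps for one segmentation")
```

With K instances there are K! such label maps, which is one way to see why a loss that compares predicted IDs to a single fixed target pixel by pixel does not fit the task directly.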