Paper Title
A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection
Paper Authors
Paper Abstract
Because we live in a dynamic world, humans tend to discover objects by learning from a group of images or several frames of video. In the computer vision area, many studies focus on co-segmentation (CoS), co-saliency detection (CoSD) and video salient object detection (VSOD) to discover the co-occurring objects. However, previous approaches design separate networks for these similar tasks, and the resulting models are difficult to transfer to one another, which lowers the upper bound of the transferability of deep learning frameworks. Moreover, they fail to take full advantage of the inter- and intra-feature cues within a group of images. In this paper, we introduce a unified framework to tackle these issues, termed UFO (Unified Framework for Co-Object Segmentation). Specifically, we first introduce a transformer block, which views the image features as patch tokens and then captures their long-range dependencies through the self-attention mechanism. This helps the network excavate the patch-level structural similarities among the relevant objects. Furthermore, we propose an intra-MLP learning module to produce a self-mask that enhances the network and avoids partial activation. Extensive experiments on four CoS benchmarks (PASCAL, iCoseg, Internet and MSRC), three CoSD benchmarks (Cosal2015, CoSOD3k and CoCA) and four VSOD benchmarks (DAVIS16, FBMS, ViSal and SegV2) show that our method outperforms the state of the art on all three tasks in both accuracy and speed while using the same network architecture, reaching 140 FPS in real time.
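The abstract describes two components: a transformer block that flattens the features of a whole image group into patch tokens and applies self-attention across them, and an intra-MLP module that produces a self-mask from intra-image feature similarities. The PyTorch sketch below is only an illustration inferred from that description, not the authors' released implementation; the class names (GroupTransformerBlock, IntraMLPMask), the feature dimensions, and the k-nearest-neighbour aggregation step are assumptions.

# Illustrative sketch of group-wide self-attention over patch tokens and an
# intra-image MLP self-mask. All names and dimensions are assumptions.
import torch
import torch.nn as nn


class GroupTransformerBlock(nn.Module):
    """Self-attention over the patch tokens of an entire image group."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, C, H, W) backbone features of the N images in one group.
        n, c, h, w = feats.shape
        # Flatten all images of the group into one joint token sequence so that
        # attention can relate patches across different images.
        tokens = feats.flatten(2).permute(0, 2, 1).reshape(1, n * h * w, c)
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.reshape(n, h * w, c).permute(0, 2, 1).reshape(n, c, h, w)


class IntraMLPMask(nn.Module):
    """Produce a per-image self-mask from intra-image feature similarities."""

    def __init__(self, dim: int = 512, k: int = 8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        n, c, h, w = feats.shape
        tokens = feats.flatten(2).permute(0, 2, 1)              # (N, HW, C)
        sim = torch.einsum("bic,bjc->bij", tokens, tokens)     # intra-image affinities
        topk = sim.topk(self.k, dim=-1).indices                 # k most similar patches
        gathered = torch.gather(
            tokens.unsqueeze(1).expand(-1, h * w, -1, -1), 2,
            topk.unsqueeze(-1).expand(-1, -1, -1, c))           # (N, HW, k, C)
        agg = gathered.mean(dim=2) + tokens                     # aggregate neighbours
        mask = torch.sigmoid(self.mlp(agg))                     # (N, HW, 1)
        return mask.permute(0, 2, 1).reshape(n, 1, h, w)

As a usage note, such a mask could be multiplied element-wise with the decoder features of each image so that regions supported by many similar patches stay activated, which matches the abstract's goal of avoiding partial activation; the exact fusion strategy is not specified in the abstract.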