Paper Title

Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus

Authors

Li, Xiang; Wang, Jinglu; Xu, Xiaohao; Li, Xiao; Raj, Bhiksha; Lu, Yan

Abstract

Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression. Most existing R-VOS methods rely on a critical assumption: the object referred to must appear in the video. This assumption, which we refer to as semantic consensus, is often violated in real-world scenarios, where the expression may be queried against false videos. In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches. Accordingly, we propose an extended task called Robust R-VOS, which accepts unpaired video-text inputs. We tackle this problem by jointly modeling the primary R-VOS problem and its dual (text reconstruction). A structural text-to-text cycle constraint is introduced to discriminate semantic consensus between video-text pairs and impose it in positive pairs, thereby achieving multi-modal alignment from both positive and negative pairs. Our structural constraint effectively addresses the challenge posed by linguistic diversity, overcoming the limitations of previous methods that relied on a point-wise constraint. A new evaluation dataset, R²-Youtube-VOS, is constructed to measure model robustness. Our model achieves state-of-the-art performance on the R-VOS benchmarks Ref-DAVIS17 and Ref-Youtube-VOS, as well as on our R²-Youtube-VOS dataset.
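
The contrast the abstract draws between a point-wise and a structural cycle constraint can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the use of cosine similarity, and the squared-error gaps are hypothetical, and the embeddings are hand-picked 2-D vectors. The idea shown is only that a reconstruction can differ from the original text embeddings point by point while still preserving the relations among them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def relation_matrix(embs):
    """Pairwise cosine similarities among a set of embeddings."""
    return [[cosine(u, v) for v in embs] for u in embs]

def pointwise_gap(orig, recon):
    """Hypothetical point-wise measure: mean squared distance between
    each original embedding and its reconstruction."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in zip(orig, recon)
    )
    return total / len(orig)

def relational_gap(orig, recon):
    """Hypothetical structural measure: mean squared difference between
    the relation matrices of the original and reconstructed embeddings."""
    r0, r1 = relation_matrix(orig), relation_matrix(recon)
    n = len(orig)
    return sum(
        (r0[i][j] - r1[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)

# Toy "text" embeddings (assumed values, not real model outputs).
original = [[1.0, 0.0], [0.0, 1.0]]
# Reconstruction that is rotated by 90 degrees: every point moved,
# but the relations between the embeddings are unchanged.
rotated = [[0.0, 1.0], [-1.0, 0.0]]
# Reconstruction that collapses both embeddings onto one vector:
# the relations between the embeddings are lost.
collapsed = [[1.0, 0.0], [1.0, 0.0]]

print(pointwise_gap(original, rotated), relational_gap(original, rotated))
print(pointwise_gap(original, collapsed), relational_gap(original, collapsed))
```

In this toy setup the rotated reconstruction is penalized by the point-wise measure but not by the relational one, while the collapsed reconstruction is penalized by both. This is one way to read the abstract's claim that a structural constraint tolerates linguistic diversity better than a point-wise one; the paper's actual formulation may differ.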
