Title
Weakly-supervised segmentation of referring expressions
Authors
Abstract
Visual grounding localizes regions (boxes or segments) in an image that correspond to a given referring expression. In this work, we address image segmentation from referring expressions, a problem that has so far only been tackled in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise annotations and is hard to scale given the cost of manual labeling. We therefore introduce the new task of weakly-supervised image segmentation from referring expressions and propose Text grounded semantic SEGmentation (TSEG), which learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our transformer-based method computes patch-text similarities and guides the classification objective during training with a new multi-label patch assignment mechanism. The resulting visual grounding model segments image regions corresponding to given natural language expressions. TSEG demonstrates promising results for weakly-supervised referring expression segmentation on the challenging PhraseCut and RefCOCO datasets, and shows competitive performance when evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.
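The core idea of supervising patch-level similarities with only image-level labels can be illustrated with a minimal sketch. The cosine-similarity scoring and the log-sum-exp pooling below are illustrative assumptions, not TSEG's exact multi-label patch assignment; they only show how per-patch scores can be aggregated into a single image-level score that a classification loss can supervise.

```python
import numpy as np

def patch_text_scores(patch_emb, text_emb):
    """Cosine similarity between each image patch and one expression embedding.

    patch_emb: (N, D) patch embeddings from a vision transformer.
    text_emb:  (D,) embedding of the referring expression.
    Returns an (N,) array of per-patch similarity scores.
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return p @ t

def soft_pool(scores, temperature=10.0):
    """Smooth maximum (log-mean-exp) over patch scores.

    Collapses per-patch scores into one image-level score, so an
    image-level label ("this expression is present") can drive training
    while gradients still flow to the most similar patches.
    """
    return np.log(np.mean(np.exp(temperature * scores))) / temperature

# Toy example with random embeddings (hypothetical dimensions).
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))   # 16 patches, 8-dim embeddings
text = rng.normal(size=8)            # one expression embedding
s = patch_text_scores(patches, text)
image_score = soft_pool(s)
```

At test time, the same per-patch scores can be thresholded directly to form a segmentation mask for the expression, which is why no pixel-level annotation is needed during training.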