Paper Title
Dual Convolutional LSTM Network for Referring Image Segmentation
Paper Authors
Paper Abstract
We consider referring image segmentation, a problem at the intersection of computer vision and natural language understanding. Given an input image and a referring expression in the form of a natural language sentence, the goal is to segment the object of interest in the image referred to by the linguistic query. To tackle this problem, we propose a dual convolutional LSTM (ConvLSTM) network. Our model consists of an encoder network and a decoder network, where ConvLSTM is used in both to capture spatial and sequential information. The encoder network extracts visual and linguistic features for each word in the expression sentence, and adopts an attention mechanism to focus on words that are more informative in the multimodal interaction. The decoder network integrates the multi-level features generated by the encoder network as its input and produces the final precise segmentation mask. Experimental results on four challenging datasets demonstrate that the proposed network achieves superior segmentation performance compared with other state-of-the-art methods.
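The sketch below illustrates the overall architecture the abstract describes: an encoder ConvLSTM that scans the expression word by word over multimodal (visual + linguistic) features with per-word attention weights, and a decoder ConvLSTM that integrates the encoder outputs from multiple feature levels into a segmentation mask. This is a minimal illustration under assumed channel sizes, module names, and fusion details; it is not the authors' released implementation.

```python
# Minimal sketch of a dual ConvLSTM encoder-decoder for referring image
# segmentation. All names, dimensions, and the exact fusion scheme are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell: LSTM gates computed with 2D convolutions."""

    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

    def init_state(self, batch, hw, device):
        h = torch.zeros(batch, self.hid_ch, *hw, device=device)
        return h, h.clone()


class DualConvLSTMSketch(nn.Module):
    """Encoder ConvLSTM scans words; decoder ConvLSTM scans feature levels."""

    def __init__(self, vis_ch=64, word_dim=32, hid_ch=64):
        super().__init__()
        self.enc = ConvLSTMCell(vis_ch + word_dim, hid_ch)
        self.dec = ConvLSTMCell(hid_ch, hid_ch)
        self.attn = nn.Linear(word_dim, 1)   # per-word attention score
        self.head = nn.Conv2d(hid_ch, 1, 1)  # mask logits

    def forward(self, vis_feats, words):
        # vis_feats: list of (B, vis_ch, H_l, W_l) tensors, one per level
        # words:     (B, T, word_dim) word embeddings of the expression
        B, T, _ = words.shape
        # attention over words, weighting each multimodal recurrence step
        alpha = torch.softmax(self.attn(words).squeeze(-1), dim=1)  # (B, T)
        level_outs = []
        for feat in vis_feats:
            H, W = feat.shape[-2:]
            state = self.enc.init_state(B, (H, W), feat.device)
            for t in range(T):
                w = words[:, t].unsqueeze(-1).unsqueeze(-1).expand(-1, -1, H, W)
                x = torch.cat([feat, w], dim=1) * alpha[:, t].view(B, 1, 1, 1)
                state = self.enc(x, state)
            level_outs.append(state[0])  # final hidden map at this level
        # decoder ConvLSTM integrates encoder outputs level by level
        size = level_outs[-1].shape[-2:]
        state = self.dec.init_state(B, size, level_outs[0].device)
        for h_enc in level_outs:
            h_enc = F.interpolate(h_enc, size=size, mode="bilinear",
                                  align_corners=False)
            state = self.dec(h_enc, state)
        return self.head(state[0])  # (B, 1, H, W) segmentation logits


# Usage with dummy inputs: three feature levels, a 5-word expression.
if __name__ == "__main__":
    model = DualConvLSTMSketch()
    feats = [torch.randn(2, 64, s, s) for s in (32, 16, 8)]
    words = torch.randn(2, 5, 32)
    print(model(feats, words).shape)  # torch.Size([2, 1, 8, 8])
```

The two recurrences mirror the abstract's two roles for ConvLSTM: the encoder's recurrence runs over the word sequence (sequential information on a spatial grid), while the decoder's recurrence runs over feature levels, accumulating multi-level context before the final mask prediction.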