Paper Title
Dual-path CNN with Max Gated block for Text-Based Person Re-identification
Paper Authors
Paper Abstract
Text-based person re-identification (Re-ID) is an important task in video surveillance, which consists of retrieving the images of a person from a large gallery given a textual description. It is difficult to directly match visual content with textual descriptions due to modality heterogeneity. On the one hand, the textual embeddings are not discriminative enough, which originates from the high abstraction of the textual descriptions. On the other hand, global average pooling (GAP) is commonly used to implicitly extract more general, smoothed features, but it ignores salient local features, which are more important for the cross-modal matching problem. With that in mind, a novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings and to make the visual-textual association focus on the salient features of both modalities. The proposed framework is based on two deep residual CNNs jointly optimized with the cross-modal projection matching (CMPM) loss and the cross-modal projection classification (CMPC) loss to embed the two modalities into a joint feature space. First, the pre-trained language model BERT is combined with a convolutional neural network (CNN) to learn better word embeddings for the text-to-image matching domain. Second, a global max pooling (GMP) layer is applied to make the visual-textual features focus more on the salient parts. To further alleviate the noise in the max-pooled features, a gated block (GB) is proposed that produces an attention map focusing on the meaningful features of both modalities. Finally, extensive experiments are conducted on the benchmark dataset CUHK-PEDES, on which our approach achieves a rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
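The GAP-vs-GMP contrast and the gating idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the gate here is a hypothetical 1x1 channel projection followed by a sigmoid (the weights `w` and the function names are assumptions for illustration), showing how a per-location attention map in (0, 1) can damp noisy activations before max pooling picks the strongest response per channel.

```python
import numpy as np

def global_avg_pool(feat):
    """Average every spatial location per channel: (C, H, W) -> (C,).
    Produces smoothed, general features, as noted in the abstract."""
    return feat.mean(axis=(1, 2))

def global_max_pool(feat):
    """Keep only the single strongest activation per channel: (C, H, W) -> (C,).
    Emphasizes salient local responses."""
    return feat.max(axis=(1, 2))

def gated_block(feat, w):
    """Hypothetical gate: a 1x1 projection across channels followed by a
    sigmoid yields a per-location attention map in (0, 1), which rescales
    the features before pooling. feat: (C, H, W); w: (C,) projection weights."""
    logits = np.einsum('c,chw->hw', w, feat)        # (H, W) projection
    gate = 1.0 / (1.0 + np.exp(-logits))            # sigmoid attention map
    return feat * gate                              # broadcast over channels

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 3, 3))   # toy feature map: 4 channels, 3x3 spatial
w = rng.normal(size=(4,))

pooled = global_max_pool(gated_block(feat, w))
print(pooled.shape)                 # one value per channel: (4,)
```

Because the sigmoid gate lies in (0, 1), the gated features never exceed the originals in magnitude, which is what lets the block suppress noisy max-pooled activations without discarding the salient ones.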