Paper Title
MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting
Paper Authors
Paper Abstract
RGB-Thermal (RGB-T) crowd counting is a challenging task that uses thermal images as complementary information to RGB images to counteract the degraded performance of unimodal RGB-based methods in scenes with low illumination or similar backgrounds. Most existing methods propose well-designed structures for cross-modal fusion in RGB-T crowd counting. However, these methods struggle to encode cross-modal contextual semantic information in RGB-T image pairs. To address this problem, we propose a two-stream RGB-T crowd counting network called the Multi-Attention Fusion Network (MAFNet), which aims to fully capture long-range contextual information from both the RGB and thermal modalities based on the attention mechanism. Specifically, in the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion at the global level. In addition, a Multi-modal Multi-scale Aggregation (MMA) regression head is introduced to make full use of the multi-scale and contextual information across modalities to generate high-quality crowd density maps. Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting and achieves state-of-the-art performance.
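The abstract does not give implementation details, but the core idea of attention-based cross-modal fusion between the two encoder branches can be illustrated with a minimal PyTorch sketch. Everything below (the class name `CrossModalAttentionFusion`, the use of `nn.MultiheadAttention`, and the residual/normalization layout) is a hypothetical reading of such a module, not the authors' actual MAF design:

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Hypothetical sketch of attention-based cross-modal fusion.

    Each modality's feature map is flattened into a token sequence and
    attends to the other modality via multi-head attention, capturing
    long-range cross-modal context at the global level.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.rgb_from_thermal = nn.MultiheadAttention(
            channels, num_heads, batch_first=True)
        self.thermal_from_rgb = nn.MultiheadAttention(
            channels, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(channels)
        self.norm_thermal = nn.LayerNorm(channels)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # rgb, thermal: (B, C, H, W) feature maps from the two branches.
        b, c, h, w = rgb.shape
        r = rgb.flatten(2).transpose(1, 2)      # (B, H*W, C) tokens
        t = thermal.flatten(2).transpose(1, 2)  # (B, H*W, C) tokens

        # Each branch queries the other modality, then adds the
        # attended context residually and normalizes.
        r_out, _ = self.rgb_from_thermal(r, t, t)
        t_out, _ = self.thermal_from_rgb(t, r, r)
        r = self.norm_rgb(r + r_out)
        t = self.norm_thermal(t + t_out)

        # Reshape back to spatial maps for the next encoder stage.
        rgb_fused = r.transpose(1, 2).reshape(b, c, h, w)
        thermal_fused = t.transpose(1, 2).reshape(b, c, h, w)
        return rgb_fused, thermal_fused


if __name__ == "__main__":
    maf = CrossModalAttentionFusion(channels=256)
    rgb_feat = torch.randn(2, 256, 24, 32)
    thermal_feat = torch.randn(2, 256, 24, 32)
    fused_rgb, fused_thermal = maf(rgb_feat, thermal_feat)
    print(fused_rgb.shape, fused_thermal.shape)  # (2, 256, 24, 32) each
```

Returning fused features for both branches, rather than a single merged map, matches the abstract's description of embedding the fusion module at multiple stages of two modality-specific streams; each stream can continue encoding after every fusion step.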