Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions

Paper Title

Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions

Authors

Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, Xiao Xiang Zhu

Abstract

Visual crowd counting has been recently studied as a way to enable people counting in crowd scenes from images. Albeit successful, vision-based crowd counting approaches could fail to capture informative features in extreme conditions, e.g., imaging at night and occlusion. In this work, we introduce a novel task of audiovisual crowd counting, in which visual and auditory information are integrated for counting purposes. We collect a large-scale benchmark, named auDiovISual Crowd cOunting (DISCO) dataset, consisting of 1,935 images and the corresponding audio clips, and 170,270 annotated instances. In order to fuse the two modalities, we make use of a linear feature-wise fusion module that carries out an affine transformation on visual and auditory features. Finally, we conduct extensive experiments using the proposed dataset and approach. Experimental results show that introducing auditory information can benefit crowd counting under different illumination, noise, and occlusion conditions. Code and data have been made available.
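
The abstract describes the fusion step only as a linear, feature-wise affine transformation. One common reading of that phrase is that the audio embedding predicts a per-channel scale and shift that are applied to the visual feature map. The sketch below illustrates that reading in plain Python. The function names, the parameter names (`w_gamma`, `b_gamma`, `w_beta`, `b_beta`), and the choice to let audio modulate the visual features are all assumptions made for illustration; the paper's actual module may differ.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]
FeatureMap = List[Matrix]  # channels x height x width


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Return weights @ x + bias using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def featurewise_affine_fusion(
    visual: FeatureMap,
    audio: Vector,
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> FeatureMap:
    """Scale and shift each visual channel with audio-derived coefficients.

    Hypothetical illustration: gamma and beta are linear functions of the
    audio embedding, with one value per visual channel.
    """
    gamma = linear(w_gamma, b_gamma, audio)  # per-channel scale
    beta = linear(w_beta, b_beta, audio)     # per-channel shift
    if len(gamma) != len(visual) or len(beta) != len(visual):
        raise ValueError("need one gamma and one beta per visual channel")
    # Affine transform applied identically at every spatial location.
    return [[[g * v + b for v in row] for row in channel]
            for channel, g, b in zip(visual, gamma, beta)]
```

In a trained network the four parameter arrays would be learned jointly with the counting model; here they are simply passed in so the arithmetic of the affine step is visible.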
