Paper Title
Panoramic Video Salient Object Detection with Ambisonic Audio Guidance
Authors
Abstract
Video salient object detection (VSOD) is a fundamental computer vision problem that has been studied extensively over the last decade. However, all existing works focus on addressing VSOD in 2D scenarios. With the rapid development of VR devices, panoramic videos have become a promising alternative to 2D videos, offering an immersive experience of the real world. In this paper, we aim to tackle video salient object detection for panoramic videos together with their corresponding ambisonic audio. We propose a multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion (ACF) blocks to conduct audio-visual interaction effectively. Equipped with spherical positional encoding, the ACF block performs fusion in the 3D context, capturing the spatial correspondence between pixels and sound sources from the equirectangular frames and ambisonic audio. Experimental results verify the effectiveness of our proposed components and demonstrate that our method achieves state-of-the-art performance on the ASOD60K dataset.
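The abstract does not specify the form of the spherical positional encoding, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that each pixel of an equirectangular frame is mapped to a unit direction vector on the sphere and that this vector is then encoded with sinusoids at a few frequencies. The function names, the axis convention, and the `num_freqs` parameter are hypothetical choices made for this example.

```python
import numpy as np


def equirect_directions(height, width):
    """Return a (height, width, 3) array of unit vectors, one per pixel centre.

    Assumed convention (not taken from the paper): longitude runs from -pi to pi
    left to right, latitude runs from +pi/2 (top row) to -pi/2 (bottom row).
    """
    lon = ((np.arange(width) + 0.5) / width) * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - ((np.arange(height) + 0.5) / height) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (height, width)
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)


def spherical_positional_encoding(height, width, num_freqs=4):
    """Sinusoidal encoding of the per-pixel direction vectors.

    Output shape is (height, width, 6 * num_freqs): sine and cosine of each of
    the three Cartesian coordinates at num_freqs power-of-two frequencies.
    """
    dirs = equirect_directions(height, width)        # (H, W, 3)
    freqs = 2.0 ** np.arange(num_freqs)              # (F,)
    scaled = dirs[..., None] * freqs                 # (H, W, 3, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(height, width, -1)


if __name__ == "__main__":
    pe = spherical_positional_encoding(64, 128, num_freqs=4)
    print(pe.shape)
```

Because the encoding is computed from 3D directions rather than 2D pixel indices, pixels at the left and right image borders, which are neighbours on the sphere, receive similar codes. An encoding with this property is one plausible way for a fusion block to relate pixel locations to the directions of sound sources in the ambisonic audio.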