Paper Title

FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification

Paper Authors

Pu Jin, Lichao Mou, Yuansheng Hua, Gui-Song Xia, Xiao Xiang Zhu

Paper Abstract

Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition due to their low cost and fast mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies, which are important for describing complicated dynamics. In this paper, we propose a novel deep neural network, termed FuTH-Net, to model not only holistic features but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by the multi-scale temporal relations in a novel fusion module to yield more discriminative video representations. More specifically, FuTH-Net employs a two-pathway architecture: (1) a holistic representation pathway to learn a general feature of both frame appearances and short-term temporal variations, and (2) a temporal relation pathway to capture multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterwards, a novel fusion module is proposed to spatiotemporally integrate the two features learned from the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results. This demonstrates its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at https://gitlab.lrz.de/ai4eo/reasoning/futh-net.
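To make the two-pathway structure described above concrete, the following is a minimal PyTorch sketch: a holistic pathway using 3-D convolutions for appearance and short-term dynamics, a temporal relation pathway that reasons over multiple scales of sampled frames, and a fusion step that refines the holistic feature with the relation feature. All module names, layer sizes, the frame-sampling scheme, and the sigmoid-gated fusion are illustrative assumptions made for this sketch, not the paper's actual design; the authors' implementation is in the linked repository.

```python
# A minimal sketch of the two-pathway idea, assuming PyTorch. Module
# names, layer sizes, frame sampling, and the gating-based fusion are
# illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class HolisticPathway(nn.Module):
    """3-D convolutions over a clip: frame appearance plus short-term
    temporal variations, pooled into one holistic feature vector."""

    def __init__(self, in_channels=3, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global spatiotemporal pooling
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, clip):  # clip: (B, C, T, H, W)
        return self.fc(self.conv(clip).flatten(1))


class TemporalRelationPathway(nn.Module):
    """Multi-scale temporal relations: for each scale k, sample k ordered
    frames spanning the whole clip and let an MLP reason jointly over them."""

    def __init__(self, frame_dim=256, dim=256, scales=(2, 3, 4)):
        super().__init__()
        self.scales = scales
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(k * frame_dim, dim), nn.ReLU(inplace=True))
            for k in scales
        )

    def forward(self, frame_feats):  # frame_feats: (B, T, D)
        B, T, D = frame_feats.shape
        relations = []
        for k, head in zip(self.scales, self.heads):
            idx = torch.linspace(0, T - 1, k).long()  # k frames, long-range span
            relations.append(head(frame_feats[:, idx].reshape(B, k * D)))
        return torch.stack(relations).sum(0)  # aggregate over scales


class FusionModule(nn.Module):
    """Refine the holistic feature with the temporal relations; plain
    sigmoid gating stands in here for the paper's fusion module."""

    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, holistic, relations):
        return holistic * self.gate(relations) + relations


class FuTHNetSketch(nn.Module):
    def __init__(self, num_classes=25, dim=256):  # 25 classes as in ERA
        super().__init__()
        self.holistic = HolisticPathway(dim=dim)
        self.frame_encoder = nn.Sequential(  # tiny per-frame 2-D encoder
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, dim),
        )
        self.relations = TemporalRelationPathway(frame_dim=dim, dim=dim)
        self.fusion = FusionModule(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clip):  # clip: (B, C, T, H, W)
        B, C, T, H, W = clip.shape
        holistic = self.holistic(clip)
        frames = clip.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        frame_feats = self.frame_encoder(frames).view(B, T, -1)
        fused = self.fusion(holistic, self.relations(frame_feats))
        return self.classifier(fused)


model = FuTHNetSketch()
logits = model(torch.randn(2, 3, 8, 64, 64))  # 2 clips of 8 frames
print(logits.shape)  # torch.Size([2, 25])
```

The gating used here is only one plausible way to let the relation feature refine the holistic one, mirroring the abstract's description that the holistic features are refined by multi-scale temporal relations; consult the released code for the actual fusion design and backbones.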
