Paper Title


End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection

Authors

Congcong Li, Xinyao Wang, Longyin Wen, Dexiang Hong, Tiejian Luo, Libo Zhang

Abstract


Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before feeding into the network, which demands considerable computational power and storage space. To that end, we propose a new end-to-end compressed video representation learning for event boundary detection that leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we first use the ConvNets to extract features of the I-frames in the GOPs. After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames based on the motion vectors, residuals and representations of their dependent I-frames. A temporal contrastive module is proposed to determine the event boundaries of video sequences. To remedy the ambiguities of annotations and speed up the training process, we use the Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD dataset demonstrate that the proposed method achieves comparable results to the state-of-the-art methods with $4.5\times$ faster running speed.
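The abstract mentions preprocessing the ground-truth event boundaries with a Gaussian kernel to remedy annotation ambiguity. A minimal sketch of this idea is shown below, assuming frame-level binary boundary labels are turned into soft targets; the function name, the `sigma` parameter, and the use of a per-boundary max are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def soften_boundaries(boundary_indices, num_frames, sigma=1.0):
    """Convert hard 0/1 boundary annotations into soft Gaussian targets.

    Each annotated boundary frame contributes a Gaussian bump centered on
    its index; overlapping bumps are merged with an element-wise max so the
    target stays in [0, 1].
    """
    t = np.arange(num_frames, dtype=np.float64)
    target = np.zeros(num_frames, dtype=np.float64)
    for b in boundary_indices:
        # Gaussian bump centered at the annotated boundary frame b.
        target = np.maximum(target, np.exp(-0.5 * ((t - b) / sigma) ** 2))
    return target

# Example: one boundary annotated at frame 5 in an 11-frame clip.
soft = soften_boundaries([5], num_frames=11, sigma=1.0)
```

Soft targets like this tolerate small disagreements between annotators (a prediction one frame off a boundary is still partially rewarded), which is the stated motivation for this preprocessing step.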
