Paper Title

PatchBlender: A Motion Prior for Video Transformers

Paper Authors

Gabriele Prato, Yale Song, Janarthanan Rajendran, R Devon Hjelm, Neel Joshi, Sarath Chandar

Paper Abstract

Transformers have become one of the dominant architectures in the field of computer vision. However, several challenges remain when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the baseline performance of video Transformers. PatchBlender has the advantage of being compatible with almost any Transformer architecture, and since it is learnable, the model can adaptively turn the prior on or off. It is also extremely lightweight compute-wise, requiring only 0.005% of the GFLOPs of a ViT-B.
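The abstract describes PatchBlender as a learnable blending function applied to patch embeddings across the temporal dimension. The paper's exact formulation is not reproduced here, but one plausible minimal sketch is a learned T×T matrix (T = number of frames), softmax-normalized per row, that mixes each patch embedding with the embeddings of the same patch at other frames. The class name `PatchBlenderSketch` and the near-identity initialization are illustrative assumptions, not the authors' implementation (NumPy is used in place of a deep learning framework):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PatchBlenderSketch:
    """Hypothetical sketch of a learnable temporal blending prior:
    a T x T weight matrix mixes each patch embedding with the same
    patch position at other frames."""

    def __init__(self, num_frames, seed=0):
        rng = np.random.default_rng(seed)
        # Initialized near the identity so blending starts close to a
        # no-op; training could learn to mix frames or suppress the prior.
        self.weights = np.eye(num_frames) + 0.01 * rng.standard_normal(
            (num_frames, num_frames)
        )

    def __call__(self, x):
        # x: (T, N, D) = frames, patches per frame, embedding dimension.
        w = softmax(self.weights, axis=-1)  # each row sums to 1
        # Blend along the temporal axis: out[t] = sum_s w[t, s] * x[s]
        return np.einsum("ts,snd->tnd", w, x)

# Toy usage: 8 frames, 16 patches per frame, 32-dim embeddings.
T, N, D = 8, 16, 32
x = np.random.default_rng(1).standard_normal((T, N, D))
blender = PatchBlenderSketch(T)
y = blender(x)
print(y.shape)  # (8, 16, 32)
```

Because the blend is a single small matrix multiply over the frame axis, its cost is negligible next to the attention layers, which is consistent with the abstract's claim that the prior adds only a tiny fraction of a ViT-B's GFLOPs.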
