Paper Title
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Paper Authors
Paper Abstract
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
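To make the third component concrete: a cross-modal cycle-consistency loss of this kind typically maps a clip embedding to its soft nearest neighbor among the paired sentence embeddings, cycles back to the clip sequence, and penalizes how far the cycled-back position lands from the starting clip. The sketch below is a minimal PyTorch illustration of that soft nearest-neighbor cycle under these assumptions; the function name and the plain MSE penalty are my own choices, not necessarily the authors' exact formulation (see the linked repository for the real implementation).

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(clip_emb: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
    """Soft nearest-neighbor cycle clip -> sentences -> clips (hypothetical sketch).

    clip_emb: (n, d) clip embeddings of one video
    sent_emb: (m, d) sentence embeddings of the paired paragraph
    """
    # Step 1: each clip attends to its soft nearest neighbor among the sentences.
    sim_cs = clip_emb @ sent_emb.t()                 # (n, m) similarity scores
    alpha = F.softmax(sim_cs, dim=1)                 # attention over sentences
    nn_sent = alpha @ sent_emb                       # (n, d) soft sentence neighbors

    # Step 2: cycle back from the soft neighbors to a distribution over clips.
    sim_sc = nn_sent @ clip_emb.t()                  # (n, n)
    beta = F.softmax(sim_sc, dim=1)                  # attention over clips

    # Step 3: the expected (soft) landing index should match the start index.
    idx = torch.arange(clip_emb.size(0), dtype=clip_emb.dtype, device=clip_emb.device)
    soft_loc = beta @ idx                            # (n,) expected cycled-back position
    return F.mse_loss(soft_loc, idx)


# Usage with random embeddings: 8 clips, 5 sentences, 256-dim features.
loss = cycle_consistency_loss(torch.randn(8, 256), torch.randn(5, 256))
```

A loss of this shape is differentiable end to end, so it can be added to the contrastive alignment objective to pull corresponding clips and sentences together without requiring clip-sentence labels beyond the video-paragraph pairing.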