论文标题
Eclipse:使用视觉和声音有效的远程视频检索
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
论文作者
论文摘要
我们介绍了一种视听方法,用于远程文本到视频检索。与以前专为简短视频检索设计的方法(例如,持续时间为5-15秒)不同,我们的方法旨在检索捕获复杂人类动作的长时间视频。仅标准视频方法的一个挑战是与从如此长的视频中处理数百个密集提取的框架相关的大量计算成本。为了解决这个问题,我们建议用紧凑的音频提示替换视频的部分,这些线索简洁地汇总了动态音频事件,并且处理便宜。我们的方法称为Eclipse(带有声音编码的有效剪辑),通过添加一个统一的视听变压器块,将流行的剪辑模型调整为视听视频设置,该块从视频和音频流中捕获互补的提示。除了比仅远程视频的方法快2.92倍和2.34倍的记忆效率外,我们的方法还可以在几种不同的远程视频数据集(例如ActivityNet,QvHighlights,Youcook2,didemo和charades)上获得更好的文本到视频检索准确性。
We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short video retrieval (e.g., 5-15 seconds in duration), our approach aims to retrieve minute-long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost associated with processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting, by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being 2.92x faster and 2.34x memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo and Charades.