Paper Title
Video Understanding as Machine Translation
Paper Authors
Paper Abstract
With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).
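To make the contrast with contrastive metric learning concrete, below is a minimal sketch (not the authors' released code) of the generative "translation" objective the abstract describes: a model decodes text tokens (e.g., transcribed speech) from video features with ordinary next-token cross-entropy, so no negative pairs or sampling curriculum are needed. All module names, dimensions, and the toy tensors here are illustrative assumptions.

```python
# A minimal sketch of a video-to-text "translation" objective, assuming
# precomputed per-clip video features and tokenized target text.
import torch
import torch.nn as nn

class VideoToTextTranslator(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, video_feat_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # map video features into model space
        self.token_emb = nn.Embedding(vocab_size, d_model)    # embed target-text tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)         # predict the next token

    def forward(self, video_feats, text_tokens):
        # video_feats: (batch, n_clips, video_feat_dim); text_tokens: (batch, seq_len)
        memory = self.video_proj(video_feats)
        tgt = self.token_emb(text_tokens)
        causal_mask = self.transformer.generate_square_subsequent_mask(text_tokens.size(1))
        hidden = self.transformer(memory, tgt, tgt_mask=causal_mask)
        return self.lm_head(hidden)

# Training step: plain teacher-forced cross-entropy. Unlike a contrastive
# objective, no negative samples or pair-selection curriculum are required.
model = VideoToTextTranslator()
video_feats = torch.randn(2, 8, 512)           # toy batch: 2 videos, 8 clip features each
text_tokens = torch.randint(0, 1000, (2, 12))  # toy target token ids
logits = model(video_feats, text_tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), text_tokens[:, 1:].reshape(-1)
)
loss.backward()
```

Because the loss is computed only over the target sequence, the per-step cost is independent of batch size, whereas contrastive objectives typically rely on large batches to supply enough negatives.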