The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

Paper title

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

Authors

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

Abstract

Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict non-scripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos after recording, thus reflecting true intention, and we crowd-sourced ground-truths based on these narrations. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions, e.g. 'closing a tap' from 'opening' it.
