Paper Title

PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

Authors

Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, Limin Wang

Abstract

Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting might be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances from a multi-label untrimmed video. Multi-label TAD is more challenging, as it requires fine-grained class discrimination within a single video and precise localization of the co-occurring instances. To mitigate this issue, we extend the sparse query-based detection paradigm from traditional TAD and propose the multi-label TAD framework of PointTAD. Specifically, our PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism to localize the discriminative frames at boundaries as well as the important frames inside the action. Moreover, we perform the action decoding process with the Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, our PointTAD employs an end-to-end trainable framework based solely on RGB input for easy deployment. We evaluate our proposed method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric, and also achieves promising results under the segmentation-mAP metric. Code is available at https://github.com/MCG-NJU/PointTAD.
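To make the point-based representation concrete, here is a minimal NumPy sketch of the core idea: each action query carries a small set of learnable normalized temporal positions ("query points"), which sample frame-level features by linear interpolation and induce a detected segment. All names, shapes, and the min/max segment decoding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, C = 64, 8          # frames in the clip, feature channels
num_queries = 4       # sparse set of action queries
num_points = 6        # learnable query points per query (assumed count)

frame_feats = rng.standard_normal((T, C))                        # per-frame RGB features
query_points = rng.uniform(0.0, 1.0, (num_queries, num_points))  # normalized positions in [0, 1]

def sample_point_features(feats, points):
    """Gather frame features at fractional point locations via
    linear interpolation along the temporal axis."""
    t = points * (feats.shape[0] - 1)            # map [0, 1] -> [0, T-1]
    lo = np.floor(t).astype(int)
    hi = np.minimum(lo + 1, feats.shape[0] - 1)
    w = (t - lo)[..., None]
    return (1 - w) * feats[lo] + w * feats[hi]   # (num_queries, num_points, C)

point_feats = sample_point_features(frame_feats, query_points)
print(point_feats.shape)  # (4, 6, 8)

# One simple way to decode a segment from the points: take the
# min/max point positions as the (start, end) of each action query.
segments = np.stack([query_points.min(axis=1), query_points.max(axis=1)], axis=1)
print(segments.shape)     # (4, 2)
```

In the paper, the sampled point features would then be refined by the Multi-level Interactive Module to capture point-level and instance-level semantics; this sketch only shows the sampling and segment-decoding step.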
