Paper Title
Do we really need temporal convolutions in action segmentation?
Paper Authors
Paper Abstract
Action classification has made great progress, but segmenting and recognizing actions from long untrimmed videos remains a challenging problem. Most state-of-the-art methods focus on designing temporal-convolution-based models, but the inflexibility of temporal convolutions and the difficulty of modeling long-term temporal dependencies restrict the potential of these models. Transformer-based models, with their adaptability and sequence-modeling capability, have recently been applied to various tasks. However, the lack of inductive bias and the inefficiency of handling long video sequences limit the application of Transformers to action segmentation. In this paper, we design a pure Transformer-based model without temporal convolutions by incorporating temporal sampling, called the Temporal U-Transformer (TUT). The U-Transformer architecture reduces complexity while introducing an inductive bias that adjacent frames are more likely to belong to the same class, but the coarse resolutions it introduces lead to misclassification near action boundaries. We observe that the distribution of similarity between a boundary frame and its neighboring frames depends on whether the boundary frame is the start or the end of an action segment. Therefore, we further propose a boundary-aware loss based on the distribution of similarity scores between frames from the attention modules, which enhances the model's ability to recognize boundaries. Extensive experiments demonstrate the effectiveness of our model.
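To make the architectural idea concrete, below is a minimal PyTorch sketch of a temporal U-Transformer: self-attention blocks interleaved with temporal pooling on the way down and interpolation plus skip connections on the way up, producing per-frame class logits. The module names, feature dimension, pooling choice, and class count are illustrative assumptions, not the paper's exact design, and the boundary-aware loss is omitted here.

```python
# Hypothetical sketch of a temporal U-Transformer for action segmentation.
# Attention runs at progressively coarser temporal resolutions, encoding
# the inductive bias that adjacent frames likely share a class label.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalUTransformer(nn.Module):
    def __init__(self, dim=64, num_classes=19, depth=3, heads=4):
        super().__init__()

        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=2 * dim,
                batch_first=True)

        # Encoder: attend at each scale, then halve the temporal length.
        self.down_blocks = nn.ModuleList([make_layer() for _ in range(depth)])
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        # Decoder: restore the temporal length, attend again at each scale.
        self.up_blocks = nn.ModuleList([make_layer() for _ in range(depth)])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x):  # x: (batch, T, dim) pre-extracted frame features
        skips = []
        for blk in self.down_blocks:
            x = blk(x)
            skips.append(x)
            # Average-pool over time: adjacent frames are merged, so
            # attention at the next level sees a coarser sequence.
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        for blk in self.up_blocks:
            skip = skips.pop()
            # Nearest-neighbor upsampling back to the skip's length.
            x = F.interpolate(x.transpose(1, 2),
                              size=skip.size(1)).transpose(1, 2)
            x = blk(x + skip)  # fuse the fine-resolution skip connection
        return self.classifier(x)  # per-frame class logits


frames = torch.randn(1, 80, 64)       # 80 frames of 64-d features (dummy)
logits = TemporalUTransformer()(frames)
print(logits.shape)                   # torch.Size([1, 80, 19])
```

Pooling the sequence before each deeper attention block is also what keeps the cost manageable: self-attention is quadratic in sequence length, so halving the temporal resolution at every level sharply reduces the complexity of attending over long untrimmed videos.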