Paper Title
SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition
Paper Authors
Paper Abstract
Learning an egocentric action recognition model from video data is challenging due to distractors in the background, e.g., irrelevant objects. Further integrating object information into an action model is hence beneficial. Existing methods often leverage a generic object detector to identify and represent the objects in the scene. However, several important issues remain. First, good-quality object class annotations for the target domain (dataset) are still required to learn good object representations. Second, previous methods deeply couple existing action models with object representations and need to retrain them jointly, leading to costly and inflexible integration. To overcome both limitations, we introduce Self-Supervised Learning Over Sets (SOS), an approach to pre-train a generic Objects In Contact (OIC) representation model from video object regions detected by an off-the-shelf hand-object contact detector. Instead of augmenting object regions individually as in conventional self-supervised learning, we view the action process as a means of natural data transformations with unique spatio-temporal continuity and exploit the inherent relationships among per-video object sets. Extensive experiments on two datasets, EPIC-KITCHENS-100 and EGTEA, show that our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
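To make the set-level pre-training idea concrete, below is a minimal sketch of how per-video sets of hand-object crops could be pre-trained with a contrastive objective. It assumes a SimCLR-style InfoNCE loss as a stand-in for the paper's actual set-based objective; the names OICEncoder and set_info_nce, the ResNet-18 backbone, and mean pooling over crops are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): contrastive pre-training over
# per-video SETS of object crops. Two disjoint crop subsets from the same
# video act as positive views; subsets from other videos are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class OICEncoder(nn.Module):  # hypothetical name
    """Embed a set of object crops from one video into one vector."""
    def __init__(self, dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                # 512-d crop features
        self.backbone = backbone
        self.proj = nn.Sequential(nn.Linear(512, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, crops):                      # (num_crops, 3, H, W)
        feats = self.backbone(crops)               # (num_crops, 512)
        set_feat = feats.mean(dim=0)               # permutation-invariant pooling
        return F.normalize(self.proj(set_feat), dim=-1)

def set_info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between paired set embeddings of shape (batch, dim)."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 4 videos, each contributing two temporally disjoint sets of
# 5 detected hand-object crops (random tensors stand in for real crops).
encoder = OICEncoder()
views_a = [torch.randn(5, 3, 112, 112) for _ in range(4)]
views_b = [torch.randn(5, 3, 112, 112) for _ in range(4)]
z_a = torch.stack([encoder(v) for v in views_a])
z_b = torch.stack([encoder(v) for v in views_b])
loss = set_info_nce(z_a, z_b)                      # scalar training loss
```

In this sketch, mean pooling keeps the set representation permutation-invariant, and the temporal split of each video's crop set plays the role that synthetic augmentations play in standard contrastive learning, mirroring the paper's view of the action process as a source of natural data transformations.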