Paper Title
D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations
Paper Authors
Paper Abstract
This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using video-level supervision. Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term for enhancing temporal action localization. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising loss term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, thereby leading to more robust predictions. Comprehensive experiments are performed on multiple benchmarks, including THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably in comparison to the existing methods on all datasets, achieving gains as high as 2.3% in terms of mAP at IoU=0.5 on THUMOS14. Source code is available at https://github.com/naraysa/D2-Net
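The reported metric, mAP at IoU=0.5, rests on the temporal intersection-over-union between a predicted segment and a ground-truth segment. The sketch below illustrates only that overlap criterion; it is not the authors' evaluation code, and the function names `temporal_iou` and `is_true_positive` and the `(start, end)` tuple convention are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments, each given as (start, end) in seconds."""
    # Overlap is the gap between the later start and the earlier end, floored at 0.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union is the sum of both durations minus the shared part.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth segment if their IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

A full mAP computation would additionally rank predictions by confidence, match each ground-truth segment at most once, and average the per-class precision; those steps are omitted here.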