Paper Title
Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking
Paper Authors
Paper Abstract
Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of twelve state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best one for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128, and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.
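
To make the two feature streams described in the abstract more concrete, the following is a minimal, hypothetical Python sketch of extracting appearance feature maps from a ResNet-based FEN, extracting semantic segmentation maps from a fully convolutional FEN, fusing them into one target representation, and deriving a per-pixel semantic weighting mask. The ResNet-50 backbone, the layer3 feature stage, the FCN-ResNet50 segmentation head, and the "1 minus background probability" weighting are illustrative assumptions, not the exact networks or configuration evaluated in the paper.

```python
# Hypothetical sketch (not the paper's implementation): ResNet appearance
# features + FCN semantic maps, fused and used to build a weighting mask.
import torch
import torch.nn.functional as F
import torchvision

# Pretrained networks standing in for the ResNet-based FEN and the fully
# convolutional FEN (illustrative choices; requires torchvision >= 0.13).
resnet = torchvision.models.resnet50(weights="DEFAULT").eval()
segnet = torchvision.models.segmentation.fcn_resnet50(weights="DEFAULT").eval()


@torch.no_grad()
def extract_fused_features(frame: torch.Tensor):
    """frame: (1, 3, H, W) normalized image tensor of the search region."""
    # ResNet-based FEN: take an intermediate convolutional stage as the
    # appearance feature map (layer3 is an illustrative choice).
    x = resnet.conv1(frame)
    x = resnet.bn1(x)
    x = resnet.relu(x)
    x = resnet.maxpool(x)
    x = resnet.layer1(x)
    x = resnet.layer2(x)
    appearance = resnet.layer3(x)                      # (1, 1024, H/16, W/16)

    # Fully convolutional FEN: per-pixel class probabilities for the frame.
    seg_logits = segnet(frame)["out"]                  # (1, 21, H, W)
    seg_probs = seg_logits.softmax(dim=1)

    # Fusion: resize semantic maps to the appearance resolution and
    # concatenate channel-wise to strengthen the target representation.
    seg_small = F.interpolate(seg_probs, size=appearance.shape[-2:],
                              mode="bilinear", align_corners=False)
    fused = torch.cat([appearance, seg_small], dim=1)

    # Semantic weighting mask: confidence that a pixel is not background
    # (class 0), usable to down-weight distracting background regions.
    weight = 1.0 - seg_probs[:, 0:1]                   # (1, 1, H, W)
    return fused, weight
```

In a DCF-style tracker, the fused tensor would feed the correlation filter learning step, while the weighting mask could modulate the response map or the training samples on each frame; how exactly the weights enter the filter objective is specific to the paper's formulation and is not reproduced here.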