Paper Title
Tamed Warping Network for High-Resolution Semantic Video Segmentation
Authors
Abstract
Recent approaches for fast semantic video segmentation have reduced redundancy by warping feature maps across adjacent frames, greatly speeding up the inference phase. However, accuracy drops severely owing to the errors incurred by warping. In this paper, we propose a novel framework with a simple and effective correction stage after warping. Specifically, we build a non-key-frame CNN that fuses warped context features with the current frame's spatial details. On top of this feature fusion, our Context Feature Rectification~(CFR) module learns the difference from a per-frame model to correct the warped features. Furthermore, our Residual-Guided Attention~(RGA) module utilizes the residual maps in the compressed domain to help CFR focus on error-prone regions. Results on Cityscapes show that the accuracy significantly increases from $67.3\%$ to $71.6\%$, while the speed edges down only from $65.5$ FPS to $61.8$ FPS at a resolution of $1024\times 2048$. For non-rigid categories, e.g., ``human'' and ``object'', the improvements exceed 18 percentage points.
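The warp-then-correct idea described in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the actual CFR and RGA modules are learned CNNs, whereas `rectify` below uses a hand-crafted attention weight derived from the residual map's magnitude purely for illustration. `bilinear_warp` shows the standard backward-warping of a feature map by a flow field, which is the step whose errors the correction stage targets.

```python
import numpy as np

def bilinear_warp(feat, flow):
    """Backward-warp a feature map by a flow field with bilinear sampling.

    feat: (C, H, W) features from the key frame.
    flow: (2, H, W) flow; flow[0] is the x-offset, flow[1] the y-offset.
    Sample coordinates are clamped to the image border.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(xs + flow[0], 0, W - 1)
    sy = np.clip(ys + flow[1], 0, H - 1)
    # Integer corners and fractional weights for bilinear interpolation.
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    return (feat[:, y0, x0] * (1 - wx) * (1 - wy)
          + feat[:, y0, x1] * wx * (1 - wy)
          + feat[:, y1, x0] * (1 - wx) * wy
          + feat[:, y1, x1] * wx * wy)

def rectify(warped_ctx, spatial, residual):
    """Toy stand-in for CFR + RGA: blend warped context features with
    current spatial details, trusting the spatial branch more where the
    compressed-domain residual (a proxy for motion error) is large.

    All three inputs are (C, H, W); `residual` is broadcast per channel.
    """
    att = np.abs(residual) / (np.abs(residual).max() + 1e-8)  # in [0, 1]
    return warped_ctx * (1 - att) + spatial * att
```

With zero flow the warp is an identity, and a constant integer flow reduces to a border-clamped shift, which makes the routine easy to sanity-check before plugging in real optical flow.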