Paper Title

InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding

Paper Authors

Hanrong Ye, Dan Xu

Paper Abstract

Multi-task dense scene understanding is a thriving research domain that requires simultaneous perception and reasoning on a series of correlated tasks with pixel-wise prediction. Most existing works suffer from a severe limitation to local modeling due to their heavy reliance on convolution operations, whereas learning interactions and inference in a global spatial-position and multi-task context is critical for this problem. In this paper, we propose a novel end-to-end Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous modeling of spatial positions and multiple tasks in a unified framework. To the best of our knowledge, this is the first work to explore designing a transformer structure for multi-task dense prediction in scene understanding. Besides, it is widely demonstrated that a higher spatial resolution is remarkably beneficial for dense prediction, while it is very challenging for existing transformers to go deeper with higher resolutions due to the huge computational complexity at large spatial sizes. InvPT presents an efficient UP-Transformer block to learn multi-task feature interaction at gradually increased resolutions, which also incorporates effective self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at a high resolution. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms previous state-of-the-art methods. The code is available at https://github.com/prismformore/InvPT.
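
To make the architectural idea concrete, below is a minimal PyTorch sketch of what an UP-Transformer-style block could look like: it upsamples a set of per-task feature maps and then lets tokens from all spatial positions of all tasks interact through shared self-attention. The class name `UPTransformerBlock` and all hyperparameters here are illustrative assumptions, not the authors' code; the actual implementation (including the efficient attention and multi-scale feature aggregation the abstract mentions) is in the repository linked above.

```python
import torch
import torch.nn as nn


class UPTransformerBlock(nn.Module):
    """Illustrative sketch (not the official code): one inverted-pyramid
    stage that upsamples per-task feature maps, then applies joint
    self-attention across all spatial positions of all tasks."""

    def __init__(self, dim, num_heads=4, up_scale=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=up_scale, mode="bilinear",
                              align_corners=False)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Plain multi-head attention for clarity; the paper uses a more
        # efficient attention so the block can afford large spatial sizes.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, task_feats):
        # task_feats: list of (B, C, H, W) maps, one per task.
        task_feats = [self.up(f) for f in task_feats]  # raise resolution
        b, c, h, w = task_feats[0].shape
        # Flatten each map to (B, H*W, C) tokens and concatenate along the
        # sequence axis, so attention mixes positions *and* tasks.
        tokens = torch.cat(
            [f.flatten(2).transpose(1, 2) for f in task_feats], dim=1)
        x = self.norm1(tokens)
        attn_out, _ = self.attn(x, x, x)  # cross-task message passing
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        # Split the sequence back into per-task maps at the new resolution.
        return [t.transpose(1, 2).reshape(b, c, h, w)
                for t in tokens.chunk(len(task_feats), dim=1)]


# Toy usage: three tasks, each starting from a 16x16 feature map.
block = UPTransformerBlock(dim=64)
feats = [torch.randn(1, 64, 16, 16) for _ in range(3)]
outs = block(feats)
print([o.shape for o in outs])  # each torch.Size([1, 64, 32, 32])
```

Concatenating all tasks' tokens into a single sequence is the simplest way to realize the "spatial positions and multiple tasks in a unified framework" idea from the abstract; its cost grows quadratically with (#tasks × H × W), which is exactly why the paper emphasizes an efficient attention design for the higher-resolution stages.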
