Paper Title

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Paper Authors

Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang

Paper Abstract

Large-scale vision foundation models have made significant progress on visual tasks for natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks, investigating how such large models perform. To handle the large image size and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representations by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transfer learning.
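The abstract does not spell out the rotated varied-size window attention mechanism itself, but the core cost argument — restricting attention to local windows instead of attending over all tokens — can be illustrated with a minimal sketch. The following NumPy toy uses fixed, axis-aligned, non-overlapping windows with identity Q/K/V projections; the rotation, learned window sizes, and learned projections of the actual method are all omitted and the function name is our own:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, win=8):
    """Single-head self-attention restricted to non-overlapping win x win
    windows of an (H, W, C) feature map. Per-layer attention cost drops
    from O((H*W)^2 * C) for full attention to O(H*W * win^2 * C).
    Simplification: identity Q/K/V projections instead of learned ones."""
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "toy version: sizes divide evenly"
    out = np.empty_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            tokens = x[i:i + win, j:j + win].reshape(-1, C)  # (win*win, C)
            attn = softmax(tokens @ tokens.T / np.sqrt(C))   # window-local
            out[i:i + win, j:j + win] = (attn @ tokens).reshape(win, win, C)
    return out

x = np.random.randn(64, 64, 32)
y = window_attention(x, win=8)
print(y.shape)  # (64, 64, 32)
```

With a 64x64 token map, full attention scores form a 4096x4096 matrix, while each window here only builds a 64x64 matrix — the quadratic term now scales with window area rather than image area, which is what makes large RS inputs tractable.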
