Paper title
Contextual Learning in Fourier Complex Field for VHR Remote Sensing Images
Paper authors
Abstract
Very high-resolution (VHR) remote sensing (RS) image classification is a fundamental task in RS image analysis and understanding. Recently, transformer-based models have demonstrated outstanding potential for learning high-order contextual relationships from natural images at common resolutions (224 × 224 pixels) and have achieved remarkable results on general image classification tasks. However, the complexity of the naive transformer grows quadratically as image size increases, which prevents transformer-based models from being applied to VHR RS image (500 × 500 pixels) classification and other computationally expensive downstream tasks. To this end, we decompose the expensive self-attention (SA) into real and imaginary parts via the discrete Fourier transform (DFT) and propose an efficient complex self-attention (CSA) mechanism. Benefiting from the conjugate symmetry property of the DFT, CSA is capable of modeling high-order contextual information with less than half the computation of naive SA. To overcome gradient explosion in the Fourier complex field, we replace the Softmax function with a carefully designed Logmax function to normalize the attention map of CSA and stabilize gradient propagation. By stacking multiple layers of CSA blocks, we propose the Fourier Complex Transformer (FCT) model to learn global contextual information from VHR aerial images in a hierarchical manner. Extensive experiments conducted on commonly used RS classification data sets demonstrate the effectiveness and efficiency of FCT, especially on very high-resolution RS images.
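The conjugate symmetry that the abstract relies on can be illustrated with a small standalone sketch. It is not the authors' CSA or Logmax implementation, neither of which the abstract specifies. It only shows the underlying DFT property: for a real-valued input of length N, the coefficients satisfy X[k] = conj(X[N-k]), so only N//2 + 1 of them carry independent information. The helper names (`dft`, `half_spectrum`, `restore_full_spectrum`) and the sample values in `tokens` are hypothetical and chosen purely for illustration.

```python
import cmath


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a sequence of numbers."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def is_conjugate_symmetric(spectrum, tol=1e-9):
    """Check X[k] == conj(X[N-k]) for k = 1..N-1 (holds for real inputs)."""
    n = len(spectrum)
    return all(
        abs(spectrum[k] - spectrum[n - k].conjugate()) < tol
        for k in range(1, n)
    )


def half_spectrum(spectrum):
    """Keep the N//2 + 1 non-redundant coefficients of a real signal's DFT."""
    return spectrum[: len(spectrum) // 2 + 1]


def restore_full_spectrum(half, n):
    """Rebuild all N coefficients from the non-redundant half via symmetry."""
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full


# Hypothetical real-valued feature row standing in for one channel of tokens.
tokens = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -0.75, 4.0]

spectrum = dft(tokens)
half = half_spectrum(spectrum)
rebuilt = restore_full_spectrum(half, len(tokens))

print(is_conjugate_symmetric(spectrum))
print(len(half), "of", len(spectrum), "coefficients kept")
```

Because the discarded coefficients can be recovered exactly from the kept ones, any operation applied in the frequency domain needs to touch only about half of the spectrum. This is one plausible reading of how the "less than half the computation" claim arises, though the abstract does not spell out the exact accounting.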