Paper Title
The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training
Paper Authors
Paper Abstract
The self-supervised Masked Image Modeling (MIM) schema, following the "mask-and-reconstruct" pipeline of recovering contents from a masked image, has recently attracted increasing interest in the multimedia community, owing to its excellent ability to learn visual representations from unlabeled data. Aiming at learning representations with high-level semantics, one group of works attempts to reconstruct non-semantic pixels with a large-ratio masking strategy, which may suffer from the "over-smoothing" problem, while others directly infuse semantics into the targets in an off-line way that requires extra data. Different from them, we shift the perspective to the Fourier domain, which naturally has a global view, and present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training. Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space, where each serves as not only a complement to but also a reciprocal constraint on the other. In this way, more robust representations can be learned in the pre-trained encoder, whose effectiveness is confirmed by experimental results on downstream recognition tasks. We also conduct several quantitative and qualitative experiments to investigate the learning behavior of our method. To the best of our knowledge, this is the first MIM work to approach visual pre-training through the lens of the frequency domain.
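The dual-decoder idea in the abstract can be illustrated with a minimal sketch: one reconstruction objective measured in pixel space and a second measured on the 2-D Fourier transform of the same content, so that every frequency coefficient carries a global view of the image. This is not the authors' code; the function name `ge2ae_losses`, the plain MSE formulation, and the amplitude/phase split are illustrative assumptions, ignoring masking, patching, and decoder details.

```python
import numpy as np

def ge2ae_losses(pred, target):
    """Illustrative sketch of paired pixel- and frequency-space
    reconstruction losses (not the paper's actual loss definition)."""
    # Pixel-space loss: plain mean-squared error on raw intensities.
    pixel_loss = np.mean((pred - target) ** 2)

    # Frequency-space loss: compare the 2-D FFTs of the two images.
    # Each Fourier coefficient depends on every pixel, giving this
    # term the "global perspective" the abstract refers to.
    pred_f = np.fft.fft2(pred)
    target_f = np.fft.fft2(target)
    amp_loss = np.mean((np.abs(pred_f) - np.abs(target_f)) ** 2)
    phase_loss = np.mean((np.angle(pred_f) - np.angle(target_f)) ** 2)

    return pixel_loss, amp_loss + phase_loss
```

In a geminated setup, the two losses would be summed (possibly with weights) so that the pixel and frequency branches constrain each other during pre-training.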