Title


MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Authors

Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan

Abstract

Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
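The core mechanism described above is a variable masking ratio during masked image modeling: a ratio is drawn per training example, with high ratios driving generative behavior and lower ratios supporting representation learning. The following sketch illustrates that idea on a sequence of vector-quantized tokens; the truncated-Gaussian parameters (`mean`, `std`, bounds) and the `mask_token_id` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sample_masking_ratio(rng, mean=0.55, std=0.25, lo=0.5, hi=1.0):
    # Rejection-sample a Gaussian truncated to [lo, hi].
    # Assumed parameters for illustration only.
    while True:
        r = rng.normal(mean, std)
        if lo <= r <= hi:
            return r

def mask_tokens(tokens, mask_token_id, rng):
    # Replace a randomly chosen subset of VQ tokens with a mask token,
    # with the subset size set by the sampled masking ratio.
    tokens = tokens.copy()
    n = tokens.shape[0]
    ratio = sample_masking_ratio(rng)
    n_mask = int(np.ceil(ratio * n))
    idx = rng.choice(n, size=n_mask, replace=False)
    tokens[idx] = mask_token_id
    return tokens, idx

rng = np.random.default_rng(0)
tokens = np.arange(256)  # stand-in for a 16x16 grid of VQGAN token ids
masked, idx = mask_tokens(tokens, mask_token_id=-1, rng=rng)
```

Under this scheme a single model sees both near-fully-masked inputs (generation-like) and moderately masked inputs (representation-like), which is what lets one training framework serve both tasks.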
