Paper Title
Self-supervision through Random Segments with Autoregressive Coding (RandSAC)
Paper Authors
Paper Abstract
Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and by advances in recent visual architecture design with Vision Transformers (ViTs), in this paper we explore the effect various design choices have on the success of applying such training strategies to visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged segments; within each segment, tokens are predicted in parallel, similar to BERT, while across-segment predictions are sequential, similar to GPT. We illustrate that randomized serialization of the segments significantly improves performance and results in a distribution over spatially long (across-segment) and short (within-segment) predictions that is effective for feature learning. We illustrate the pertinence of these design choices and explore alternatives on a number of datasets (e.g., CIFAR10, CIFAR100, ImageNet). While our pre-training strategy works with a vanilla Transformer, we also propose a conceptually simple, but highly effective, addition to the decoder that allows learnable skip-connections to the encoder's feature layers, which further improves performance.
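To make the serialization scheme described above concrete, here is a minimal Python sketch of a RandSAC-style prediction schedule. It assumes a square patch grid partitioned into fixed-size square segments and a flat (non-hierarchical) random ordering; the names `make_segments` and `randsac_schedule` are illustrative and not taken from the paper's code.

```python
import random

def make_segments(num_patches_per_side=8, seg_size=2):
    """Group patch indices of a square grid into square segments.
    (A simple segmentation choice; the paper also explores
    hierarchically arranged segments.)"""
    n = num_patches_per_side
    segments = []
    for r in range(0, n, seg_size):
        for c in range(0, n, seg_size):
            seg = [(r + dr) * n + (c + dc)
                   for dr in range(seg_size) for dc in range(seg_size)]
            segments.append(seg)
    return segments

def randsac_schedule(segments, seed=None):
    """Randomly serialize the segments. Within each segment, all tokens
    are predicted in parallel (BERT-like); across segments, prediction
    is sequential, conditioning on previously revealed segments (GPT-like)."""
    rng = random.Random(seed)
    order = segments[:]
    rng.shuffle(order)
    steps = []
    context = []
    for seg in order:
        # (visible context tokens, tokens predicted in parallel this step)
        steps.append((list(context), list(seg)))
        context.extend(seg)
    return steps

if __name__ == "__main__":
    segs = make_segments(num_patches_per_side=4, seg_size=2)  # 4x4 grid, four 2x2 segments
    for t, (ctx, tgt) in enumerate(randsac_schedule(segs, seed=0)):
        print(f"step {t}: condition on {len(ctx)} tokens, predict {tgt}")
```

Because the segment order is re-randomized per training example, some steps predict a segment adjacent to already-visible tokens (spatially short prediction) while others predict a distant segment (spatially long prediction), giving the mixture of prediction ranges the abstract credits for effective feature learning.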