Paper Title
Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning
Paper Authors
Paper Abstract
The Vision Transformer (ViT) architecture has been shown to be competitive in the computer vision (CV) space, where it has dethroned convolution-based networks in several benchmarks. Nevertheless, convolutional neural networks (CNNs) remain the preferred architecture for the representation module in reinforcement learning. In this work, we study pretraining a Vision Transformer with several state-of-the-art self-supervised methods and assess the quality of the learned representations. To show the importance of the temporal dimension in this context, we propose an extension of VICReg that better captures temporal relations between observations by adding a temporal order verification task. Our results show that all methods are effective at learning useful representations and at avoiding representational collapse on observations from the Arcade Learning Environment (ALE), which leads to improvements in data efficiency when evaluated in reinforcement learning (RL). Moreover, the encoder pretrained with the temporal order verification task shows the best results across all experiments, with richer representations, more focused attention maps, and sparser representation vectors throughout the layers of the encoder, which shows the importance of exploring this similarity dimension. With this work, we hope to provide some insights into the representations learned by a ViT during self-supervised pretraining on observations from RL environments, and into the properties that arise in the representations that lead to the best-performing agents. The source code will be available at: https://github.com/mgoulao/TOV-VICReg
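
To make the combination of VICReg with a temporal order verification objective concrete, the following is a minimal PyTorch sketch. It is an illustrative assumption of how such a combined objective could be wired up, not the authors' TOV-VICReg implementation; the function and class names (vicreg_loss, TemporalOrderHead, temporal_order_loss), the loss weights, and the training-step comment at the end are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


def off_diagonal(m):
    # Keep only the off-diagonal entries of a square matrix.
    return m - torch.diag(torch.diag(m))


def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    # Invariance term: embeddings of two views of the same observation should match.
    sim = F.mse_loss(z_a, z_b)
    # Variance term: keep the std of every embedding dimension above 1 (anti-collapse).
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    var = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))
    # Covariance term: decorrelate the embedding dimensions.
    n, d = z_a.shape
    cov_a = (z_a - z_a.mean(dim=0)).T @ (z_a - z_a.mean(dim=0)) / (n - 1)
    cov_b = (z_b - z_b.mean(dim=0)).T @ (z_b - z_b.mean(dim=0)) / (n - 1)
    cov = (off_diagonal(cov_a) ** 2).sum() / d + (off_diagonal(cov_b) ** 2).sum() / d
    return sim_w * sim + var_w * var + cov_w * cov


class TemporalOrderHead(nn.Module):
    # Binary classifier: do three embeddings come from observations in the correct temporal order?
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, z_t, z_t1, z_t2):
        return self.mlp(torch.cat([z_t, z_t1, z_t2], dim=-1))


def temporal_order_loss(head, z_t, z_t1, z_t2):
    # Positives keep the original frame order; negatives swap two frames to break it.
    pos = head(z_t, z_t1, z_t2)
    neg = head(z_t1, z_t, z_t2)
    logits = torch.cat([pos, neg], dim=0)
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)], dim=0)
    return F.binary_cross_entropy_with_logits(logits, labels)


# Hypothetical training step: `encoder` is a ViT, `projector` a VICReg-style expander,
# and (o_t, o_t1, o_t2) a batch of three consecutive ALE observations.
# z_t, z_t1, z_t2 = encoder(o_t), encoder(o_t1), encoder(o_t2)
# loss = vicreg_loss(projector(z_t), projector(z_t1)) + temporal_order_loss(head, z_t, z_t1, z_t2)

The point of the pairing in this sketch is that the VICReg terms alone only relate different views of individual observations, while the temporal head adds an explicit signal about how observations are ordered in time, which is the dimension the abstract argues matters for RL observations.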