Paper Title

Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Paper Authors

Zijian Gao, Kele Xu, Yuanzhao Zhai, Dawei Feng, Bo Ding, XinJun Mao, Huaimin Wang

Paper Abstract

Under sparse extrinsic reward settings, reinforcement learning has remained challenging, despite surging interest in this field. Previous attempts suggest that intrinsic reward can alleviate the issue caused by sparsity. In this article, we present a novel intrinsic reward that is inspired by human learning, as humans evaluate curiosity by comparing current observations with historical knowledge. Our method involves training a self-supervised prediction model, saving snapshots of the model parameters, and using the nuclear norm to evaluate the temporal inconsistency between the predictions of different snapshots as intrinsic rewards. We also propose a variational weighting mechanism to assign weights to different snapshots in an adaptive manner. Our experimental results on various benchmark environments demonstrate the efficacy of our method, which outperforms other intrinsic reward-based methods without additional training costs and with higher noise tolerance. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
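
The abstract only outlines the reward computation, so the following is a minimal, hypothetical sketch rather than the paper's exact formulation: keep K parameter snapshots of the self-supervised prediction model, query every snapshot on the same transition, and score their disagreement with the nuclear norm of the stacked predictions. The function name, the mean-centering step, and the scale factor below are assumptions; the paper's variational weighting mechanism over snapshots is not modeled here.

```python
import numpy as np

def temporal_inconsistency_reward(snapshot_predictions, scale=1.0):
    """Hypothetical sketch of a nuclear-norm intrinsic reward.

    snapshot_predictions: list of K prediction vectors (dimension d), one per
    saved model snapshot, all computed for the same transition.
    """
    # Stack the K snapshot predictions into a K x d matrix.
    preds = np.stack(snapshot_predictions, axis=0)
    # Center on the mean prediction so that perfect agreement across snapshots
    # yields the zero matrix (this centering step is an assumption).
    centered = preds - preds.mean(axis=0, keepdims=True)
    # The nuclear norm (sum of singular values) measures how spread out the
    # snapshot predictions are, i.e. their temporal inconsistency.
    return scale * np.linalg.norm(centered, ord="nuc")

# Toy usage: three snapshots predicting a 4-dimensional next-state feature.
snapshots = [
    np.array([0.1, 0.2, 0.3, 0.4]),
    np.array([0.1, 0.2, 0.3, 0.4]),   # agrees with the first snapshot
    np.array([0.5, 0.1, 0.0, 0.9]),   # disagrees, so the reward grows
]
r_int = temporal_inconsistency_reward(snapshots)  # added to the extrinsic reward
```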
