Paper Title

Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments

Paper Authors

Daniel Jarrett, Corentin Tallec, Florent Altché, Thomas Mesnard, Rémi Munos, Michal Valko

Paper Abstract

Consider the problem of exploration in sparse-reward or reward-free environments, such as in Montezuma's Revenge. In the curiosity-driven paradigm, the agent is rewarded for how much each realized outcome differs from their predicted outcome. But using predictive error as intrinsic motivation is fragile in stochastic environments, as the agent may become trapped by high-entropy areas of the state-action space, such as a "noisy TV". In this work, we study a natural solution derived from structural causal models of the world: Our key idea is to learn representations of the future that capture precisely the unpredictable aspects of each outcome -- which we use as additional input for predictions, such that intrinsic rewards only reflect the predictable aspects of world dynamics. First, we propose incorporating such hindsight representations into models to disentangle "noise" from "novelty", yielding Curiosity in Hindsight: a simple and scalable generalization of curiosity that is robust to stochasticity. Second, we instantiate this framework for the recently introduced BYOL-Explore algorithm as our prime example, resulting in the noise-robust BYOL-Hindsight. Third, we illustrate its behavior under a variety of different stochasticities in a grid world, and find improvements over BYOL-Explore in hard-exploration Atari games with sticky actions. Notably, we show state-of-the-art results in exploring Montezuma's Revenge with sticky actions, while preserving performance in the non-sticky setting.
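
To make the mechanism described in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a hindsight-conditioned world model. A hindsight encoder summarizes the realized next state into a small vector z, the predictor forecasts the next state from (s, a, z), and a simple KL capacity penalty on z stands in for the paper's actual representation-learning objective, so that z can only absorb the unpredictable ("noise") part of the outcome. The intrinsic reward is then the residual, hindsight-conditioned prediction error. All module names, dimensions, and the KL weight are illustrative assumptions.

```python
# Minimal sketch of the Curiosity-in-Hindsight idea (illustrative only), assuming:
#  - a predictor that forecasts the next state from (state, action, hindsight vector z),
#  - a hindsight encoder that sees the realized next state and emits a Gaussian over z,
#  - a KL penalty limiting how much of the outcome can leak through z, used here as a
#    stand-in for the paper's representation-learning objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, Z_DIM, HID = 32, 4, 8, 64  # illustrative sizes

class HindsightWorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Hindsight encoder: conditioned on the realized outcome s', outputs mean and
        # log-std of z, intended to capture only what (s, a) cannot predict.
        self.hindsight = nn.Sequential(nn.Linear(STATE_DIM * 2 + ACTION_DIM, HID),
                                       nn.ReLU(), nn.Linear(HID, 2 * Z_DIM))
        # Predictor: forecasts the next state from (s, a, z).
        self.predictor = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + Z_DIM, HID),
                                       nn.ReLU(), nn.Linear(HID, STATE_DIM))

    def forward(self, s, a, s_next):
        mu, log_std = self.hindsight(torch.cat([s, a, s_next], -1)).chunk(2, -1)
        z = mu + log_std.exp() * torch.randn_like(mu)         # reparameterized sample
        pred = self.predictor(torch.cat([s, a, z], -1))        # hindsight-conditioned prediction
        pred_err = F.mse_loss(pred, s_next, reduction="none").sum(-1)
        # KL(q(z | s, a, s') || N(0, I)): caps the information in z, so it absorbs
        # irreducible noise rather than copying the whole outcome.
        kl = 0.5 * (mu.pow(2) + (2 * log_std).exp() - 2 * log_std - 1).sum(-1)
        return pred_err, kl

# Illustrative usage on a random batch of transitions.
model = HindsightWorldModel()
s = torch.randn(16, STATE_DIM)
a = torch.randn(16, ACTION_DIM)
s_next = torch.randn(16, STATE_DIM)
pred_err, kl = model(s, a, s_next)
loss = (pred_err + 0.1 * kl).mean()   # model training loss; 0.1 is an illustrative weight
intrinsic_reward = pred_err.detach()  # residual error: predictable-but-not-yet-learned dynamics
```

Under these assumptions, a purely stochastic "noisy TV" transition yields little intrinsic reward (its variability is explained away by z), while genuinely novel but learnable dynamics still produce prediction error and hence exploration bonus.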
