Paper Title
Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework
Paper Authors
Paper Abstract
Exploration is essential for reinforcement learning (RL). To address the challenge of exploration, we consider a reward-free RL framework that completely separates exploration from exploitation and brings new challenges for exploration algorithms. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment and collects a dataset of transitions by executing the policy. In the planning phase, the agent computes a good policy for any reward function based on the dataset, without further interaction with the environment. This framework is suitable for the meta RL setting where there are many reward functions of interest. In the exploration phase, we propose to maximize the Rényi entropy over the state-action space and justify this objective theoretically. Using Rényi entropy as the objective succeeds because it encourages the agent to explore hard-to-reach state-actions. We further derive a policy gradient formulation for this objective and design a practical exploration algorithm that can handle complex environments. In the planning phase, we solve for good policies under arbitrary reward functions using a batch RL algorithm. Empirically, we show that our exploration algorithm is both effective and sample-efficient, and yields superior policies for arbitrary reward functions in the planning phase.
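For reference, the standard definition of the objective mentioned above (the abstract does not state which order α the paper uses): the Rényi entropy of order α of the state-action visitation distribution d^π induced by a policy π is

\[
H_\alpha\!\left(d^{\pi}\right) \;=\; \frac{1}{1-\alpha} \log \sum_{s,a} d^{\pi}(s,a)^{\alpha}, \qquad \alpha \ge 0,\ \alpha \neq 1,
\]

which recovers the Shannon entropy \(-\sum_{s,a} d^{\pi}(s,a)\log d^{\pi}(s,a)\) in the limit α → 1. Intuitively, choosing α < 1 makes the objective more sensitive to state-action pairs with small visitation probability, which is consistent with the stated motivation of encouraging exploration of hard-to-reach state-actions.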