Paper Title
The Value Equivalence Principle for Model-Based Reinforcement Learning
Paper Authors
Paper Abstract
Learning models of the environment from data is often viewed as an essential component to building intelligent reinforcement learning (RL) agents. The common practice is to separate the learning of the model from its use, by constructing a model of the environment's dynamics that correctly predicts the observed state transitions. In this paper we argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning. As our main contribution, we introduce the principle of value equivalence: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates. We propose a formulation of the model learning problem based on the value equivalence principle and analyze how the set of feasible solutions is impacted by the choice of policies and functions. Specifically, we show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks, until eventually collapsing to a single point corresponding to a model that perfectly describes the environment. In many problems, directly modelling state-to-state transitions may be both difficult and unnecessary. By leveraging the value-equivalence principle one may find simpler models without compromising performance, saving computation and memory. We illustrate the benefits of value-equivalent model learning with experiments comparing it against more traditional counterparts like maximum likelihood estimation. More generally, we argue that the principle of value equivalence underlies a number of recent empirical successes in RL, such as Value Iteration Networks, the Predictron, Value Prediction Networks, TreeQN, and MuZero, and provides a first theoretical underpinning of those results.
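To make the definition in the abstract concrete, the following is a minimal sketch in standard MDP notation; the operator symbol and the naming of the sets below are our own rendering, not a quotation from the paper. A model m = (r, P) induces a Bellman operator for every policy, and value equivalence asks that two models agree on the result of applying that operator to every function in the chosen set, under every policy in the chosen set:

```latex
(\mathcal{T}^{m}_{\pi} v)(s) \;=\; \sum_{a} \pi(a \mid s)\Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \Big],
\qquad
\tilde{m} \equiv_{\Pi,\mathcal{V}} m^{*}
\;\iff\;
\mathcal{T}^{\tilde{m}}_{\pi} v = \mathcal{T}^{m^{*}}_{\pi} v
\;\;\text{for all } \pi \in \Pi,\ v \in \mathcal{V}.
```

As the abstract notes, enlarging the sets of policies and functions only shrinks the class of value-equivalent models, until with all policies and all functions only a model that perfectly describes the environment remains.

For a sense of how this can drive model learning, below is a small, hypothetical NumPy sketch (not the paper's code; the toy MDP, the hand-picked sets of policies and value functions, and names such as ve_loss are illustrative assumptions). It scores a candidate model by how far its Bellman updates deviate from those of the environment over the chosen sets, which is the quantity a value-equivalent learner would minimize, whereas maximum likelihood estimation would instead fit the observed transition probabilities directly, regardless of how the model is later used for planning.

```python
# Hypothetical sketch of a value-equivalence loss on a toy tabular MDP.
# Not the paper's code; the setup and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 2        # number of states and actions in the toy MDP
gamma = 0.9        # discount factor

# "True" environment: transition kernel P_true[a, s, s'] and reward r_true[s, a].
P_true = rng.dirichlet(np.ones(S), size=(A, S))   # shape (A, S, S), rows sum to 1
r_true = rng.uniform(size=(S, A))

def bellman_update(P, r, pi, v):
    """Apply the Bellman operator of model (r, P) under policy pi to function v."""
    # q[s, a] = r(s, a) + gamma * sum_{s'} P(s' | s, a) * v(s')
    q = r + gamma * np.einsum('ast,t->sa', P, v)
    return np.einsum('sa,sa->s', pi, q)            # expectation over pi(a | s)

# The sets Pi and V that define value equivalence (small and hand-picked here).
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(3)]   # pi[s, a]
values = [rng.normal(size=S) for _ in range(3)]

def ve_loss(P_model, r_model):
    """Squared mismatch of Bellman updates over Pi x V; zero when value equivalent."""
    return sum(
        np.sum((bellman_update(P_model, r_model, pi, v)
                - bellman_update(P_true, r_true, pi, v)) ** 2)
        for pi in policies for v in values
    )

# A maximum-likelihood learner would instead fit P_model to transition counts,
# irrespective of how the model will later be used for value-based planning.
print(ve_loss(P_true, r_true))   # 0.0: the true model is trivially value equivalent
```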