Paper Title
Variational Inference for Model-Free and Model-Based Reinforcement Learning
Paper Authors
Paper Abstract
Variational inference (VI) is a specific type of approximate Bayesian inference that approximates an intractable posterior distribution with a tractable one. VI casts the inference problem as an optimization problem; more specifically, the goal is to maximize a lower bound on the logarithm of the marginal likelihood with respect to the parameters of the approximate posterior. Reinforcement learning (RL), on the other hand, deals with autonomous agents and how to make them act optimally so as to maximize some notion of expected future cumulative reward. In the non-sequential setting, where agents' actions do not have an impact on future states of the environment, RL is covered by contextual bandits and Bayesian optimization. In a proper sequential scenario, however, where agents' actions affect future states, instantaneous rewards need to be carefully traded off against potential long-term rewards. This manuscript shows how the apparently different subjects of VI and RL are linked in two fundamental ways. First, the RL objective of maximizing future cumulative rewards can be recovered via a VI objective under a soft policy constraint, in both the non-sequential and the sequential setting. This policy constraint is not merely artificial but has proven to be a useful regularizer in many RL tasks, yielding significant improvements in agent performance. Second, in model-based RL, where agents aim to learn about the environment they are operating in, the model-learning part can be naturally phrased as an inference problem over the process that governs the environment dynamics. We distinguish between two scenarios for the latter: VI when environment states are fully observable by the agent, and VI when they are only partially observable through an observation distribution.
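For reference, a minimal sketch of the VI objective the abstract alludes to (the evidence lower bound), written in generic notation with data x, latent variables z, and an approximate posterior q_phi(z); this is illustrative notation, not necessarily the manuscript's own:

\log p(x) \;\ge\; \mathbb{E}_{q_\phi(z)}\big[\log p(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big) \;=:\; \mathcal{L}(\phi),

and VI maximizes \mathcal{L}(\phi) with respect to the variational parameters \phi of the approximate posterior.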
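The "soft policy constraint" mentioned above typically takes the form of a KL penalty toward a reference policy (reducing to a maximum-entropy objective when the reference is uniform). A sketch under the assumption of a discounted sequential setting with reward r, policy \pi, reference policy \pi_0, and temperature 1/\eta, again in illustrative rather than the manuscript's exact notation:

J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t\ge 0} \gamma^{t}\Big(r(s_t, a_t) \;-\; \tfrac{1}{\eta}\,\mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\big)\Big)\right],

which recovers the ordinary expected cumulative reward objective as the penalty weight 1/\eta vanishes.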