Paper Title
Imitating Past Successes can be Very Suboptimal
Paper Authors
Paper Abstract
Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we formally relate outcome-conditioned imitation learning to reward maximization, drawing a precise relationship between the learned policy and Q-values and explaining the close connections between these methods and prior EM-based policy search methods. This analysis shows that existing outcome-conditioned imitation learning methods do not necessarily improve the policy, but a simple modification results in a method that does guarantee policy improvement, under some assumptions.
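To make the strategy in the first sentence concrete, below is a minimal, illustrative sketch of the relabel-then-imitate idea (hindsight relabeling followed by outcome-conditioned behavioral cloning). It is not the paper's proposed modification or its analysis; the toy dimensions, the random placeholder rollouts, and the name collect_trajectory are assumptions made only for illustration.

# Minimal sketch of outcome-conditioned imitation learning:
# relabel each trajectory with the outcome it actually achieved,
# then imitate the relabeled experience with supervised learning.
# This illustrates the general strategy described in the abstract,
# not the paper's specific method.
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM, ACTION_DIM = 4, 4, 2  # toy sizes (assumed)

# Outcome-conditioned policy pi(a | s, g): takes the state and the
# (relabeled) outcome as input and predicts an action.
policy = nn.Sequential(
    nn.Linear(STATE_DIM + GOAL_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def collect_trajectory(horizon=50):
    """Placeholder rollout returning random (state, action) pairs.
    In practice these would come from running the current policy
    in an environment."""
    states = torch.randn(horizon, STATE_DIM)
    actions = torch.randn(horizon, ACTION_DIM)
    return states, actions

for iteration in range(100):
    states, actions = collect_trajectory()

    # Hindsight relabeling: treat the outcome actually reached at the
    # end of the trajectory as if it had been the commanded goal.
    achieved_outcome = states[-1, :GOAL_DIM]
    goals = achieved_outcome.expand(states.shape[0], GOAL_DIM)

    # Imitate the relabeled experience: ordinary supervised regression
    # of actions given (state, achieved outcome), with no reward signal.
    predicted_actions = policy(torch.cat([states, goals], dim=-1))
    loss = ((predicted_actions - actions) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Because every step reduces to supervised regression on relabeled data, the procedure needs no reward signal, which is precisely why the abstract asks how it relates to reward maximization and to Q-values.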