Paper Title
Regret Bounds for Discounted MDPs
Paper Authors
Paper Abstract
Reinforcement learning (RL) has traditionally been understood from an episodic perspective; the concept of non-episodic RL, where there is no restart and therefore no reliable recovery, remains elusive. A fundamental question in non-episodic RL is how to measure the performance of a learner and how to derive algorithms that maximize such performance. Conventional wisdom is to maximize the difference between the average reward received by the learner and the maximal long-term average reward. In this paper, we argue that if the total time budget is limited relative to the complexity of the environment, such a comparison may fail to reflect the finite-time optimality of the learner. We propose a family of measures, called $γ$-regret, which we believe better captures finite-time optimality. We give motivation for, and derive lower and upper bounds on, such measures. Note: a follow-up work (arXiv:2010.00587) has improved both our lower and upper bounds; the gap is now closed at $\tilde{\Theta}\left(\frac{\sqrt{SAT}}{(1-γ)^{\frac{1}{2}}}\right)$.
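For context, the two performance measures contrasted above can be sketched as follows. This is only an illustrative formalization under assumed notation (optimal long-term average reward $ρ^*$, $γ$-discounted value functions $V_γ$, visited states $s_t$, learner's policy $π_t$ at step $t$), not necessarily the exact definitions used in the paper.

% Illustrative sketch; notation assumed, not taken verbatim from the paper.
% Conventional average-reward regret, measured against the optimal gain ρ*:
$$\mathrm{Regret}(T) \;=\; T\,ρ^{*} \;-\; \sum_{t=1}^{T} r_t$$
% A γ-discounted analogue, comparing the learner's policy at each step to the
% γ-discounted optimum from the currently visited state:
$$\mathrm{Regret}_{γ}(T) \;=\; \sum_{t=1}^{T} \Bigl( V^{*}_{γ}(s_t) \;-\; V^{π_t}_{γ}(s_t) \Bigr)$$

One reading of the discounted form is that it only credits value achievable within an effective horizon of roughly $\frac{1}{1-γ}$ steps, which is one way such a measure can better reflect finite-time optimality when $T$ is small relative to the environment's complexity.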