Paper Title

Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

Paper Authors

Zihan Zhang, Yuan Zhou, Xiangyang Ji

Paper Abstract

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret, where $T = KH$ and $K$ is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information-theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].
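For context, the abstract's comparison with prior work can be read as a chain of bounds. The $\tilde{O}(\sqrt{H^3SAT})$ figure for the best earlier model-free result (Q-learning with a Bernstein-style bonus) is recalled from [Jin et al., 2018] and is not restated in the abstract itself:

\[
\underbrace{\Omega\big(\sqrt{H^2SAT}\big)}_{\text{lower bound}} \;\le\; \underbrace{\tilde{O}\big(\sqrt{H^2SAT}\big)}_{\text{UCB-Advantage (this paper)}} \;\le\; \underbrace{\tilde{O}\big(\sqrt{H^3SAT}\big)}_{\text{[Jin et al., 2018]}}.
\]

The sketch below shows the optimistic, model-free tabular Q-learning template of [Jin et al., 2018] that this line of work builds on, run on a hypothetical toy MDP. It is not the UCB-Advantage algorithm itself: the reference-advantage decomposition and its variance-reduced bonus are not specified in the abstract and are not implemented here. The function name, the bonus scale c, and the fixed initial state are illustrative assumptions.

# Minimal sketch (assumed toy setup) of optimistic model-free Q-learning
# in the style of [Jin et al., 2018]; NOT the paper's UCB-Advantage update.
import numpy as np

def ucb_hoeffding_q_learning(P, R, H, K, c=1.0, seed=0):
    """Tabular episodic Q-learning with a Hoeffding-style exploration bonus.

    P: transition kernel, shape (S, A, S); R: rewards in [0, 1], shape (S, A).
    H: episode length; K: number of episodes; c: bonus scale (assumed constant).
    Returns the total reward collected over the K episodes.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    iota = np.log(S * A * H * K + 1.0)        # logarithmic factor in the bonus
    Q = np.full((H, S, A), float(H))          # optimistic initialization
    V = np.zeros((H + 1, S))
    V[:H] = H                                 # V_{H+1} = 0, earlier steps optimistic
    N = np.zeros((H, S, A), dtype=int)        # per-step visit counts
    total_reward = 0.0

    for _ in range(K):
        s = 0                                 # fixed initial state (assumption)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))       # act greedily w.r.t. optimistic Q
            r = R[s, a]
            s_next = int(rng.choice(S, p=P[s, a]))
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1.0) / (H + t)       # learning rate from Jin et al.
            bonus = c * np.sqrt(H ** 3 * iota / t)
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V[h + 1, s_next] + bonus)
            V[h, s] = min(H, Q[h, s].max())
            total_reward += r
            s = s_next
    return total_reward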
