Paper Title

Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

Paper Authors

Zihan Zhang, Yuan Zhou, Xiangyang Ji

Paper Abstract

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret, where $T = KH$ and $K$ is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information-theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].
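For context, the abstract's comparison with prior work can be read as a chain of bounds. The $\tilde{O}(\sqrt{H^3SAT})$ figure for the best earlier model-free result (Q-learning with a Bernstein-style bonus) is recalled from [Jin et al., 2018] and is not restated in the abstract itself:

\[
\underbrace{\Omega\big(\sqrt{H^2SAT}\big)}_{\text{lower bound}} \;\le\; \underbrace{\tilde{O}\big(\sqrt{H^2SAT}\big)}_{\text{UCB-Advantage (this paper)}} \;\le\; \underbrace{\tilde{O}\big(\sqrt{H^3SAT}\big)}_{\text{[Jin et al., 2018]}}.
\]

The sketch below shows the optimistic, model-free tabular Q-learning template of [Jin et al., 2018] that this line of work builds on, run on a hypothetical toy MDP. It is not the UCB-Advantage algorithm itself: the reference-advantage decomposition and its variance-reduced bonus are not specified in the abstract and are not implemented here. The function name, the bonus scale c, and the fixed initial state are illustrative assumptions.

# Minimal sketch (assumed toy setup) of optimistic model-free Q-learning
# in the style of [Jin et al., 2018]; NOT the paper's UCB-Advantage update.
import numpy as np

def ucb_hoeffding_q_learning(P, R, H, K, c=1.0, seed=0):
    """Tabular episodic Q-learning with a Hoeffding-style exploration bonus.

    P: transition kernel, shape (S, A, S); R: rewards in [0, 1], shape (S, A).
    H: episode length; K: number of episodes; c: bonus scale (assumed constant).
    Returns the total reward collected over the K episodes.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    iota = np.log(S * A * H * K + 1.0)        # logarithmic factor in the bonus
    Q = np.full((H, S, A), float(H))          # optimistic initialization
    V = np.zeros((H + 1, S))
    V[:H] = H                                 # V_{H+1} = 0, earlier steps optimistic
    N = np.zeros((H, S, A), dtype=int)        # per-step visit counts
    total_reward = 0.0

    for _ in range(K):
        s = 0                                 # fixed initial state (assumption)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))       # act greedily w.r.t. optimistic Q
            r = R[s, a]
            s_next = int(rng.choice(S, p=P[s, a]))
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1.0) / (H + t)       # learning rate from Jin et al.
            bonus = c * np.sqrt(H ** 3 * iota / t)
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V[h + 1, s_next] + bonus)
            V[h, s] = min(H, Q[h, s].max())
            total_reward += r
            s = s_next
    return total_reward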
