Paper Title
Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon
Paper Authors
Paper Abstract
Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to the long planning horizon and unknown state-dependent transitions. The current paper shows that, in terms of sample complexity, the long planning horizon and the unknown state-dependent transitions pose (at most) little additional difficulty. We consider episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, total reward bounded by $1$, and an agent that plays for $K$ episodes. We propose a new algorithm, \textbf{M}onotonic \textbf{V}alue \textbf{P}ropagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter since it is based on a well-designed monotonic value function. In particular, the \emph{constants} in the bonus must be set subtly to ensure optimism and monotonicity. We show that MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \operatorname{poly}\log\left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms. Notably, this result 1) \emph{exponentially} improves the state-of-the-art polynomial-time algorithms of Dann et al. [2019] and Zanette et al. [2019] in terms of the dependency on $H$, and 2) \emph{exponentially} improves the running time of Wang et al. [2020] and significantly improves the dependency on $S$, $A$, and $K$ in sample complexity.
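The abstract describes MVP as optimistic value iteration driven by a Bernstein-type bonus whose constants are tuned so that the estimated value functions stay optimistic and monotone. The sketch below is a minimal, generic illustration of that shape on a tabular finite-horizon MDP; the function name `optimistic_planning`, the specific bonus constants, and the clipping of values to $[0, 1]$ are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch (not the paper's exact algorithm) of optimistic value iteration
# with a Bernstein-style exploration bonus on a tabular finite-horizon MDP.
# The bonus constants and the clipping to [0, 1] are placeholder assumptions;
# MVP's monotone bonus design and constants are specified in the paper.
import numpy as np

def optimistic_planning(counts, reward_sum, trans_counts, H, S, A, delta=0.01):
    """Compute optimistic Q-values from empirical counts.

    counts[s, a]       : number of visits to (s, a)
    reward_sum[s, a]   : cumulative observed reward at (s, a)
    trans_counts[s, a] : length-S vector of observed next-state counts
    """
    n = np.maximum(counts, 1)                       # avoid division by zero
    r_hat = reward_sum / n                          # empirical mean reward
    p_hat = trans_counts / n[..., None]             # empirical transitions, shape (S, A, S)

    log_term = np.log(S * A * H / delta)
    V = np.zeros(S)                                 # V_{H+1} = 0
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        pv = p_hat @ V                              # expected next-step value, shape (S, A)
        var = np.maximum(p_hat @ (V ** 2) - pv ** 2, 0.0)  # empirical variance of V
        # Bernstein-style bonus: variance term plus a lower-order term
        bonus = np.sqrt(2 * var * log_term / n) + 5 * log_term / n
        Q[h] = np.minimum(r_hat + pv + bonus, 1.0)  # total reward bounded by 1
        V = Q[h].max(axis=1)                        # optimistic value at step h
    return Q
```

In a full episodic loop one would act greedily with respect to the returned Q-values, update the counts with the observed transitions and rewards, and replan; this interaction loop is omitted here for brevity.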