Paper Title
$Q$-learning with Logarithmic Regret
Paper Authors
Paper Abstract
This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning algorithm studied in [Jin et al. 2018] enjoys an $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of $S$, $A$, and $T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
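To make the algorithm concrete, below is a minimal sketch of episodic optimistic $Q$-learning with a Hoeffding-style exploration bonus, in the spirit of [Jin et al. 2018]. The environment interface `env` (with `reset()` and `step()`), the bonus constant `c`, and the confidence parameter `delta` are illustrative assumptions and not taken from the paper itself.

```python
import numpy as np

def optimistic_q_learning(env, S, A, H, K, c=1.0, delta=0.1):
    """Sketch of optimistic Q-learning with a Hoeffding-style bonus.

    Assumed episodic interface (hypothetical, for illustration only):
      env.reset()  -> initial state index
      env.step(a)  -> (next_state_index, reward)
    S, A: number of states and actions; H: horizon; K: number of episodes.
    """
    T = K * H                                # total number of steps
    iota = np.log(S * A * T / delta)         # log factor in the bonus
    Q = np.full((H, S, A), float(H))         # optimistic initialization at H
    V = np.zeros((H + 1, S))                 # V[H] = 0 by convention
    N = np.zeros((H, S, A), dtype=int)       # visit counts

    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))      # act greedily w.r.t. optimistic Q
            s_next, r = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)        # learning-rate schedule from Jin et al.
            bonus = c * np.sqrt(H**3 * iota / t)  # Hoeffding-style exploration bonus
            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(H, Q[h, s].max())  # clip so V stays bounded by H
            s = s_next
    return Q
```

The optimistic initialization and the bonus keep each $Q$-estimate an upper bound on the optimal $Q$-function with high probability, so under-explored actions look attractive; once an action's estimate falls more than $\Delta_{\min}$ below the optimum it is no longer selected, which is the intuition behind the gap-dependent logarithmic regret.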