Paper Title

On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces

Authors

Zhuoran Yang, Chi Jin, Zhaoran Wang, Mengdi Wang, Michael I. Jordan

Abstract

The classical theory of reinforcement learning (RL) has focused on tabular and linear representations of value functions. Further progress hinges on combining RL with modern function approximators such as kernel functions and deep neural networks, and indeed there have been many empirical successes that have exploited such combinations in large-scale applications. There are profound challenges, however, in developing a theory to support this enterprise, most notably the need to take into consideration the exploration-exploitation tradeoff at the core of RL in conjunction with the computational and statistical tradeoffs that arise in modern function-approximation-based learning systems. We approach these challenges by studying an optimistic modification of the least-squares value iteration algorithm, in the context of the action-value function represented by a kernel function or an overparameterized neural network. We establish both polynomial runtime complexity and polynomial sample complexity for this algorithm, without additional assumptions on the data-generating model. In particular, we prove that the algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states, a result which exhibits clearly the benefit of function approximation in RL.
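The optimistic value-iteration step described in the abstract can be pictured concretely: at each step, fit a kernel ridge regression to the observed Bellman targets and add a bonus proportional to the regression's uncertainty at the queried (state, action) pair. The Python sketch below illustrates one such step under illustrative choices (an RBF kernel, a regularizer `lam`, and a bonus multiplier `beta` that in the paper's analysis would come from a confidence-bound argument); it is a minimal sketch, not the authors' exact algorithm.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """RBF kernel matrix between the rows of X and the rows of Y (illustrative kernel choice)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def optimistic_q_values(X_past, y_past, X_query, lam=1.0, beta=1.0, H=10):
    """Kernel ridge-regression estimate of Q plus an exploration bonus (optimism).

    X_past  : features of previously visited (state, action) pairs, shape (t, d)
    y_past  : regression targets r + max_a' Q_{h+1}(s', a'), shape (t,)
    X_query : (state, action) pairs to evaluate, shape (m, d)
    beta    : bonus multiplier (a tunable knob here; a confidence bound in the analysis)
    """
    K = rbf_kernel(X_past, X_past)             # Gram matrix over past data
    k_q = rbf_kernel(X_past, X_query)          # kernel vectors k(z_i, z_query)
    A = K + lam * np.eye(len(X_past))          # regularized Gram matrix
    mean = k_q.T @ np.linalg.solve(A, y_past)  # kernel ridge-regression mean
    # Posterior-variance-style width: k(z, z) - k(z)^T (K + lam I)^{-1} k(z);
    # the RBF kernel has k(z, z) = 1.
    A_inv_kq = np.linalg.solve(A, k_q)
    width = 1.0 - np.einsum('ij,ij->j', k_q, A_inv_kq)
    bonus = beta * np.sqrt(np.maximum(width, 0.0))
    return np.clip(mean + bonus, 0.0, float(H))  # truncate to [0, H], as in value iteration
```

In each episode the agent would act greedily with respect to this optimistic estimate; the bonus shrinks as data accumulate near a query point, which is what drives exploration toward poorly covered regions of the (state, action) space.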
