Paper Title

Provable Reinforcement Learning with a Short-Term Memory

Paper Authors

Yonathan Efroni, Chi Jin, Akshay Krishnamurthy, Sobhan Miryoosefi

Paper Abstract

Real-world sequential decision making problems commonly involve partial observability, which requires the agent to maintain a memory of history in order to infer the latent states, plan and make good decisions. Coping with partial observability in general is extremely challenging, as a number of worst-case statistical and computational barriers are known in learning Partially Observable Markov Decision Processes (POMDPs). Motivated by the problem structure in several physical applications, as well as a commonly used technique known as "frame stacking", this paper proposes to study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$. We establish a set of upper and lower bounds on the sample complexity for learning near-optimal policies for this class of problems in both tabular and rich-observation settings (where the number of observations is enormous). In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon, and is independent of the number of observations. Our results show that a short-term memory suffices for reinforcement learning in these environments.
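To make the "frame stacking" idea referenced in the abstract concrete, the sketch below keeps a sliding window of the $m$ most recent observations and lets the policy act on that window alone, so the short-term memory plays the role of the decodable latent state. This is only an informal illustration: the environment and policy interfaces (env.reset, env.step returning an (observation, reward, done) tuple, and a policy callable) are hypothetical stand-ins, not APIs from the paper.

```python
from collections import deque

def frame_stack_rollout(env, policy, m, horizon):
    """Run one episode while feeding the policy only the last m observations
    ("frame stacking"). Hypothetical env/policy interfaces for illustration."""
    obs = env.reset()
    # Pad the window with the initial observation until m real frames exist.
    window = deque([obs] * m, maxlen=m)
    total_reward = 0.0
    for _ in range(horizon):
        stacked = tuple(window)      # short-term memory of length m
        action = policy(stacked)     # policy conditions only on the window
        obs, reward, done = env.step(action)
        window.append(obs)           # oldest frame is dropped automatically
        total_reward += reward
        if done:
            break
    return total_reward
```

Because the deque has maxlen=m, appending a new observation automatically evicts the oldest one, which mirrors how the agent's effective state depends only on the most recent length-$m$ history rather than the full trajectory.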
