Paper Title

Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Authors

Lingxiao Wang, Qi Cai, Zhuoran Yang, Zhaoran Wang

Abstract

Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
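
To make the phrase "low-rank structure in the transition kernel" concrete, the sketch below builds a toy finite transition kernel as a product of two small factors and checks that its rank is bounded by the number of latent factors. This is only an illustration under simplifying assumptions: the abstract concerns continuous observation and state spaces, whereas this example is tabular, and it is not the ETC algorithm. All names (`low_rank_kernel`, `random_simplex_row`, `matrix_rank`, `phi`, `psi`) are hypothetical and chosen for this example.

```python
import random
from fractions import Fraction


def random_simplex_row(n, rng):
    """Return n positive Fractions that sum to exactly 1."""
    weights = [rng.randint(1, 9) for _ in range(n)]
    total = sum(weights)
    return [Fraction(w, total) for w in weights]


def low_rank_kernel(num_states, num_actions, rank, seed=0):
    """Toy kernel: P[(s, a)][s'] = sum_k phi[(s, a)][k] * psi[k][s']."""
    rng = random.Random(seed)
    # phi: one mixing vector over `rank` latent factors per (state, action) pair.
    phi = [random_simplex_row(rank, rng) for _ in range(num_states * num_actions)]
    # psi: one next-state distribution per latent factor.
    psi = [random_simplex_row(num_states, rng) for _ in range(rank)]
    kernel = [
        [sum(row[k] * psi[k][s_next] for k in range(rank))
         for s_next in range(num_states)]
        for row in phi
    ]
    return phi, psi, kernel


def matrix_rank(matrix):
    """Exact rank via Gaussian elimination over Fractions."""
    rows = [list(r) for r in matrix]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank


# 6 states, 2 actions, 3 latent factors: a 12 x 6 kernel of rank at most 3.
phi, psi, kernel = low_rank_kernel(num_states=6, num_actions=2, rank=3)
```

Each row of `kernel` is a valid next-state distribution, yet the whole 12 x 6 table is determined by the much smaller factors `phi` (12 x 3) and `psi` (3 x 6). In this toy setting, the number of latent factors plays the role that the abstract calls the intrinsic dimension.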
