Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Paper Title

Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Authors

Lingxiao Wang, Qi Cai, Zhuoran Yang, Zhaoran Wang

Abstract

Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
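The "low-rank structure in the transition kernel" can be made concrete with a small numerical sketch. The snippet below is not the ETC algorithm. It only builds a toy kernel of the form P(s' | s, a) = sum_k phi(s, a)[k] * psi(k)[s'] and checks that its rank is bounded by the number of latent factors. The finite state space, the Dirichlet sampling, and all names (`low_rank_transition`, `phi`, `psi`) are assumptions made for illustration, whereas the abstract concerns infinite observation and state spaces.

```python
import numpy as np


def low_rank_transition(num_states, num_actions, rank, seed=0):
    """Build a toy transition kernel with a rank-`rank` factorization.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # phi[s, a] is a probability vector over `rank` latent factors
    # (the low-dimensional per-step feature in this toy setting).
    phi = rng.dirichlet(np.ones(rank), size=(num_states, num_actions))
    # psi[k] is a distribution over next states for latent factor k.
    psi = rng.dirichlet(np.ones(num_states), size=rank)
    # kernel[s, a, s'] = sum_k phi[s, a, k] * psi[k, s']
    kernel = phi @ psi
    return phi, psi, kernel


phi, psi, kernel = low_rank_transition(num_states=20, num_actions=3, rank=4)

# Flatten (s, a) into rows so the kernel becomes a (60, 20) matrix.
flat = kernel.reshape(-1, kernel.shape[-1])
print("kernel shape:", kernel.shape)
print("matrix rank:", np.linalg.matrix_rank(flat))
print("rows are distributions:", np.allclose(flat.sum(axis=1), 1.0))
```

In this toy setting the flattened kernel has 60 rows and 20 columns, yet its rank cannot exceed 4, which is the sense in which the intrinsic dimension (the rank) can be much smaller than the extrinsic dimension.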
