Paper Title
Bellman Residual Orthogonalization for Offline Reinforcement Learning
Paper Authors
Paper Abstract
We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions. Focusing on applications to model-free offline RL with function approximation, we exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality for our policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy. Different choices of test function spaces allow us to tackle different problems within a common framework. We characterize the loss of efficiency in moving from on-policy to off-policy data using our procedures, and establish connections to concentrability coefficients studied in past work. We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomial-time implementations even when Bellman closure does not hold.
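As a rough illustrative sketch (not reproduced from the paper, which should be consulted for the precise formulation), the test-function principle described in the abstract can be written as requiring the Bellman residual of a candidate value function to be orthogonal to every test function in a user-chosen class. The symbols below are notational assumptions: offline data distribution \mu, discount factor \gamma, candidate value function Q, target policy \pi, and test class \mathcal{F}.

% Minimal sketch of a Bellman residual orthogonality constraint,
% under the notational assumptions stated above.
\[
  \mathbb{E}_{(s,a,r,s') \sim \mu}
  \Bigl[ f(s,a)\,\bigl( r + \gamma\, Q\bigl(s', \pi(s')\bigr) - Q(s,a) \bigr) \Bigr] = 0
  \qquad \text{for all } f \in \mathcal{F}.
\]

Under this reading, richer test classes \mathcal{F} impose the Bellman equations more stringently, while smaller classes yield weaker constraints; the abstract's confidence intervals and oracle inequality quantify the resulting trade-off for off-policy evaluation and policy optimization.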