Paper Title
Approximate Policy Iteration with Bisimulation Metrics
Paper Authors
Paper Abstract
Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Due to this property, they provide theoretical guarantees in value function approximation (VFA). In this work we first prove that bisimulation and $\pi$-bisimulation metrics can be defined via a more general class of Sinkhorn distances, which unifies various state similarity metrics used in recent work. Then we describe an approximate policy iteration (API) procedure that uses a bisimulation-based discretization of the state space for VFA and prove asymptotic performance bounds. Next, we bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. We discuss how such API procedures map onto practical actor-critic methods that use bisimulation metrics for state representation learning. Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs.
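For reference, the $\pi$-bisimulation metric mentioned in the abstract is typically characterized as the unique fixed point of an operator that compares expected immediate rewards and next-state distributions under the Wasserstein distance:

$$d^{\pi}(s, s') \;=\; \bigl|\, r^{\pi}_{s} - r^{\pi}_{s'} \,\bigr| \;+\; \gamma\, \mathcal{W}_{1}\!\left(d^{\pi}\right)\!\bigl(P^{\pi}_{s},\, P^{\pi}_{s'}\bigr),$$

where $r^{\pi}_{s}$ and $P^{\pi}_{s}$ denote the expected reward and next-state distribution at state $s$ under policy $\pi$, and $\mathcal{W}_{1}(d)$ is the 1-Wasserstein distance with ground metric $d$. The sketch below is a minimal, illustrative fixed-point computation of this metric for a finite MDP, assuming tabular arrays R (rewards), P (transitions), and pi (policy); it is not the paper's implementation, and the exact-LP Wasserstein solver is only one possible choice.

    # Illustrative sketch (not the paper's code): fixed-point computation of the
    # pi-bisimulation metric d(s, s') = |r_pi(s) - r_pi(s')|
    #                                   + gamma * W1(d)(P_pi(.|s), P_pi(.|s'))
    # on a finite MDP. Array names R, P, pi, and gamma are assumed inputs.
    import numpy as np
    from scipy.optimize import linprog

    def wasserstein_1(mu, nu, d):
        """Exact 1-Wasserstein distance between discrete distributions mu and nu
        over n states, with ground metric d (n x n), via the transport LP."""
        n = len(mu)
        c = d.reshape(-1)                      # cost of moving mass from i to j
        A_eq = np.zeros((2 * n, n * n))
        for i in range(n):
            A_eq[i, i * n:(i + 1) * n] = 1.0   # row sums of the coupling equal mu
            A_eq[n + i, i::n] = 1.0            # column sums of the coupling equal nu
        b_eq = np.concatenate([mu, nu])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        return res.fun

    def pi_bisimulation_metric(R, P, pi, gamma, n_iter=50):
        """R: (S, A) rewards, P: (S, A, S) transitions, pi: (S, A) policy."""
        r_pi = (pi * R).sum(axis=1)                # expected reward under pi
        P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel under pi
        n = R.shape[0]
        d = np.zeros((n, n))
        for _ in range(n_iter):                    # iterate the metric operator
            d_new = np.zeros_like(d)
            for s in range(n):
                for t in range(s + 1, n):
                    w = wasserstein_1(P_pi[s], P_pi[t], d)
                    d_new[s, t] = d_new[t, s] = abs(r_pi[s] - r_pi[t]) + gamma * w
            d = d_new
        return d

In a bisimulation-based API procedure of the kind described in the abstract, such a metric could be used to aggregate states that lie within a chosen metric radius and to fit the value function on the aggregated states; the specific discretization scheme and the conservative API($\alpha$) update are detailed in the paper itself.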