Paper Title

Geometric Policy Iteration for Markov Decision Processes

Paper Authors

Wu, Yue, De Loera, Jesús A.

Paper Abstract

Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDP) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at a faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values which is more flexible and advantageous compared to traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound $\mathcal{O}\left(\frac{|\mathcal{A}|}{1 - \gamma}\log \frac{1}{1-\gamma}\right)$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
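
Below is a minimal, hedged sketch of the single-state update described in the abstract, written for a tabular MDP with an assumed transition tensor `P[a, s, s']` and reward matrix `R[s, a]` (illustrative names and array layouts, not the paper's interface). For each state it tries switching the action and immediately recomputes that state's value as the one-dimensional fixed point obtained by holding all other state values fixed; this is one plausible reading of the update rule sketched above, not the authors' exact algorithm.

```python
# Illustrative GPI-style sweep on a tabular MDP (a sketch, not the authors' code).
# P[a, s, s2]: probability of moving s -> s2 under action a; R[s, a]: expected reward.
import numpy as np

def gpi_style_sweep(P, R, gamma, n_sweeps=100, tol=1e-8):
    """One possible reading of the single-state switch-and-update rule."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                     # current value estimate
    policy = np.zeros(n_states, dtype=int)     # current deterministic policy
    for _ in range(n_sweeps):
        max_change = 0.0
        for s in range(n_states):              # asynchronous, one state at a time
            best_v, best_a = -np.inf, policy[s]
            for a in range(n_actions):
                # Holding V fixed at every other state, the self-consistent value
                # of s under action a solves a one-dimensional fixed point:
                #   v = R[s,a] + gamma * (P[a,s,s] * v + sum_{s2 != s} P[a,s,s2] * V[s2])
                other = P[a, s] @ V - P[a, s, s] * V[s]
                v = (R[s, a] + gamma * other) / (1.0 - gamma * P[a, s, s])
                if v > best_v:
                    best_v, best_a = v, a
            max_change = max(max_change, abs(best_v - V[s]))
            V[s], policy[s] = best_v, best_a   # immediate value update after the switch
        if max_change < tol:
            break
    return policy, V
```

In contrast to classic policy iteration, which re-evaluates the whole policy only after a full round of policy improvement, this sketch refreshes each state's value immediately after its action switch, mirroring the asynchronous flavor the abstract attributes to GPI.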
