Paper Title

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Authors

Ding, Dongsheng, Zhang, Kaiqing, Duan, Jiali, Başar, Tamer, Jovanović, Mihailo R.

Abstract

We study sequential decision making problems aimed at maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization we prove that our method achieves global convergence with sublinear rates in terms of both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. Finally, we use computational experiments to showcase the merits and the effectiveness of our approach.
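The abstract describes the two NPG-PD updates: natural policy gradient ascent on the primal (policy) variable and projected sub-gradient descent on the dual variable. Below is a minimal Python sketch of such a primal-dual loop on a small tabular constrained MDP with a softmax policy; the random model, step sizes, and names (P, r, g, b, eta_theta, eta_lam, lam_max) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of an NPG-PD-style loop on a small tabular constrained MDP.
# Assumptions (not from the paper's code): known transition model, exact policy
# evaluation, softmax policy parametrization, illustrative hyperparameters.

np.random.seed(0)
nS, nA, gamma = 5, 3, 0.9
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
r = np.random.rand(nS, nA)                            # reward table
g = np.random.rand(nS, nA)                            # utility table (constraint signal)
b = 0.5 / (1 - gamma)                                 # constraint threshold: V_g(rho) >= b
rho = np.ones(nS) / nS                                # initial state distribution

def policy_eval(pi, c):
    """Exact V and Q for payoff table c under policy pi (linear-system solve)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    c_pi = np.einsum('sa,sa->s', pi, c)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, c_pi)
    Q = c + gamma * P @ V
    return V, Q

theta = np.zeros((nS, nA))        # softmax logits (primal variable)
lam, lam_max = 0.0, 10.0          # dual variable and projection bound
eta_theta, eta_lam = 0.1, 0.1     # step sizes (illustrative)

for t in range(200):
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    V_r, Q_r = policy_eval(pi, r)
    V_g, Q_g = policy_eval(pi, g)

    # Lagrangian advantage: A_L = A_r + lam * A_g.
    A_L = (Q_r - np.einsum('sa,sa->s', pi, Q_r)[:, None]
           + lam * (Q_g - np.einsum('sa,sa->s', pi, Q_g)[:, None]))

    # Primal step: natural policy gradient ascent; under the softmax
    # parametrization this reduces to adding the scaled advantage to the logits.
    theta += eta_theta / (1 - gamma) * A_L

    # Dual step: projected sub-gradient descent on the constraint violation.
    lam = np.clip(lam - eta_lam * (rho @ V_g - b), 0.0, lam_max)

print("reward value:", rho @ V_r, " utility value:", rho @ V_g, " lambda:", lam)
```

The primal update uses the known fact that a natural gradient step with a softmax policy amounts to a multiplicative-weights update of the policy, i.e., an additive update of the logits by the (scaled) Lagrangian advantage; the dual step clips the multiplier to a bounded interval, playing the role of the projection.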
