Paper Title
An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms
Paper Authors
Paper Abstract
In this work, we consider policy-based methods for solving the reinforcement learning problem and establish sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. We consider various policy update rules for the actor, including the celebrated natural policy gradient. In contrast to the gradient-ascent viewpoint taken in the literature, we view the natural policy gradient as an approximate way of implementing policy iteration, and show that natural policy gradient (without any regularization) enjoys geometric convergence when increasing stepsizes are used. As for the critic, we consider TD-learning with linear function approximation and off-policy sampling. Since TD-learning is known to be unstable in this setting, we propose a stable generic algorithm (including two specific instances: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace) that uses multi-step returns and generalized importance sampling factors, and we provide a finite-sample analysis. Combining the geometric convergence of the actor with the finite-sample analysis of the critic, we establish, for the first time, an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.
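As a hedged illustration of the approximate-policy-iteration viewpoint (a minimal sketch based on standard facts about natural policy gradient under a tabular softmax parameterization, not a statement taken from the paper): with stepsize $\eta_k$ and a critic estimate $Q_k \approx Q^{\pi_k}$, the natural policy gradient update takes the multiplicative-weights form

\[
\pi_{k+1}(a \mid s) \;=\; \frac{\pi_k(a \mid s)\,\exp\!\big(\eta_k\, Q_k(s,a)\big)}{\sum_{a'} \pi_k(a' \mid s)\,\exp\!\big(\eta_k\, Q_k(s,a')\big)}.
\]

As $\eta_k \to \infty$, the update concentrates its mass on $\arg\max_a Q_k(s,a)$, i.e., it reduces to the greedy improvement step of policy iteration. Read this way, increasing stepsizes make each actor update an increasingly accurate policy-iteration step, which is the intuition behind the geometric convergence claimed above, with the critic's sampling and function approximation errors entering as the residual approximation error.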