Paper Title
Faster Policy Learning with Continuous-Time Gradients
Paper Authors
Paper Abstract
We study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous time, we show that it is possible to construct a more efficient and accurate gradient estimator. The standard back-propagation-through-time (BPTT) estimator computes exact gradients for a crude discretization of the continuous-time system. In contrast, we approximate the continuous-time gradients of the original system. With the explicit goal of estimating continuous-time gradients, we are able to discretize adaptively and construct a more efficient policy gradient estimator, which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing BPTT policy gradients with the more efficient CTPG estimates results in faster and more robust learning across a variety of control tasks and simulators.
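To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's method: it compares a BPTT-style gradient (fixed-step Euler discretization of the rollout, then automatic differentiation) against a continuous-time-style gradient (adaptive ODE solver differentiated via the adjoint method, here using jax.experimental.ode.odeint as a stand-in for the authors' CTPG estimator). The 1-D dynamics f, the linear policy, and the quadratic running cost are hypothetical placeholders chosen only for illustration.

import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

# Hypothetical known dynamics, policy, and running cost (illustrative only).
def f(x, u):
    return -x + u                      # x' = f(x, u)

def policy(theta, x):
    return theta * x                   # linear state-feedback policy

def running_cost(x, u):
    return x**2 + 0.1 * u**2

# (a) BPTT-style gradient: discretize first with fixed-step Euler, then autodiff.
def bptt_loss(theta, x0, T=2.0, n_steps=200):
    dt = T / n_steps
    def step(carry, _):
        x, J = carry
        u = policy(theta, x)
        return (x + dt * f(x, u), J + dt * running_cost(x, u)), None
    (xT, J), _ = jax.lax.scan(step, (x0, 0.0), None, length=n_steps)
    return J

# (b) Continuous-time-style gradient: adaptive solver on the augmented ODE
#     (state plus accumulated cost), differentiated via the adjoint method.
def ct_loss(theta, x0, T=2.0):
    def aug_dynamics(state, t, theta):
        x, _ = state
        u = policy(theta, x)
        return jnp.array([f(x, u), running_cost(x, u)])
    ts = jnp.array([0.0, T])
    out = odeint(aug_dynamics, jnp.array([x0, 0.0]), ts, theta)
    return out[-1, 1]                  # accumulated cost at time T

x0 = 1.0
g_bptt = jax.grad(bptt_loss)(0.5, x0)
g_ct = jax.grad(ct_loss)(0.5, x0)
print(g_bptt, g_ct)                    # the two estimates should roughly agree

The point of the sketch is the structural difference: (a) fixes the discretization up front and differentiates through every step, while (b) lets the solver choose step sizes adaptively for a target accuracy, which is the flavor of trade-off the abstract attributes to CTPG versus BPTT.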