Title

Deep Bayesian Quadrature Policy Optimization

Authors

Akella Ravi Tej, Kamyar Azizzadenesheli, Mohammad Ghavamzadeh, Anima Anandkumar, Yisong Yue

Abstract

We study the problem of obtaining accurate policy gradient estimates using a finite number of samples. Monte-Carlo methods have been the default choice for policy gradient estimation, despite suffering from high variance in the gradient estimates. On the other hand, more sample-efficient alternatives like Bayesian quadrature methods have received little attention due to their high computational complexity. In this work, we propose deep Bayesian quadrature policy gradient (DBQPG), a computationally efficient, high-dimensional generalization of Bayesian quadrature, for policy gradient estimation. We show that DBQPG can substitute for Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks. In comparison to Monte-Carlo estimation, DBQPG provides (i) more accurate gradient estimates with significantly lower variance, (ii) a consistent improvement in sample complexity and average return for several deep policy gradient algorithms, and (iii) uncertainty estimates for the gradient that can be incorporated to further improve performance.
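To make the contrast in the abstract concrete, below is a minimal NumPy sketch of the two estimators: the Monte-Carlo policy gradient averages score(s, a) · Q(s, a) over samples, while a Bayesian-quadrature-style estimate places a GP prior on the action-value function and returns the posterior mean of the gradient integral, of the generic form U(K + σ²I)⁻¹q. The kernel choice and all function and variable names are illustrative assumptions, and the sketch omits the structured kernels and fast GP inference that the paper's DBQPG uses to make this computation scale; it is not the paper's implementation.

```python
import numpy as np

def monte_carlo_pg(score_vecs, q_values):
    """Vanilla Monte-Carlo policy gradient: average of score(s, a) * Q(s, a)."""
    # score_vecs: (n, d) array, rows are grad_theta log pi(a_i | s_i)
    # q_values:   (n,)   array of sampled return / action-value estimates
    return score_vecs.T @ q_values / len(q_values)

def bayesian_quadrature_pg(score_vecs, q_values, noise=1e-2):
    """Bayesian-quadrature-style estimate: posterior mean of the gradient
    under a GP prior on the action-value function (generic form; not the
    paper's structured-kernel DBQPG implementation)."""
    n = len(q_values)
    K = score_vecs @ score_vecs.T              # illustrative inner-product kernel over samples
    alpha = np.linalg.solve(K + noise * np.eye(n), q_values)
    return score_vecs.T @ alpha                # gradient estimate U (K + sigma^2 I)^{-1} q

# Toy usage with random data (shapes only; not a real policy).
rng = np.random.default_rng(0)
U = rng.normal(size=(64, 8))                   # 64 samples, 8 policy parameters
q = rng.normal(size=64)
print(monte_carlo_pg(U, q))
print(bayesian_quadrature_pg(U, q))
```

In this simplified form, the quadrature estimate reweights the score vectors by a kernel-smoothed version of the sampled returns instead of a plain average, which is the source of the variance reduction the abstract refers to.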
