Paper Title

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Authors

Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Bo Cheng

Abstract

In reinforcement learning (RL), function approximation errors are known to easily lead to Q-value overestimations, thus greatly reducing policy performance. This paper presents a distributional soft actor-critic (DSAC) algorithm, an off-policy RL method for continuous control settings, which improves policy performance by mitigating Q-value overestimations. We first show theoretically that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations, because it adaptively adjusts the update stepsize of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution while keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
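
To make the mechanism described in the abstract concrete, the sketch below shows a critic that models the state-action return as a continuous (Gaussian) distribution and clips the learning target around the current mean estimate, so that the effective update stepsize stays bounded. This is a minimal illustrative sketch in PyTorch, not the authors' reference implementation: the `GaussianCritic` class, the `critic_loss` helper, the `clip_bound` value, and the network sizes are assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianCritic(nn.Module):
    """Critic that models the state-action return as a Gaussian N(mean, std)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, 1)
        self.log_std_head = nn.Linear(hidden, 1)

    def forward(self, obs, act):
        h = self.net(torch.cat([obs, act], dim=-1))
        mean = self.mean_head(h)
        std = F.softplus(self.log_std_head(h)) + 1e-4  # keep std strictly positive
        return mean, std


def critic_loss(critic, obs, act, target_return, clip_bound=10.0):
    """Negative log-likelihood of a clipped target return under N(mean, std).

    Clipping the target around the predicted mean bounds how far a single
    update can pull the estimate, which is the kind of variance control the
    abstract refers to for avoiding exploding/vanishing gradients.
    """
    mean, std = critic(obs, act)
    # Keep the target within `clip_bound` of the current mean estimate.
    clipped_target = torch.clamp(
        target_return, mean.detach() - clip_bound, mean.detach() + clip_bound
    )
    dist = torch.distributions.Normal(mean, std)
    return -dist.log_prob(clipped_target).mean()
```

In the maximum-entropy setting the abstract describes, `target_return` would typically be built as r + γ(z(s′, a′) − α log π(a′|s′)), with z(s′, a′) sampled from a target critic's return distribution; the exact target construction and hyperparameters in the paper may differ from this sketch.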
