Paper Title
Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods
Paper Authors
Paper Abstract
Actor-critic Reinforcement Learning (RL) algorithms have achieved impressive performance in continuous control tasks. However, they still suffer from two nontrivial obstacles, namely low sample efficiency and overestimation bias. To this end, we propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL). Our SDQ-CAL boosts Double Q-learning for off-policy actor-critic RL based on a modification of the Bellman optimality operator with Advantage Learning. Specifically, SDQ-CAL improves sample efficiency by modifying the reward so that optimal actions are more easily distinguished from the others based on experience. In addition, it mitigates the overestimation issue by updating a pair of critics simultaneously using double estimators. Extensive experiments reveal that our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks. We release the source code of our method at: \url{https://github.com/LQNew/SDQ-CAL}.
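The abstract does not spell out the modified operator. As a minimal sketch, assuming the standard Advantage Learning modification of the Bellman optimality operator $\mathcal{T}$ (the operator family the abstract refers to, not necessarily the exact conservative variant used by SDQ-CAL):

\[
\mathcal{T}_{\mathrm{AL}} Q(s,a) \;=\; \mathcal{T} Q(s,a) \;-\; \alpha \bigl[\max_{a'} Q(s,a') - Q(s,a)\bigr],
\]
which is equivalent to applying the ordinary Bellman optimality operator to a modified reward
\[
\tilde{r}(s,a) \;=\; r(s,a) \;-\; \alpha \bigl[\max_{a'} Q(s,a') - Q(s,a)\bigr],
\]
where $\alpha \in [0,1)$ controls how strongly the gap between the optimal action and the others is widened. In continuous control the maximization over $a'$ is typically approximated with the actor, e.g. $Q(s,\pi(s))$. The conservative form of this modification and the exact simultaneous double-estimator critic targets used by SDQ-CAL are defined in the paper itself and are not reproduced here.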