Paper Title

The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning

Authors

Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney, Marc G. Bellemare

Abstract

We study the multi-step off-policy learning approach to distributional RL. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. The distinction from the value-based case bears important implications on concepts such as backward-view algorithms. Our work provides the first theoretical guarantees on multi-step off-policy distributional RL algorithms, including results that apply to the small number of existing approaches to multi-step distributional RL. In addition, we derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent QR-DQN-Retrace that shows empirical improvements over QR-DQN on the Atari-57 benchmark. Collectively, we shed light on how unique challenges in multi-step distributional RL can be addressed both in theory and practice.
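The abstract does not spell out the algorithmic ingredients, so as background, here is a minimal sketch (assuming PyTorch) of the two standard components that QR-DQN-Retrace combines: the Retrace trace coefficients of Munos et al. (2016) and the quantile Huber loss of QR-DQN (Dabney et al., 2018). Function names, tensor shapes, and the way these pieces would be wired into the paper's path-dependent distributional target are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of standard Retrace traces and the QR-DQN quantile Huber
# loss; this is background for QR-DQN-Retrace, not the paper's exact code.
import torch

def retrace_coefficients(pi_probs: torch.Tensor, mu_probs: torch.Tensor,
                         lam: float = 1.0) -> torch.Tensor:
    """Per-step Retrace traces c_t = lam * min(1, pi(a_t|s_t) / mu(a_t|s_t))."""
    return lam * torch.clamp(pi_probs / mu_probs, max=1.0)

def quantile_huber_loss(pred_quantiles: torch.Tensor,
                        target_samples: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """Quantile regression loss between N predicted quantiles and M target samples."""
    n = pred_quantiles.shape[0]
    # Quantile midpoints tau_i = (i + 0.5) / N, as in QR-DQN.
    taus = (torch.arange(n, dtype=pred_quantiles.dtype) + 0.5) / n
    # Pairwise distributional TD errors u_ij = target_j - prediction_i, shape (N, M).
    u = target_samples.unsqueeze(0) - pred_quantiles.unsqueeze(1)
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u ** 2,
                        kappa * (u.abs() - 0.5 * kappa))
    # Asymmetric weight |tau - 1{u < 0}| turns the Huber loss into quantile regression.
    return ((taus.unsqueeze(1) - (u < 0).float()).abs() * huber / kappa).mean()
```

In a multi-step off-policy agent, the traces would weight corrections along the sampled trajectory while the quantile loss fits the return distribution; the paper's contribution is showing that, unlike scalar TD errors, these distributional corrections must account for the path taken, which is where the notion of path-dependent distributional TD error enters.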
