Paper Title
Adaptive Tree Backup Algorithms for Temporal-Difference Reinforcement Learning
Paper Authors
Paper Abstract
Q($\sigma$) is a recently proposed temporal-difference learning method that interpolates between learning from expected backups and sampled backups. It has been shown that intermediate values of the interpolation parameter $\sigma \in [0,1]$ perform better in practice, and it is therefore commonly believed that $\sigma$ functions as a bias-variance trade-off parameter that achieves these improvements. In our work, we disprove this notion, showing that the choice of $\sigma = 0$ minimizes variance without increasing bias. This indicates that $\sigma$ must have some other effect on learning that is not fully understood. As an alternative, we hypothesize the existence of a new trade-off: larger $\sigma$-values help overcome poor initializations of the value function, at the expense of higher statistical variance. To automatically balance these considerations, we propose Adaptive Tree Backup (ATB) methods, whose weighted backups evolve as the agent gains experience. Our experiments demonstrate that adaptive strategies can be more effective than relying on fixed or time-annealed $\sigma$-values.
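To make the interpolation concrete, below is a minimal sketch (in Python/NumPy) of the standard one-step Q($\sigma$) backup target from the literature (De Asis et al., 2018); the function name `q_sigma_target` and its argument names are illustrative, not taken from the paper.

```python
import numpy as np

def q_sigma_target(r, q_next, pi_next, a_next, sigma, gamma=0.99):
    """One-step Q(sigma) backup target.

    Interpolates between a sampled (Sarsa-style) backup and an
    expected (tree-backup / Expected-Sarsa-style) backup:
      sigma = 1 -> purely sampled backup,
      sigma = 0 -> purely expected backup.

    r       -- observed reward
    q_next  -- Q(s', a) for every action a at the next state s'
    pi_next -- target-policy probabilities pi(a | s')
    a_next  -- index of the action actually sampled at s'
    sigma   -- interpolation parameter in [0, 1]
    gamma   -- discount factor
    """
    sampled = q_next[a_next]                   # sample-based bootstrap
    expected = float(np.dot(pi_next, q_next))  # expectation under pi
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)

# Example: with sigma = 0 the target reduces to the expected (tree) backup,
# which the paper argues minimizes variance without adding bias.
q_next = np.array([1.0, 0.5, -0.2])
pi_next = np.array([0.6, 0.3, 0.1])
print(q_sigma_target(r=1.0, q_next=q_next, pi_next=pi_next, a_next=0, sigma=0.0))
```

An adaptive (ATB) method would replace the fixed `sigma` with backup weights that change as the agent gains experience; the abstract does not specify the weighting rule, so none is shown here.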