Paper Title

Stabilizing Q Learning Via Soft Mellowmax Operator

Paper Authors

Yaozhong Gan, Zhe Zhang, Xiaoyang Tan

Paper Abstract

Learning complicated value functions in high-dimensional state spaces by function approximation is a challenging task, partly because the max operator used in temporal-difference updates can theoretically cause instability for most linear or non-linear approximation schemes. Mellowmax is a recently proposed differentiable, non-expansion softmax operator that allows convergent behavior in learning and planning. Unfortunately, the performance bound for the fixed point it converges to remains unclear, and in practice its parameter is sensitive to the domain and has to be tuned case by case. Finally, the Mellowmax operator may suffer from oversmoothing, as it ignores the probability with which each action is taken when aggregating them. In this paper, we address all of the above issues with an enhanced Mellowmax operator, named SM2 (Soft Mellowmax). In particular, the proposed operator is reliable, easy to implement, and has a provable performance guarantee, while preserving all the advantages of Mellowmax. Furthermore, we show that our SM2 operator can be applied to challenging multi-agent reinforcement learning scenarios, leading to stable value function approximation and state-of-the-art performance.
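
For context, the Mellowmax operator discussed above is a log-average-exp over action values, mm_omega(q) = (1/omega) * log((1/n) * sum_i exp(omega * q_i)). Below is a minimal NumPy sketch of how it would replace the hard max in a Q-learning target; the discount gamma and temperature omega are illustrative placeholders, not values from the paper, and the SM2 operator itself is defined in the paper body rather than reproduced here.

```python
import numpy as np

def mellowmax(q_values, omega=5.0):
    """Mellowmax: a differentiable, non-expansion softmax over action values.

    mm_omega(q) = (1/omega) * log( (1/n) * sum_i exp(omega * q_i) )
    The log-sum-exp is shifted by max(q) for numerical stability.
    """
    q = np.asarray(q_values, dtype=np.float64)
    n = q.size
    c = q.max()  # shift to avoid overflow in exp
    return c + np.log(np.exp(omega * (q - c)).sum() / n) / omega

def td_target(reward, q_next_row, gamma=0.99, omega=5.0):
    """TD target for one transition, with Mellowmax in place of the hard max.

    Standard Q-learning would use: reward + gamma * q_next_row.max().
    gamma and omega here are illustrative hyperparameters.
    """
    return reward + gamma * mellowmax(q_next_row, omega)

# Usage: Q-values of the next state's actions for a single transition.
q_next = np.array([1.0, 2.0, 1.5])
print(mellowmax(q_next))        # lies between mean(q_next) and max(q_next)
print(td_target(0.5, q_next))
```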
