Paper Title
Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration
Paper Authors
Paper Abstract
Neural replicator dynamics (NeuRD) is an alternative to the foundational softmax policy gradient (SPG) algorithm motivated by online learning and evolutionary game theory. The NeuRD expected update is designed to be nearly identical to that of SPG; however, we show that the Monte Carlo updates differ in a substantial way: the importance correction accounting for a sampled action is nullified in the SPG update, but not in the NeuRD update. Naturally, this causes the NeuRD update to have higher variance than its SPG counterpart. Building on implicit exploration algorithms in the adversarial bandit setting, we introduce capped implicit exploration (CIX) estimates that allow us to construct NeuRD-CIX, which interpolates between this aspect of NeuRD and SPG. We show how CIX estimates can be used in a black-box reduction to construct bandit algorithms with regret bounds that hold with high probability, and we describe the benefits this entails for NeuRD-CIX in sequential decision-making settings. Our analysis reveals a bias-variance tradeoff between SPG and NeuRD, and shows how theory predicts that NeuRD-CIX will perform well more consistently than NeuRD while retaining NeuRD's advantages over SPG in non-stationary environments.
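To make the abstract's central claim concrete, below is a minimal sketch (not the authors' code) of a tabular softmax bandit that contrasts the Monte Carlo logit updates of SPG and NeuRD. It illustrates why the 1/pi(A) importance correction cancels in the SPG update but survives in the NeuRD update, and how an implicit-exploration denominator with a cap can temper the resulting variance. The exact CIX form is an assumption here: an Exp3-IX-style denominator pi(a) + gamma combined with a clip on the importance weight; `ix_estimate`, `eta`, `gamma`, and `cap` are illustrative names, not the paper's.

```python
# Sketch: SPG vs. NeuRD Monte Carlo updates on softmax logits, plus an
# assumed CIX-style reward estimate (IX denominator + capped weight).
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def ix_estimate(reward, action, pi, gamma=0.0, cap=np.inf):
    """Importance-weighted reward estimate with implicit exploration.

    gamma > 0 biases the estimate downward but bounds its variance;
    cap < inf additionally clips the importance weight (the assumed
    'capped' part of CIX). gamma = 0 and cap = inf recover the plain
    importance-sampling estimate used by vanilla NeuRD.
    """
    q_hat = np.zeros_like(pi)
    w = 1.0 / (pi[action] + gamma)  # IX importance weight
    q_hat[action] = min(w, cap) * reward
    return q_hat

n_actions, eta = 4, 0.1
z_spg = np.zeros(n_actions)    # SPG logits
z_neurd = np.zeros(n_actions)  # NeuRD logits
mean_rewards = np.array([0.2, 0.5, 0.7, 0.4])

for t in range(1000):
    # --- SPG: plain importance-sampling estimate ---
    pi = softmax(z_spg)
    a = rng.choice(n_actions, p=pi)
    r = rng.binomial(1, mean_rewards[a])
    q_hat = ix_estimate(r, a, pi)
    # Policy-gradient update on logits: eta * pi(a') * (q_hat(a') - v_hat).
    # The pi(A_t) factor cancels the 1/pi(A_t) importance weight, which is
    # the 'nullified' importance correction noted in the abstract; this
    # update is algebraically identical to REINFORCE, eta * r * (1_A - pi).
    v_hat = pi @ q_hat
    z_spg += eta * pi * (q_hat - v_hat)

    # --- NeuRD with a CIX-style estimate (assumed form) ---
    pi = softmax(z_neurd)
    a = rng.choice(n_actions, p=pi)
    r = rng.binomial(1, mean_rewards[a])
    q_hat = ix_estimate(r, a, pi, gamma=0.05, cap=20.0)
    # NeuRD adds the estimated advantage to the logits directly, so the
    # (capped) importance weight survives and drives the update's variance.
    z_neurd += eta * (q_hat - pi @ q_hat)

print("SPG policy:      ", np.round(softmax(z_spg), 3))
print("NeuRD-CIX policy:", np.round(softmax(z_neurd), 3))
```

With gamma = 0 and cap = inf the NeuRD branch becomes vanilla NeuRD, whose per-step update can be as large as eta * r / pi(A); the IX term and the cap bound that magnitude at the cost of a controlled bias, which is the bias-variance tradeoff the abstract describes.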