Paper Title
Zeroth-Order Supervised Policy Improvement
Paper Authors
Paper Abstract
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms rely on local first-order updates to exploit the learned value function, which results in limited sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods through zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently performing the argmax operation in a continuous action space: it finds the max-valued action using only a small number of samples. The policy learning of ZOSPI has two steps: first, it samples actions and evaluates them with a learned value estimator; then it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
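Below is a minimal sketch of the two-step policy update described in the abstract, assuming a PyTorch setup with a deterministic `actor` network, a learned `critic` approximating $Q(s, a)$, and an `actor_optimizer`. The function name and parameters (`num_candidates`, `noise_std`, the action bounds) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def zospi_policy_update(actor, critic, actor_optimizer, states,
                        num_candidates=16, noise_std=0.1,
                        action_low=-1.0, action_high=1.0):
    """Sketch of one ZOSPI-style policy improvement step (illustrative only).

    1) Sample candidate actions: perturbations of the current policy output
       (local) plus uniform samples over the action space (global).
    2) Score every candidate with the learned value estimator Q(s, a).
    3) Regress the policy toward the highest-valued candidate (supervised learning).
    """
    with torch.no_grad():
        pi_actions = actor(states)                       # (B, A)
        batch_size, action_dim = pi_actions.shape

        # Local candidates: noise around the current policy action.
        local = pi_actions.unsqueeze(1) + noise_std * torch.randn(
            batch_size, num_candidates, action_dim)      # (B, K, A)

        # Global candidates: uniform samples over the action space.
        global_ = torch.empty(batch_size, num_candidates, action_dim).uniform_(
            action_low, action_high)                     # (B, K, A)

        candidates = torch.cat([local, global_], dim=1).clamp(action_low, action_high)

        # Evaluate every candidate with the learned critic Q(s, a).
        expanded_states = states.unsqueeze(1).expand(-1, candidates.shape[1], -1)
        q_values = critic(expanded_states.reshape(-1, states.shape[-1]),
                          candidates.reshape(-1, action_dim)).reshape(batch_size, -1)

        # Target action = argmax over the sampled candidates.
        best_idx = q_values.argmax(dim=1)
        target_actions = candidates[torch.arange(batch_size), best_idx]

    # Supervised regression of the policy toward the max-valued actions.
    loss = F.mse_loss(actor(states), target_actions)
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()
    return loss.item()
```

Drawing candidates both near the current policy action and uniformly over the action space reflects the combination of local exploitation and global use of $Q$ described in the abstract; the exact sampling scheme here is an assumption for illustration.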