Paper Title
Zeroth-Order Supervised Policy Improvement
Paper Authors
Paper Abstract
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms rely on local first-order updates to exploit the learned value function, which results in limited sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods through zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently performing the argmax operation in a continuous action space: it finds the max-valued action using only a small number of samples. The policy learning of ZOSPI has two steps: first, it samples actions and evaluates them with a learned value estimator; then it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
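Below is a minimal sketch of the two-step policy update described in the abstract, assuming a PyTorch setup with a deterministic `actor` network, a learned `critic` approximating $Q(s, a)$, and an `actor_optimizer`. The function name and parameters (`num_candidates`, `noise_std`, the action bounds) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def zospi_policy_update(actor, critic, actor_optimizer, states,
                        num_candidates=16, noise_std=0.1,
                        action_low=-1.0, action_high=1.0):
    """Sketch of one ZOSPI-style policy improvement step (illustrative only).

    1) Sample candidate actions: perturbations of the current policy output
       (local) plus uniform samples over the action space (global).
    2) Score every candidate with the learned value estimator Q(s, a).
    3) Regress the policy toward the highest-valued candidate (supervised learning).
    """
    with torch.no_grad():
        pi_actions = actor(states)                       # (B, A)
        batch_size, action_dim = pi_actions.shape

        # Local candidates: noise around the current policy action.
        local = pi_actions.unsqueeze(1) + noise_std * torch.randn(
            batch_size, num_candidates, action_dim)      # (B, K, A)

        # Global candidates: uniform samples over the action space.
        global_ = torch.empty(batch_size, num_candidates, action_dim).uniform_(
            action_low, action_high)                     # (B, K, A)

        candidates = torch.cat([local, global_], dim=1).clamp(action_low, action_high)

        # Evaluate every candidate with the learned critic Q(s, a).
        expanded_states = states.unsqueeze(1).expand(-1, candidates.shape[1], -1)
        q_values = critic(expanded_states.reshape(-1, states.shape[-1]),
                          candidates.reshape(-1, action_dim)).reshape(batch_size, -1)

        # Target action = argmax over the sampled candidates.
        best_idx = q_values.argmax(dim=1)
        target_actions = candidates[torch.arange(batch_size), best_idx]

    # Supervised regression of the policy toward the max-valued actions.
    loss = F.mse_loss(actor(states), target_actions)
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()
    return loss.item()
```

Drawing candidates both near the current policy action and uniformly over the action space reflects the combination of local exploitation and global use of $Q$ described in the abstract; the exact sampling scheme here is an assumption for illustration.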