Paper Title
Deep Reinforcement Learning with Robust and Smooth Policy
Paper Authors
Paper Abstract
Deep reinforcement learning (RL) has achieved great empirical success in various domains. However, the large search space of neural networks requires a large amount of data, which makes current RL algorithms sample-inefficient. Motivated by the fact that many environments with continuous state spaces have smooth transitions, we propose to learn a policy that behaves smoothly with respect to the state. We develop a new framework -- \textbf{S}mooth \textbf{R}egularized \textbf{R}einforcement \textbf{L}earning ($\textbf{SR}^2\textbf{L}$), where the policy is trained with smoothness-inducing regularization. Such regularization effectively constrains the search space and enforces smoothness in the learned policy. Moreover, our proposed framework can also improve the robustness of the policy against measurement error in the state space, and can be naturally extended to the distributionally robust setting. We apply the proposed framework to both on-policy (TRPO) and off-policy (DDPG) algorithms. Through extensive experiments, we demonstrate that our method achieves improved sample efficiency and robustness.
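
The abstract describes a smoothness-inducing regularizer that penalizes how much the policy's output changes under small perturbations of the state. Below is a minimal sketch, in PyTorch, of one plausible form of such a regularizer for a deterministic policy (as in DDPG); the function name, the random (rather than adversarial) perturbation scheme, and the radius parameter eps are illustrative assumptions, not the paper's exact construction.

import torch


def smoothness_regularizer(policy, states, eps=0.01):
    """Illustrative smoothness-inducing regularizer (assumed form, not the
    paper's exact definition): penalize the squared change in a deterministic
    policy's action when each state is perturbed inside an l_inf ball of
    radius eps. A random perturbation is sampled here for simplicity; a
    worst-case (adversarial) perturbation could be used instead."""
    noise = (torch.rand_like(states) * 2.0 - 1.0) * eps  # uniform in [-eps, eps]
    perturbed_states = states + noise
    action_gap = policy(states) - policy(perturbed_states)
    return action_gap.pow(2).sum(dim=-1).mean()


# Usage sketch (hypothetical actor update): add the regularizer to the actor
# loss with a weight lam chosen on a validation environment.
# actor_loss = -critic(states, policy(states)).mean() \
#              + lam * smoothness_regularizer(policy, states, eps=0.01)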