Paper Title
Consistent Dropout for Policy Gradient Reinforcement Learning
Paper Authors
Paper Abstract
Dropout has long been a staple of supervised learning, but it is rarely used in reinforcement learning. We analyze why a naive application of dropout is problematic for policy-gradient learning algorithms and introduce consistent dropout, a simple technique that addresses this instability. We demonstrate that consistent dropout enables stable training with A2C and PPO in both continuous and discrete action environments across a wide range of dropout probabilities. Finally, we show that consistent dropout enables the online training of complex architectures such as GPT without needing to disable the model's native dropout.
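The abstract does not spell out the mechanism, but the instability it refers to arises when dropout masks are resampled between action selection and the gradient update: the log-probabilities in the policy-gradient loss then belong to a different randomly perturbed network than the one that produced the actions. Below is a minimal PyTorch sketch of one way to keep the two forward passes consistent by saving and replaying the mask; the `ConsistentDropout` class and its mask-passing interface are illustrative assumptions, not the paper's reference implementation.

```python
from typing import Optional, Tuple

import torch
import torch.nn as nn


class ConsistentDropout(nn.Module):
    """Dropout layer that can replay the exact mask sampled at rollout time.

    Reusing the stored mask during the policy-gradient update makes the
    update-time forward pass identical to the one used when the action
    was originally sampled.
    """

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(
        self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        if self.p == 0.0:
            return x, torch.ones_like(x)
        if mask is None:
            # Rollout: sample a fresh mask; the caller stores it in the
            # rollout buffer alongside the transition.
            keep = torch.bernoulli(torch.full_like(x, 1.0 - self.p))
            mask = keep / (1.0 - self.p)  # inverted-dropout scaling
        # Update: the stored mask is passed back in, so this forward pass
        # reproduces the perturbed network that generated the action.
        return x * mask, mask
```

Under this sketch, a rollout step would call `out, mask = layer(x)` and store `mask` with the transition; the A2C/PPO update would then call `out, _ = layer(x_batch, mask=stored_masks)` with the per-transition masks stacked along the batch dimension, recovering the original activations.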