Paper Title
Effects of sparse rewards of different magnitudes in the speed of learning of model-based actor critic methods
Paper Authors
Paper Abstract
Actor-critic methods with sparse rewards in model-based deep reinforcement learning typically require a deterministic binary reward function that reflects only two possible outcomes: whether, at each step, the goal has been achieved. Our hypothesis is that we can make an agent learn faster by applying an external environmental pressure during training, even though that pressure adversely affects its ability to obtain higher rewards. We therefore deviate from the classical paradigm of sparse rewards and add a uniformly sampled reward value to the baseline reward, showing that (1) the sample efficiency of the training process can be correlated with the adversity experienced during training, (2) it is possible to achieve higher performance in less time and with fewer resources, (3) the seed-to-seed variability in performance can be reduced, (4) there is a maximum point beyond which more pressure does not generate better results, and (5) random positive incentives have an adverse effect when a negative reward strategy is used, making an agent under those conditions learn poorly and more slowly. These results are shown to hold for Deep Deterministic Policy Gradients with Hindsight Experience Replay in a well-known Mujoco environment, but we argue that they could generalize to other methods and environments as well.
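The core idea of the abstract can be illustrated with a minimal sketch. The snippet below assumes a goal-based task with the negative sparse-reward convention common in Hindsight Experience Replay setups (0 on success, -1 otherwise) and adds a uniformly sampled adverse term on top of it; the function names, the `pressure` parameter, and the scalar goal representation are all illustrative assumptions, not the paper's actual implementation.

```python
import random


def sparse_reward(achieved, goal, threshold=0.05):
    """Classical sparse binary reward: 0.0 on success, -1.0 otherwise
    (the negative reward strategy referenced in the abstract)."""
    return 0.0 if abs(achieved - goal) <= threshold else -1.0


def pressured_reward(achieved, goal, pressure=0.5, rng=random):
    """Sparse reward plus a uniformly sampled adverse term.

    `pressure` (hypothetical parameter) sets the magnitude of the
    external environmental pressure. Sampling from [-pressure, 0]
    keeps the perturbation strictly non-positive, consistent with
    finding (5): random *positive* incentives hurt learning when a
    negative reward strategy is used.
    """
    return sparse_reward(achieved, goal, threshold=0.05) + rng.uniform(-pressure, 0.0)
```

During training one would replace the environment's sparse reward with `pressured_reward`; finding (4) suggests sweeping `pressure` upward only until performance stops improving.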