Paper title
Pitfalls of learning a reward function online
Paper authors
Paper abstract
In some agent designs, such as inverse reinforcement learning, an agent needs to learn its own reward function. Learning the reward function and optimising for it are typically two different processes, usually performed at different stages. We consider a continual (``one life'') learning approach where the agent both learns the reward function and optimises for it at the same time. We show that this comes with a number of pitfalls, such as deliberately manipulating the learning process in one direction, refusing to learn, ``learning'' facts already known to the agent, and making decisions that are strictly dominated (for all relevant reward functions). We formally introduce two desirable properties: the first is `unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise. The second is `uninfluenceability', whereby the reward-function learning process operates by learning facts about the environment. We show that an uninfluenceable process is automatically unriggable, and if the set of possible environments is sufficiently rich, the converse is true too.
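The abstract leaves the two properties informal. A minimal sketch of how they might be formalised is given below; the notation is assumed here rather than taken from the abstract: $\mu$ ranges over candidate environments, $\xi$ is the agent's prior over them, $\pi$ is the agent's policy, $h_m$ is the complete history, $\mathcal{R}$ is a set of candidate reward functions, and $\rho(\cdot \mid h_m)$ is the distribution over $\mathcal{R}$ that the learning process outputs after observing $h_m$.

% Sketch only, under the assumed notation above; not the paper's official definitions.
% Unriggability: the expected learned distribution over reward functions
% cannot be steered by the agent's choice of policy.
\[
  \forall \pi, \pi':\quad
  \mathbb{E}^{\pi}_{\xi}\!\left[\rho(\cdot \mid h_m)\right]
  \;=\;
  \mathbb{E}^{\pi'}_{\xi}\!\left[\rho(\cdot \mid h_m)\right].
\]
% Uninfluenceability: the learned distribution is a posterior over facts about
% the environment, for some assignment P(. | mu) of reward distributions to
% candidate environments.
\[
  \rho(R \mid h_m)
  \;=\;
  \sum_{\mu} P(R \mid \mu)\, P(\mu \mid h_m),
  \qquad
  P(\mu \mid h_m) \propto P(h_m \mid \mu, \pi)\, \xi(\mu).
\]

On this reading, the posterior $P(\mu \mid h_m)$ does not depend on which policy produced the history (the policy's action probabilities cancel in the normalisation), and averaging it over future histories returns the current posterior; this is the intuition behind the claim that an uninfluenceable process is automatically unriggable.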