Paper Title

Defining and Characterizing Reward Hacking

Authors

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger

Abstract

We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function leads to poor performance according to the true reward function. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it "narrower") or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
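
For readers who want the abstract's central definition in symbols, here is a minimal LaTeX sketch. The notation is an assumption following standard RL convention, not necessarily the paper's own: $J_R(\pi)$ denotes the expected return of policy $\pi$ under reward function $R$, $\mu^\pi$ its state-action visit-count (occupancy) vector, and $\Pi$ the policy set under consideration.

```latex
% Unhackability, restated from the abstract (notation assumed, not verbatim from the paper).
% J_R(\pi): expected return of policy \pi under reward R; \Pi: the policy set considered.
A proxy reward $R_2$ is \emph{unhackable} relative to the true reward $R_1$ on $\Pi$ if
\[
  \forall \pi, \pi' \in \Pi:\quad
  J_{R_2}(\pi') > J_{R_2}(\pi) \;\Longrightarrow\; J_{R_1}(\pi') \ge J_{R_1}(\pi),
\]
i.e., strictly increasing the expected proxy return never strictly decreases the expected
true return. The linearity the abstract highlights is that
$J_R(\pi) = \langle r, \mu^\pi \rangle$, where $r$ is the reward vector; over the full
set of stochastic policies this linear structure is what makes unhackability such a
strong condition.
```

For the finite-policy-set setting the abstract turns to, the hacking condition can be checked directly by enumerating policy pairs. The sketch below is a hypothetical illustration under the assumptions above (policies represented by normalized occupancy vectors; the function and variable names are ours, not the paper's):

```python
import numpy as np

def is_hackable(r_true, r_proxy, occupancies, eps=1e-9):
    """Return True if some policy pair increases the expected proxy return
    while strictly decreasing the expected true return (the hacking
    condition from the abstract), over a finite set of policies."""
    # Linearity in visit counts: J_R(pi) = <r, mu_pi>.
    returns = [(mu @ r_true, mu @ r_proxy) for mu in occupancies]
    for j_true_a, j_proxy_a in returns:
        for j_true_b, j_proxy_b in returns:
            # pi_b beats pi_a on the proxy but is strictly worse on the truth.
            if j_proxy_b > j_proxy_a + eps and j_true_b < j_true_a - eps:
                return True
    return False

# Toy example: three state-action pairs, occupancies normalized to sum to 1.
r_true = np.array([1.0, 0.0, 0.0])   # true reward pays only for the first pair
r_proxy = np.array([1.0, 1.0, 0.0])  # proxy also rewards the second pair
occupancies = [
    np.array([0.80, 0.10, 0.10]),
    np.array([0.05, 0.90, 0.05]),    # games the proxy via the second pair
    np.array([0.30, 0.30, 0.40]),
]
print(is_hackable(r_true, r_proxy, occupancies))  # True
```

The second occupancy vector scores higher than the first on the proxy (0.95 vs. 0.90) while scoring lower on the true reward (0.05 vs. 0.80), so this proxy is hackable on this policy set.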
