Paper Title
Reward is not Necessary: How to Create a Modular & Compositional Self-Preserving Agent for Life-Long Learning
Paper Authors
Paper Abstract
Reinforcement Learning views the maximization of rewards and avoidance of punishments as central to explaining goal-directed behavior. However, over a life, organisms will need to learn about many different aspects of the world's structure: the states of the world and state-vector transition dynamics. The number of combinations of states grows exponentially as an agent incorporates new knowledge, and there is no obvious weighted combination of pre-existing rewards or costs defined for a given combination of states, as such a weighting would need to encode information about good and bad combinations prior to an agent's experience in the world. Therefore, we must develop more naturalistic accounts of behavior and motivation in large state-spaces. We show that it is possible to use only the intrinsic motivation metric of empowerment, which measures the agent's capacity to realize many possible futures under a transition operator. We propose to scale empowerment to hierarchical state-spaces by using Operator Bellman Equations. These equations produce state-time feasibility functions, which are compositional hierarchical state-time transition operators that map the initial state and time at which an agent begins a policy to the final states and times of completing a goal. Because these functions are hierarchical operators, we can define hierarchical empowerment measures on them. An agent can then optimize plans to distant states and times to maximize its hierarchical empowerment-gain, allowing it to discover goals that bring about a more favorable coupling of its internal structure (physiological states) to its external environment (world structure & spatial state). Life-long agents could therefore be primarily animated by principles of compositionality and empowerment, exhibiting self-concern for the growth & maintenance of their own structural integrity without recourse to reward-maximization.
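As a rough illustration of the empowerment metric invoked in the abstract (not the paper's hierarchical construction or its Operator Bellman Equations), the following minimal Python sketch computes n-step empowerment of a single state under a small, deterministic transition operator. In the deterministic case the channel capacity between action sequences and resulting states reduces to the log of the number of distinct reachable states; a stochastic operator would instead require a Blahut-Arimoto-style capacity computation. The function and variable names (`empowerment`, `T`, `state`, `n`) are hypothetical and chosen only for this sketch.

```python
import numpy as np
from itertools import product

def empowerment(T, state, n=1):
    """n-step empowerment of `state` under a transition tensor T[a, s, s'].

    Assumes T is deterministic (each row T[a, s] puts probability 1 on one
    successor), so channel capacity reduces to log2 of the number of distinct
    states reachable by n-step action sequences. Illustrative sketch only;
    not the paper's API.
    """
    num_actions = T.shape[0]
    reachable = set()
    for seq in product(range(num_actions), repeat=n):
        s = state
        for a in seq:
            s = int(np.argmax(T[a, s]))  # follow the deterministic transition
        reachable.add(s)
    return np.log2(len(reachable))  # empowerment in bits

# Example: a 4-state ring where action 0 steps forward and action 1 stays put.
T = np.zeros((2, 4, 4))
for s in range(4):
    T[0, s, (s + 1) % 4] = 1.0  # move forward
    T[1, s, s] = 1.0            # stay
print(empowerment(T, state=0, n=2))  # log2(3) ~ 1.58 bits: states {0, 1, 2} reachable
```

States from which more distinct futures can be reached score higher under this measure; the paper's contribution, as summarized above, is to define such measures over hierarchical state-time feasibility operators rather than over primitive one-step transitions.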