Paper Title
Inverse Rational Control with Partially Observable Continuous Nonlinear Dynamics
Paper Authors
Paper Abstract
A fundamental question in neuroscience is how the brain creates an internal model of the world to guide actions using sequences of ambiguous sensory information. This is naturally formulated as a reinforcement learning problem under partial observations, where an agent must estimate relevant latent variables in the world from its evidence, anticipate possible future states, and choose actions that optimize total expected reward. This problem can be solved by control theory, which allows us to find the optimal actions for a given system dynamics and objective function. However, animals often appear to behave suboptimally. Why? We hypothesize that animals have their own flawed internal model of the world, and choose actions with the highest expected subjective reward according to that flawed model. We describe this behavior as rational but not optimal. The problem of Inverse Rational Control (IRC) aims to identify which internal model would best explain an agent's actions. Our contribution here generalizes past work on Inverse Rational Control which solved this problem for discrete control in partially observable Markov decision processes. Here we accommodate continuous nonlinear dynamics and continuous actions, and impute sensory observations corrupted by unknown noise that is private to the animal. We first build an optimal Bayesian agent that learns an optimal policy generalized over the entire model space of dynamics and subjective rewards using deep reinforcement learning. Crucially, this allows us to compute a likelihood over models for experimentally observable action trajectories acquired from a suboptimal agent. We then find the model parameters that maximize the likelihood using gradient ascent.
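To make the estimation loop in the abstract concrete, below is a minimal illustrative sketch (not the authors' implementation): a policy that acts rationally for any assumed model parameters theta is used to score an observed action trajectory under candidate theta, and the likelihood is maximized by gradient ascent. For simplicity the sketch conditions directly on the sensory observations, whereas the paper imputes (marginalizes over) the animal's private observation noise; the toy belief dynamics, proportional policy, and Gaussian action likelihood are all hypothetical stand-ins.

```python
# Illustrative IRC sketch with toy dynamics and a toy "rational" policy.
import torch

torch.manual_seed(0)

def belief_update(belief, obs, theta):
    # Toy belief dynamics: exponential smoothing with a theta-dependent gain.
    gain = torch.sigmoid(theta[0])
    return (1 - gain) * belief + gain * obs

def policy(belief, theta):
    # Toy rational policy: proportional controller with a theta-dependent gain.
    return -torch.sigmoid(theta[1]) * belief

def log_likelihood(theta, observations, actions, action_std=0.1):
    """Log-probability of an observed action trajectory under parameters theta."""
    belief = torch.zeros(1)
    logp = torch.zeros(1)
    for obs, act in zip(observations, actions):
        belief = belief_update(belief, obs, theta)       # agent's belief under theta
        mean_action = policy(belief, theta)              # rational action under theta
        logp = logp + torch.distributions.Normal(mean_action, action_std).log_prob(act)
    return logp

# Synthetic "animal" data generated from hidden parameters theta_true.
theta_true = torch.tensor([0.5, -0.3])
observations = [torch.randn(1) for _ in range(50)]
actions = []
belief = torch.zeros(1)
for obs in observations:
    belief = belief_update(belief, obs, theta_true)
    actions.append(policy(belief, theta_true) + 0.1 * torch.randn(1))

# Gradient ascent on theta (implemented as minimizing the negative log-likelihood).
theta = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)
for step in range(500):
    opt.zero_grad()
    loss = -log_likelihood(theta, observations, actions)
    loss.backward()
    opt.step()

print("estimated theta:", theta.detach())
```

In this toy setting the estimated theta approaches the generating parameters; in the paper's setting, the policy is instead a deep RL agent trained over the whole space of dynamics and subjective rewards, and the likelihood additionally marginalizes over the unobserved noisy sensory evidence.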