Paper Title
On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning
Paper Authors
Paper Abstract
We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to use data from several tasks, each represented by a Markov Decision Process (MDP), to find a policy that can be updated by one step of stochastic policy gradient for the realized MDP. In particular, using stochastic gradients in the MAML update step is crucial for RL problems, since computing exact gradients requires access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-Reinforcement Learning (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SG-MRL for finding an $\epsilon$-first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-reinforcement learning algorithms. We further show how our results extend to the case where more than one step of the stochastic policy gradient method is used at test time. Finally, we empirically compare SG-MRL and MAML in several deep RL environments.
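For orientation, the following is a schematic sketch of the one-step objective the abstract refers to, not the paper's exact formulation; the notation here is assumed, with $J_i(\theta)$ denoting the expected return of policy $\pi_\theta$ on the $i$-th MDP and $\alpha$ the adaptation step size:

$$\max_{\theta}\; \mathbb{E}_{i}\Big[\, J_i\big(\theta + \alpha\,\nabla_\theta J_i(\theta)\big) \Big].$$

Under this reading, SG-MRL replaces both the inner adaptation gradient and the outer meta-gradient with stochastic policy gradient estimates computed from sampled trajectories, which is what makes the update implementable when exact gradients over all possible trajectories are out of reach.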