Paper Title
Learning Synthetic Environments and Reward Networks for Reinforcement Learning
Paper Authors
Paper Abstract
We introduce Synthetic Environments (SEs) and Reward Networks (RNs), represented by neural networks, as proxy environment models for training Reinforcement Learning (RL) agents. We show that an agent, after being trained exclusively on the SE, is able to solve the corresponding real environment. While an SE acts as a full proxy to a real environment by learning about its state dynamics and rewards, an RN is a partial proxy that learns to augment or replace rewards. We use bi-level optimization to evolve SEs and RNs: the inner loop trains the RL agent, and the outer loop trains the parameters of the SE / RN via an evolution strategy. We evaluate our proposed new concept on a broad range of RL algorithms and classic control environments. In a one-to-one comparison, learning an SE proxy requires more interactions with the real environment than training agents only on the real environment. However, once such an SE has been learned, we do not need any interactions with the real environment to train new agents. Moreover, the learned SE proxies allow us to train agents with fewer interactions while maintaining the original task performance. Our empirical results suggest that SEs achieve this result by learning informed representations that bias the agents towards relevant states. Moreover, we find that these proxies are robust against hyperparameter variation and can also transfer to unseen agents.
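The bi-level scheme described above can be illustrated with a minimal sketch: the outer loop perturbs the SE/RN parameters with a simple Gaussian evolution strategy, while the inner loop trains an RL agent purely on the candidate SE and scores it on the real environment. This is not the authors' implementation; `train_agent_on_se` and `evaluate_on_real_env` are hypothetical placeholders for the inner RL training and the real-environment fitness evaluation.

```python
# Hedged sketch of the bi-level optimization from the abstract (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def train_agent_on_se(se_params):
    """Inner loop (placeholder): train a fresh RL agent only on the synthetic
    environment defined by se_params and return the trained agent."""
    return {"se_params": se_params}  # stand-in for a trained policy

def evaluate_on_real_env(agent):
    """Fitness (placeholder): average return of the trained agent on the real
    environment; here a dummy quadratic score for illustration."""
    return -float(np.sum(agent["se_params"] ** 2))

def evolve_se(dim=64, population=16, sigma=0.1, lr=0.02, generations=50):
    """Outer loop: Gaussian-perturbation evolution strategy over SE/RN parameters."""
    se_params = rng.normal(scale=0.05, size=dim)
    for _ in range(generations):
        noise = rng.normal(size=(population, dim))
        fitness = np.empty(population)
        for i in range(population):
            candidate = se_params + sigma * noise[i]
            agent = train_agent_on_se(candidate)      # inner RL training on the SE
            fitness[i] = evaluate_on_real_env(agent)  # score on the real environment
        # Standard ES update from normalized fitness values.
        advantages = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        se_params += lr / (population * sigma) * (noise.T @ advantages)
    return se_params

if __name__ == "__main__":
    learned_se = evolve_se()
    print("evolved SE parameter norm:", np.linalg.norm(learned_se))
```

Once the outer loop converges, the learned SE parameters can be reused to train new agents without further interactions with the real environment, which is the main benefit highlighted in the abstract.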