Paper Title
Asynchronous Reinforcement Learning for Real-Time Control of Physical Robots
Paper Authors
Paper Abstract
An oft-ignored challenge of real-world reinforcement learning is that the real world does not pause while agents make learning updates. Because standard simulated environments do not address this real-time aspect of learning, most available implementations of RL algorithms process environment interactions and learning updates sequentially. As a consequence, when such implementations are deployed in the real world, they may make decisions based on significantly delayed observations and fail to act responsively. Asynchronous learning has been proposed to solve this issue, but no systematic comparison between sequential and asynchronous reinforcement learning has been conducted using real-world environments. In this work, we set up two vision-based tasks with a robotic arm, implement an asynchronous learning system that extends a previous architecture, and compare sequential and asynchronous reinforcement learning across different action cycle times, sensory data dimensions, and mini-batch sizes. Our experiments show that as the time cost of learning updates increases, the action cycle time in the sequential implementation can grow excessively long, while the asynchronous implementation always maintains an appropriate action cycle time. Consequently, when learning updates are expensive, the performance of sequential learning degrades, and asynchronous learning outperforms it by a substantial margin. Our system learns in real time to reach and track visual targets from pixels within two hours of experience, and does so directly on a real robot, learning completely from scratch.
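
The core distinction the abstract draws is structural: in a sequential implementation the learning update blocks the next action, so the action cycle time grows with update cost, whereas an asynchronous implementation runs updates concurrently with environment interaction. The sketch below is a minimal illustration of that distinction, not the paper's actual system (which extends a prior real-time robot-learning architecture). `Env`, `policy`, and `learning_update` are hypothetical stand-ins, and the `time.sleep` calls model sensing latency and gradient-step cost.

```python
import threading
import time
from collections import deque

# Hypothetical placeholders for a real robot interface and learner.
class Env:
    def reset(self):
        return 0.0
    def step(self, action):
        time.sleep(0.01)  # models sensing/actuation latency
        return 0.0, 0.0, False  # next_obs, reward, done

def policy(obs):
    return 0.0  # stand-in for the agent's action selection

def learning_update(batch):
    time.sleep(0.05)  # models an expensive gradient step

def sequential_loop(env, steps=100):
    """Update blocks the next action: cycle time = latency + update cost."""
    obs = env.reset()
    buffer = deque(maxlen=10_000)
    for _ in range(steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs))
        learning_update(list(buffer)[-32:])  # blocks the action loop
        obs = next_obs

def asynchronous_loop(env, steps=100):
    """Updates run in a background thread: cycle time ~ environment latency."""
    buffer = deque(maxlen=10_000)
    lock = threading.Lock()
    stop = threading.Event()

    def learner():
        while not stop.is_set():
            with lock:
                batch = list(buffer)[-32:]
            if batch:
                learning_update(batch)

    t = threading.Thread(target=learner, daemon=True)
    t.start()
    obs = env.reset()
    for _ in range(steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        with lock:
            buffer.append((obs, action, reward, next_obs))
        obs = next_obs  # action loop never waits on the learner
    stop.set()
    t.join()
```

Under these toy numbers, the sequential loop's action cycle time is roughly 60 ms (10 ms latency + 50 ms update), while the asynchronous loop's stays near 10 ms regardless of update cost; this is the gap the paper's experiments vary by changing sensory dimensions and mini-batch sizes.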