Paper Title
Dexterous Robotic Manipulation using Deep Reinforcement Learning and Knowledge Transfer for Complex Sparse Reward-based Tasks
Paper Authors
Paper Abstract
This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC consisted of using a TriFinger robot to manipulate a cube along a specified positional trajectory, but with no requirement for the cube to have any specific orientation. We used a relatively simple reward function, a combination of goal-based sparse reward and distance reward, in conjunction with Hindsight Experience Replay (HER) to guide the learning of the DRL agent (Deep Deterministic Policy Gradient (DDPG)). Our approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here, we extend this method by modifying the task of Phase 1 of the RRC to require the robot to maintain the cube in a particular orientation while the cube is moved along the required positional trajectory. The requirement to also orient the cube makes the agent unable to learn the task through blind exploration, due to the increased problem complexity. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that allows the strategies learned by the agent in the original task (which was agnostic to cube orientation) to be transferred to this task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, which improved the average positional deviation from 0.134 m to 0.02 m, and the average orientation deviation from 142° to 76°, during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
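The abstract describes the reward as a combination of a goal-based sparse reward and a distance reward. Below is a minimal sketch of what such a combined reward could look like; the success threshold, weighting, and function name are illustrative assumptions, not the values or interface used in the paper.

```python
import numpy as np

def combined_reward(cube_pos, goal_pos, success_threshold=0.02, distance_weight=1.0):
    """Hypothetical sparse + dense reward of the kind described in the abstract.

    Assumptions: cube_pos and goal_pos are 3D numpy arrays in metres;
    the threshold and weight are placeholder values.
    """
    distance = np.linalg.norm(cube_pos - goal_pos)
    sparse = 0.0 if distance < success_threshold else -1.0   # goal-based sparse term
    dense = -distance_weight * distance                      # distance-based shaping term
    return sparse + dense
```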
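The abstract also states that the Knowledge Transfer (KT) idea can be applied to any actor-critic algorithm by reusing strategies learned on the orientation-agnostic task. The sketch below shows one generic way such a transfer could be initialised in PyTorch, by copying matching parameters from a pretrained actor into the new task's actor; the function and the matching rule are assumptions for illustration, not the paper's actual transfer mechanism.

```python
import torch

def transfer_actor_weights(pretrained_actor: torch.nn.Module,
                           new_actor: torch.nn.Module) -> torch.nn.Module:
    """Illustrative weight transfer between actor networks (assumed approach).

    Copies only parameters whose names and shapes match, so layers that were
    resized to accept the extra orientation input keep their fresh weights.
    """
    pretrained_state = pretrained_actor.state_dict()
    new_state = new_actor.state_dict()
    for name, tensor in pretrained_state.items():
        if name in new_state and new_state[name].shape == tensor.shape:
            new_state[name] = tensor.clone()
    new_actor.load_state_dict(new_state)
    return new_actor
```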