Paper Title
Learning Preferences for Interactive Autonomy
Paper Authors
Paper Abstract
When robots enter everyday human environments, they need to understand their tasks and how they should perform those tasks. To encode these, reward functions, which specify the objective of a robot, are employed. However, designing reward functions can be extremely challenging for complex tasks and environments. A promising approach is to learn reward functions from humans. Recently, several robot learning works have embraced this approach and leveraged human demonstrations to learn reward functions. Known as inverse reinforcement learning, this approach relies on a fundamental assumption: humans can provide near-optimal demonstrations to the robot. Unfortunately, this is rarely the case: human demonstrations to the robot are often suboptimal for various reasons, e.g., the difficulty of teleoperation, the robot's high degrees of freedom, or humans' cognitive limitations. This thesis is an attempt towards learning reward functions from human users by using other, more reliable data modalities. Specifically, we study how reward functions can be learned using comparative feedback, in which the human user compares multiple robot trajectories instead of (or in addition to) providing demonstrations. To this end, we first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, and scaled comparisons; and describe how a robot can use these various forms of human feedback to infer a reward function, which may be parametric or non-parametric. Next, we propose active learning techniques that enable the robot to ask for comparative feedback that maximizes the expected information gained from the user's response. Finally, we demonstrate the applicability of our methods in a wide variety of domains, ranging from autonomous driving simulations to home robotics, and from standard reinforcement learning benchmarks to lower-body exoskeletons.
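The abstract describes two core technical ideas: inferring a reward function from comparative feedback and actively choosing which comparison query to ask next. The following is a minimal, illustrative sketch of one possible instantiation, not the thesis's actual method or code. It assumes trajectories are summarized by fixed feature vectors, a reward that is linear in those features, a Bradley-Terry (logistic) model of the user's answers, Metropolis-Hastings sampling of the posterior over reward weights, and a mutual-information criterion for selecting the next pairwise query; all names and parameters are made up for illustration.

```python
# Minimal sketch (not the thesis implementation): learning a linear reward
# from pairwise comparisons, with active query selection by information gain.
import numpy as np

rng = np.random.default_rng(0)


def response_likelihood(w, phi_a, phi_b):
    """P(user prefers trajectory A over B | reward weights w), Bradley-Terry model."""
    return 1.0 / (1.0 + np.exp(-(w @ (phi_a - phi_b))))


def sample_posterior(comparisons, dim, n_samples=500, n_burn=500, step=0.1):
    """Crude Metropolis-Hastings over reward weights given pairwise answers.

    comparisons: list of (phi_winner, phi_loser) trajectory-feature pairs.
    """
    def log_post(w):
        lp = -0.5 * w @ w  # standard normal prior on the weights
        for phi_win, phi_lose in comparisons:
            lp += np.log(response_likelihood(w, phi_win, phi_lose) + 1e-12)
        return lp

    w = np.zeros(dim)
    lp = log_post(w)
    samples = []
    for it in range(n_burn + n_samples):
        w_new = w + step * rng.normal(size=dim)
        lp_new = log_post(w_new)
        if np.log(rng.uniform()) < lp_new - lp:
            w, lp = w_new, lp_new
        if it >= n_burn:
            samples.append(w.copy())
    return np.array(samples)


def expected_information_gain(samples, phi_a, phi_b):
    """Mutual information between the user's answer and the reward weights,
    estimated from posterior samples (higher = more informative query)."""
    p = 1.0 / (1.0 + np.exp(-(samples @ (phi_a - phi_b))))

    def entropy(q):
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -(q * np.log(q) + (1 - q) * np.log(1 - q))

    return entropy(p.mean()) - entropy(p).mean()


if __name__ == "__main__":
    dim = 3
    w_true = np.array([1.0, -0.5, 0.3])       # hidden "user" reward weights
    candidates = rng.normal(size=(50, dim))    # candidate trajectory features

    comparisons = []
    for _ in range(10):                        # ten active comparison queries
        samples = sample_posterior(comparisons, dim)
        # Pick the candidate pair with the highest expected information gain.
        best_gain, best_pair = -np.inf, None
        for _ in range(200):
            i, j = rng.choice(len(candidates), size=2, replace=False)
            gain = expected_information_gain(samples, candidates[i], candidates[j])
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        i, j = best_pair
        # Simulate the (noisy) user's answer under the hidden reward.
        if rng.uniform() < response_likelihood(w_true, candidates[i], candidates[j]):
            comparisons.append((candidates[i], candidates[j]))
        else:
            comparisons.append((candidates[j], candidates[i]))

    w_hat = sample_posterior(comparisons, dim).mean(axis=0)
    cos = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
    print(f"cosine similarity to true reward after 10 queries: {cos:.2f}")
```

The same structure extends to the other feedback types the abstract mentions (best-of-many choices, rankings, scaled comparisons) by swapping in the corresponding response likelihood, and to non-parametric rewards by replacing the linear model with, e.g., a Gaussian process.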