Paper Title
When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?
Paper Authors
Paper Abstract
Offline reinforcement learning (RL) algorithms can acquire effective policies by utilizing previously collected experience, without any online interaction. It is widely understood that offline RL is able to extract good policies even from highly suboptimal data, a scenario where imitation learning finds suboptimal solutions that do not improve over the demonstrator that generated the dataset. However, another common use case for practitioners is to learn from data that resembles demonstrations. In this case, one can choose to apply offline RL, but can also use behavioral cloning (BC) algorithms, which mimic a subset of the dataset via supervised learning. Therefore, it seems natural to ask: when can an offline RL method outperform BC with an equal amount of expert data, even when BC is a natural choice? To answer this question, we characterize the properties of environments that allow offline RL methods to perform better than BC methods, even when only provided with expert data. Additionally, we show that policies trained on sufficiently noisy suboptimal data can attain better performance than even BC algorithms with expert data, especially on long-horizon problems. We validate our theoretical results via extensive experiments on both diagnostic and high-dimensional domains including robotic manipulation, maze navigation, and Atari games, with a variety of data distributions. We observe that, under specific but common conditions such as sparse rewards or noisy data sources, modern offline RL methods can significantly outperform BC.
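As a point of reference for how BC "mimics a subset of the dataset via supervised learning," below is a minimal, illustrative sketch rather than the paper's implementation: it fits a small policy network to (state, action) pairs by maximizing the log-likelihood of the dataset's actions. The dimensions, hyperparameters, and synthetic data are placeholder assumptions.

# Minimal behavioral cloning sketch: fit a policy to (state, action) pairs
# via supervised learning. Dimensions and data here are illustrative only.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2          # hypothetical environment sizes
policy = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim),        # logits over discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a previously collected dataset of demonstration transitions.
states = torch.randn(1024, state_dim)
actions = torch.randint(0, action_dim, (1024,))

for epoch in range(10):
    logits = policy(states)
    loss = loss_fn(logits, actions)   # maximize likelihood of dataset actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Offline RL methods, by contrast, additionally use the reward signal in the dataset to learn a value function and can therefore improve on the behavior that generated the data; the paper's question is when that extra machinery pays off even against BC trained on expert data.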