Paper Title

Minimax Optimal Online Imitation Learning via Replay Estimation

Paper Authors

Gokul Swamy, Nived Rajaraman, Matthew Peng, Sanjiban Choudhury, J. Andrew Bagnell, Zhiwei Steven Wu, Jiantao Jiao, Kannan Ramchandran

Paper Abstract

Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e., learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O}\left( \min\left( {H^{3/2}}/{N}, \, {H}/{\sqrt{N}} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
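
The replay-estimation step described in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the `Simulator` interface with `reset`/`step`, the hashable tabular state representation, and the open-loop replay of each cached action sequence are assumptions made purely for this example.

```python
from collections import Counter


class Simulator:
    """Assumed stochastic simulator interface for this sketch (not from the paper)."""

    def reset(self):
        """Sample and return an initial state."""
        raise NotImplementedError

    def step(self, state, action):
        """Sample a next state from the stochastic transition dynamics."""
        raise NotImplementedError


def replay_estimate_visitation(expert_trajectories, sim, num_replays=100):
    """Replay-estimate the expert's state visitation distribution.

    Rather than matching only the states that appear in the expert dataset
    (a high-variance target when N is small), each cached expert action
    sequence is re-executed `num_replays` times in the stochastic simulator,
    and all visited states are pooled into a smoother empirical estimate.

    expert_trajectories: list of trajectories, each a list of (state, action).
    Returns a dict mapping (timestep, state) -> estimated visitation probability.
    States are assumed hashable (e.g., a tabular setting) for this sketch.
    """
    counts = Counter()
    total = 0
    for traj in expert_trajectories:
        cached_actions = [action for (_, action) in traj]
        for _ in range(num_replays):
            state = sim.reset()
            for h, action in enumerate(cached_actions):
                counts[(h, state)] += 1
                total += 1
                # Open-loop replay of the cached expert action: the stochastic
                # simulator resamples the next state on every pass, which is
                # what smooths the visitation estimate.
                state = sim.step(state, action)
    return {key: c / total for key, c in counts.items()}
```

The resulting visitation estimate would then serve as the target distribution for an online moment-matching learner, as described in the abstract.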
