Paper Title

Learning Navigation Costs from Demonstration in Partially Observable Environments

Authors

Tianyu Wang, Vikas Dhiman, Nikolay Atanasov

Abstract

This paper focuses on inverse reinforcement learning (IRL) to enable safe and efficient autonomous navigation in unknown partially observable environments. The objective is to infer a cost function that explains expert-demonstrated navigation behavior while relying only on the observations and state-control trajectory used by the expert. We develop a cost function representation composed of two parts: a probabilistic occupancy encoder, with recurrent dependence on the observation sequence, and a cost encoder, defined over the occupancy features. The representation parameters are optimized by differentiating the error between demonstrated controls and a control policy computed from the cost encoder. Such differentiation is typically computed by dynamic programming through the value function over the whole state space. We observe that this is inefficient in large partially observable environments because most states are unexplored. Instead, we rely on a closed-form subgradient of the cost-to-go obtained only over a subset of promising states via an efficient motion-planning algorithm such as A* or RRT. Our experiments show that our model exceeds the accuracy of baseline IRL algorithms in robot navigation tasks, while substantially improving the efficiency of training and test-time inference.
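The key efficiency claim in the abstract is that the cost-to-go (and its subgradient) is computed only over the subset of states expanded by a motion planner such as A*, rather than by dynamic programming over the entire state space. The idea can be illustrated with a minimal sketch, not the authors' implementation: a backward A* search from the goal over a 2D grid of learned per-cell traversal costs, returning cost-to-go values only for the expanded ("promising") states. All names here are hypothetical, and the heuristic assumes every cell cost is at least 1 so that Manhattan distance stays admissible.

```python
import heapq

def cost_to_go_astar(costs, start, goal):
    """Backward A* from goal toward start on a grid of per-cell costs.

    Returns cost-to-go values only for the states actually expanded,
    instead of running dynamic programming over the whole state space.
    Assumes costs[r][c] >= 1, so the Manhattan heuristic is admissible.
    """
    rows, cols = len(costs), len(costs[0])
    # Heuristic: Manhattan distance to start (the search runs goal -> start).
    h = lambda s: abs(s[0] - start[0]) + abs(s[1] - start[1])
    g = {goal: 0.0}           # cost-to-go estimates
    closed = set()            # expanded (promising) states
    frontier = [(h(goal), goal)]
    while frontier:
        _, s = heapq.heappop(frontier)
        if s in closed:
            continue
        closed.add(s)
        if s == start:        # start reached: its cost-to-go is final
            break
        r, c = s
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_g = g[s] + costs[nr][nc]  # pay the cost of the cell entered
                if new_g < g.get((nr, nc), float("inf")):
                    g[(nr, nc)] = new_g
                    heapq.heappush(frontier, (new_g + h((nr, nc)), (nr, nc)))
    # Only the expanded subset carries cost-to-go values.
    return {s: g[s] for s in closed}
```

In the paper's setting, the learned cost map would come from the cost encoder applied to the occupancy features, and the subgradient of the demonstrated controls' loss is then taken only through these expanded states.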
