Paper Title

Identifying Critical States by the Action-Based Variance of Expected Return

Paper Authors

Izumi Karino, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Paper Abstract

The balance of exploration and exploitation plays a crucial role in accelerating reinforcement learning (RL). To deploy an RL agent in human society, its explainability is also essential. However, basic RL approaches have difficulty deciding when to exploit, as well as extracting useful points for a brief explanation of their operation. One reason for these difficulties is that such approaches treat all states the same way. Here, we show that identifying critical states and treating them specially is beneficial to both problems. These critical states are states at which the action selection substantially changes the potential for success or failure. We propose identifying critical states by the variance of the Q-function over actions, and exploiting with high probability at the identified states. These simple methods accelerate RL in a grid world with cliffs and in two baseline deep RL tasks. Our results also demonstrate that the identified critical states are intuitively interpretable with respect to the crucial nature of the action selection. Furthermore, our analysis of the relationship between the timing at which especially critical states are identified and rapid progress in learning suggests that a few especially critical states carry important information for rapidly accelerating RL.
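As a rough illustration of the mechanism the abstract describes, the sketch below layers variance-based critical-state detection on top of ordinary epsilon-greedy action selection: the variance of the Q-values across actions serves as the criticality score, and on states where it exceeds a threshold the agent exploits with high probability. This is a minimal reading of the abstract, not the authors' implementation; the function name select_action and the hyperparameters var_threshold, eps_normal, and eps_critical are illustrative assumptions.

```python
import numpy as np

def select_action(q_values, var_threshold=1.0,
                  eps_normal=0.3, eps_critical=0.01, rng=None):
    """Epsilon-greedy action selection with a critical-state override.

    q_values: 1-D array of Q(s, a) for each action a in the current state s.
    A state is treated as critical when the variance of its Q-values
    across actions exceeds var_threshold (an illustrative threshold, not
    a value from the paper); on critical states the exploration rate is
    dropped to eps_critical so the agent exploits with high probability.
    """
    rng = np.random.default_rng() if rng is None else rng
    criticality = np.var(q_values)  # action-based variance of expected return
    eps = eps_critical if criticality > var_threshold else eps_normal
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action
```

The design point is that exploration continues at its usual rate on ordinary states, while states whose expected return depends strongly on the chosen action are handled almost greedily, matching the exploit-with-high-probability rule the abstract proposes for identified critical states.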
