Paper Title
Control with adaptive Q-learning
Paper Authors
Paper Abstract
This paper evaluates adaptive Q-learning (AQL) and single-partition adaptive Q-learning (SPAQL), two algorithms for efficient model-free episodic reinforcement learning (RL), on two classical control problems (Pendulum and Cartpole). AQL adaptively partitions the state-action space of a Markov decision process (MDP) while learning the control policy, i.e., the mapping from states to actions. The main difference between AQL and SPAQL is that the latter learns time-invariant policies, where the mapping from states to actions does not depend explicitly on the time step. This paper also proposes SPAQL with terminal state (SPAQL-TS), an improved version of SPAQL tailored to the design of regulators for control problems. Time-invariant policies are shown to perform better than time-variant ones in both problems studied. These algorithms are particularly suited to RL problems with a finite action space, as is the case with the Cartpole problem. SPAQL-TS solves the OpenAI Gym Cartpole problem while also displaying higher sample efficiency than trust region policy optimization (TRPO), a standard RL algorithm for solving control tasks. Moreover, the policies learned by SPAQL are interpretable, whereas TRPO policies are typically encoded as neural networks and are therefore hard to interpret. Yielding interpretable policies while remaining sample-efficient is the major advantage of SPAQL.
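To make the adaptive-partitioning idea concrete, the following is a minimal Python sketch of tabular Q-learning over an adaptively refined one-dimensional state partition. The class names, the visit-count splitting threshold, and the cell-halving rule are illustrative assumptions introduced here; the actual AQL/SPAQL refinement criteria, exploration schedule, and state-action (rather than state-only) partitioning are those described in the paper.

```python
# Illustrative sketch only: Q-learning keyed by cells of an adaptive partition.
# The refinement rule (split a cell after `split_after` visits) is a hypothetical
# stand-in for the splitting criteria used by AQL/SPAQL in the paper.
import random
from dataclasses import dataclass, field

@dataclass
class Cell:
    """One cell of the state partition, holding per-action Q-value estimates."""
    low: float
    high: float
    q: dict = field(default_factory=dict)   # action -> Q estimate
    visits: int = 0

class AdaptivePartitionQ:
    def __init__(self, low, high, actions, alpha=0.1, gamma=0.99,
                 epsilon=0.1, split_after=50):
        self.cells = [Cell(low, high)]       # start with a single coarse cell
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.split_after = split_after       # assumed refinement threshold

    def _cell(self, s):
        # Locate the cell containing state s (linear scan keeps the sketch simple).
        for c in self.cells:
            if c.low <= s <= c.high:
                return c
        return self.cells[-1]

    def act(self, s):
        # Epsilon-greedy action selection over the Q-values of the cell containing s.
        c = self._cell(s)
        if random.random() < self.epsilon or not c.q:
            return random.choice(self.actions)
        return max(c.q, key=c.q.get)

    def update(self, s, a, r, s_next):
        # Standard Q-learning update, with the table indexed by partition cell.
        c = self._cell(s)
        c.visits += 1
        target = r + self.gamma * max(self._cell(s_next).q.values(), default=0.0)
        c.q[a] = c.q.get(a, 0.0) + self.alpha * (target - c.q.get(a, 0.0))
        # Adaptive refinement: split a frequently visited cell in half so the
        # policy becomes finer-grained where the agent actually spends time.
        if c.visits >= self.split_after:
            mid = (c.low + c.high) / 2.0
            left = Cell(c.low, mid, dict(c.q))
            right = Cell(mid, c.high, dict(c.q))
            self.cells.remove(c)
            self.cells.extend([left, right])
```

Because the mapping from cells to greedy actions can be read off directly from the (small) list of cells, a policy represented this way stays interpretable, in contrast to a neural-network policy such as the one learned by TRPO.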