Paper Title
Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Generalized Büchi Automata
Paper Authors
Paper Abstract
This letter proposes a novel reinforcement learning method for synthesizing a control policy that satisfies a control specification described by a linear temporal logic formula. We assume that the controlled system is modeled by a Markov decision process (MDP). We convert the specification into a limit-deterministic generalized Büchi automaton (LDGBA) with several accepting sets, which accepts all infinite sequences satisfying the formula. The LDGBA is augmented so that it explicitly records the previous visits to accepting sets. We take a product of the augmented LDGBA and the MDP, based on which we define a reward function. The agent receives a reward whenever a state transition belongs to an accepting set that has not been visited for a certain number of steps. Consequently, the sparsity of rewards is relaxed, and optimal circulations among the accepting sets are learned. We show that the proposed method can learn an optimal policy when the discount factor is sufficiently close to one.
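The visit-recording augmentation described in the abstract can be sketched as a small reward function over the product states. This is an illustrative sketch, not the paper's implementation: the names (`accepting_sets`, `make_reward`, `step`) and the exact reset rule are assumptions. The augmented state carries a memory of which accepting sets remain unvisited; a reward is issued when a transition reaches a state in one of them, and the memory resets once every accepting set has been visited, encouraging circulation among the accepting sets.

```python
# Hypothetical sketch of the LDGBA-augmentation reward idea.
# accepting_sets: list of frozensets of automaton states (the generalized
# Büchi accepting sets). The memory tracks accepting-set indices that have
# not been visited since the last full circulation.

def make_reward(accepting_sets, reward_value=1.0):
    all_ids = frozenset(range(len(accepting_sets)))

    def step(memory, q_next):
        """memory: frozenset of unvisited accepting-set indices.
        q_next: automaton state reached by the transition.
        Returns (reward, new_memory)."""
        hit = {i for i in memory if q_next in accepting_sets[i]}
        reward = reward_value if hit else 0.0
        new_memory = memory - hit
        if not new_memory:          # every accepting set visited:
            new_memory = all_ids    # reset the memory to start a new round
        return reward, new_memory

    return all_ids, step
```

A short usage example: with two accepting sets, visiting a state in each in turn yields a reward both times, after which the memory resets and the cycle can be rewarded again.

```python
all_ids, step = make_reward([frozenset({1}), frozenset({2})])
r1, m1 = step(all_ids, 1)   # hits accepting set 0
r2, m2 = step(m1, 2)        # hits accepting set 1; memory resets
```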