Paper Title
Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints
Paper Authors
Paper Abstract
We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with a carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures $\widetilde{O}(\sqrt{T})$ regret and constant constraint violation, where $T$ is the total number of time steps. This strictly improves over the algorithm of Singh et al. (2020), whose regret and constraint violation are both $\widetilde{O}(T^{2/3})$. Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with $\widetilde{O}(T^{2/3})$ regret and constraint violation, which can be further improved to $\widetilde{O}(\sqrt{T})$ via a simple modification, albeit at the cost of making the algorithm computationally inefficient. To the best of our knowledge, these are the first provable algorithms for weakly communicating MDPs with cost constraints.
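For reference, here is a minimal sketch of the standard constrained average-reward formulation that bounds of this kind refer to; the notation ($r$, $c$, $\bar{c}$, $\rho^{\star}$) is assumed for illustration and is not taken from the abstract itself. The learner maximizes the long-run average reward subject to a bound on the long-run average cost,
\[
  \max_{\pi} \; \lim_{T \to \infty} \frac{1}{T}\,
    \mathbb{E}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right]
  \quad \text{s.t.} \quad
  \lim_{T \to \infty} \frac{1}{T}\,
    \mathbb{E}\!\left[\sum_{t=1}^{T} c(s_t, a_t)\right] \le \bar{c},
\]
and, writing $\rho^{\star}$ for the optimal constrained average reward, regret and constraint violation are typically measured as
\[
  \mathrm{Reg}(T) = T \rho^{\star} - \sum_{t=1}^{T} r(s_t, a_t),
  \qquad
  \mathrm{Vio}(T) = \sum_{t=1}^{T} c(s_t, a_t) - T \bar{c},
\]
so $\widetilde{O}(\sqrt{T})$ regret with constant violation means the cumulative cost overshoot stays bounded while the reward shortfall grows only as $\sqrt{T}$ up to logarithmic factors.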