Paper Title
Loop Estimator for Discounted Values in Markov Reward Processes
Paper Authors
Paper Abstract
At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator called the loop estimator that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of $O(1)$ when estimating the value of a single positive recurrent state $s$, unlike TD with $O(S)$ or model-based methods with $O\left(S^2\right)$. Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of $\widetilde{O}\left(\sqrt{\tau_s/T}\right)$ over steps $T$ on a single sample path, where $\tau_s$ is the maximal expected hitting time to state $s$. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.
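To make the regenerative idea concrete, note that for a positive recurrent state $s$ with bounded rewards, the strong Markov property gives $v(s) = \mathbb{E}[G] + \mathbb{E}[\gamma^{\tau}]\, v(s)$, where $G$ is the discounted reward accumulated between consecutive visits to $s$ and $\tau$ is the return time, so $v(s) = \mathbb{E}[G] / \left(1 - \mathbb{E}[\gamma^{\tau}]\right)$. The sketch below plugs empirical means over the loops observed on a single sample path into this identity using only a constant number of accumulators per state. It is a minimal illustration of a loop-style estimator, not necessarily the paper's exact algorithm; the function name `loop_value_estimate` and the `(state, reward)` path convention are assumptions made for the example.

```python
def loop_value_estimate(path, s, gamma):
    """Estimate the discounted value v(s) of a positive recurrent state s.

    `path` is an iterable of (state, reward) pairs from a single trajectory,
    where reward is received on the transition out of that state. Each visit
    to s starts a new "loop"; within a loop we accumulate the discounted
    reward G and the discount gamma**tau carried to the next visit, then plug
    empirical means into v(s) = E[G] / (1 - E[gamma**tau]).
    """
    sum_g = sum_d = 0.0   # running sums of G_i and gamma**tau_i over completed loops
    n = 0                 # number of completed loops
    g = disc = None       # statistics of the loop currently in progress (None: not started)

    for state, reward in path:
        if state == s:
            if g is not None:      # a loop just completed at this revisit of s
                sum_g += g
                sum_d += disc
                n += 1
            g, disc = 0.0, 1.0     # start a new loop at this visit to s
        if g is not None:          # inside a loop: accumulate discounted reward
            g += disc * reward
            disc *= gamma

    if n == 0:
        raise ValueError("state s was not revisited on this sample path")
    return (sum_g / n) / (1.0 - (sum_d / n))
```

Because only `sum_g`, `sum_d`, `n`, and the current-loop accumulators are stored, the space used for a single target state is constant, consistent with the $O(1)$ space claim in the abstract; estimating all states this way would simply run one such accumulator set per state.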