Paper Title

Non-stationary Risk-sensitive Reinforcement Learning: Near-optimal Dynamic Regret, Adaptive Detection, and Separation Design

Paper Authors

Yuhao Ding, Ming Jin, Javad Lavaei

Paper Abstract

We study risk-sensitive reinforcement learning (RL) based on an entropic risk measure in episodic non-stationary Markov decision processes (MDPs). Both the reward functions and the state transition kernels are unknown and allowed to vary arbitrarily over time, subject to a budget on their cumulative variations. When this variation budget is known a priori, we propose two restart-based algorithms, namely Restart-RSMB and Restart-RSQ, and establish their dynamic regret bounds. Based on these results, we further present a meta-algorithm that does not require any prior knowledge of the variation budget and can adaptively detect the non-stationarity of the exponential value functions. A dynamic regret lower bound is then established for non-stationary risk-sensitive RL to certify the near-optimality of the proposed algorithms. Our results also show that risk control and the handling of non-stationarity can be designed separately in the algorithm if the variation budget is known a priori, while the non-stationarity detection mechanism in the adaptive algorithm depends on the risk parameter. This work offers the first non-asymptotic theoretical analysis of non-stationary risk-sensitive RL in the literature.
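
For readers unfamiliar with the setting, the two displayed formulas below sketch the standard entropic risk objective and a typical variation-budget definition from the risk-sensitive and non-stationary RL literature; they are included only as background, and the paper's exact normalization, norms, and notation may differ.

% Entropic risk measure of a random cumulative reward X with risk parameter \beta \neq 0
% (risk-seeking for \beta > 0, risk-averse for \beta < 0; recovers \mathbb{E}[X] as \beta \to 0).
\[
  \rho_{\beta}(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[ e^{\beta X} \right]
\]

% A typical variation budget over K episodes, bounding the cumulative drift of the
% reward functions r_k and transition kernels P_k (assumed form; the paper may use different norms).
\[
  \Delta \;=\; \sum_{k=1}^{K-1} \sup_{s,a} \bigl| r_{k+1}(s,a) - r_{k}(s,a) \bigr|
  \;+\; \sum_{k=1}^{K-1} \sup_{s,a} \bigl\| P_{k+1}(\cdot \mid s,a) - P_{k}(\cdot \mid s,a) \bigr\|_{1}
\]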
