Paper Title

Non-stationary Risk-sensitive Reinforcement Learning: Near-optimal Dynamic Regret, Adaptive Detection, and Separation Design

Paper Authors

Yuhao Ding, Ming Jin, Javad Lavaei

Paper Abstract

We study risk-sensitive reinforcement learning (RL) based on an entropic risk measure in episodic non-stationary Markov decision processes (MDPs). Both the reward functions and the state transition kernels are unknown and allowed to vary arbitrarily over time, subject to a budget on their cumulative variations. When this variation budget is known a priori, we propose two restart-based algorithms, namely Restart-RSMB and Restart-RSQ, and establish their dynamic regret bounds. Based on these results, we further present a meta-algorithm that does not require any prior knowledge of the variation budget and can adaptively detect the non-stationarity of the exponential value functions. A dynamic regret lower bound is then established for non-stationary risk-sensitive RL to certify the near-optimality of the proposed algorithms. Our results also show that risk control and the handling of non-stationarity can be designed separately in the algorithm if the variation budget is known a priori, while the non-stationarity detection mechanism in the adaptive algorithm depends on the risk parameter. This work offers the first non-asymptotic theoretical analysis of non-stationary risk-sensitive RL in the literature.
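
For readers unfamiliar with the setting, the two displayed formulas below sketch the standard entropic risk objective and a typical variation-budget definition from the risk-sensitive and non-stationary RL literature; they are included only as background, and the paper's exact normalization, norms, and notation may differ.

% Entropic risk measure of a random cumulative reward X with risk parameter \beta \neq 0
% (risk-seeking for \beta > 0, risk-averse for \beta < 0; recovers \mathbb{E}[X] as \beta \to 0).
\[
  \rho_{\beta}(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[ e^{\beta X} \right]
\]

% A typical variation budget over K episodes, bounding the cumulative drift of the
% reward functions r_k and transition kernels P_k (assumed form; the paper may use different norms).
\[
  \Delta \;=\; \sum_{k=1}^{K-1} \sup_{s,a} \bigl| r_{k+1}(s,a) - r_{k}(s,a) \bigr|
  \;+\; \sum_{k=1}^{K-1} \sup_{s,a} \bigl\| P_{k+1}(\cdot \mid s,a) - P_{k}(\cdot \mid s,a) \bigr\|_{1}
\]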
