Paper Title

Online Sparse Reinforcement Learning

Authors

Botao Hao, Tor Lattimore, Csaba Szepesvári, Mengdi Wang

Abstract

We investigate the hardness of online reinforcement learning in fixed horizon, sparse linear Markov decision process (MDP), with a special focus on the high-dimensional regime where the ambient dimension is larger than the number of episodes. Our contribution is two-fold. First, we provide a lower bound showing that linear regret is generally unavoidable in this case, even if there exists a policy that collects well-conditioned data. The lower bound construction uses an MDP with a fixed number of states while the number of actions scales with the ambient dimension. Note that when the horizon is fixed to one, the case of linear stochastic bandits, the linear regret can be avoided. Second, we show that if the learner has oracle access to a policy that collects well-conditioned data then a variant of Lasso fitted Q-iteration enjoys a nearly dimension-free regret of $\tilde{O}( s^{2/3} N^{2/3})$ where $N$ is the number of episodes and $s$ is the sparsity level. This shows that in the large-action setting, the difficulty of learning can be attributed to the difficulty of finding a good exploratory policy.
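For intuition about the algorithmic side of the abstract, below is a minimal, self-contained sketch of one backward pass of Lasso fitted Q-iteration on a batch of transitions, assuming Python with NumPy and scikit-learn. It illustrates the general technique named in the abstract (an L1-penalized regression at each stage of the horizon), not the paper's exact variant; the feature map `phi`, the regularization level `alpha`, the horizon `H`, and the synthetic data are all hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative sketch only: a generic Lasso-based fitted Q-iteration pass over a
# batch of transitions, not the paper's exact algorithm. All names and parameters
# below (phi, H, d, alpha, the synthetic data) are assumptions for illustration.

rng = np.random.default_rng(0)

d, H = 50, 5                      # ambient feature dimension, horizon
n_actions, n_transitions = 4, 200

def phi(state, action):
    """Hypothetical sparse feature map for a (state, action) pair."""
    v = np.zeros(d)
    v[action] = 1.0                          # action indicator block
    v[n_actions:n_actions + 3] = state[:3]   # a few state features
    return v

# Synthetic batch of transitions (s, a, r, s') collected by some behavior policy.
states = rng.normal(size=(n_transitions, 3))
actions = rng.integers(n_actions, size=n_transitions)
rewards = rng.normal(size=n_transitions)
next_states = rng.normal(size=(n_transitions, 3))

X = np.array([phi(s, a) for s, a in zip(states, actions)])

# Backward induction over the horizon: at each stage h, regress the Bellman
# target r + max_a' Q_{h+1}(s', a') onto phi(s, a) with an L1 penalty so the
# fitted weight vector stays sparse.
weights = [np.zeros(d) for _ in range(H + 1)]   # weights[H] = 0 (terminal value)
for h in range(H - 1, -1, -1):
    next_q = np.array([
        max(weights[h + 1] @ phi(s_next, a) for a in range(n_actions))
        for s_next in next_states
    ])
    targets = rewards + next_q
    lasso = Lasso(alpha=0.1, fit_intercept=False)  # alpha controls sparsity
    lasso.fit(X, targets)
    weights[h] = lasso.coef_

print("nonzero coefficients at stage 0:", np.count_nonzero(weights[0]))
```

Note that the paper's regret guarantee additionally relies on oracle access to an exploratory policy that collects well-conditioned data; that component is not represented in this sketch.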
