Paper Title


Near-Optimal Sample Complexity Bounds for Constrained MDPs

Paper Authors

Sharan Vaswani, Lin F. Yang, Csaba Szepesvári

Paper Abstract


In contrast to the advances in characterizing the sample complexity for solving Markov decision processes (MDPs), the optimal statistical complexity for solving constrained MDPs (CMDPs) remains unknown. We resolve this question by providing minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP with access to a generative model (simulator). In particular, we design a model-based algorithm that addresses two settings: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to satisfy the constraint. For (i), we prove that our algorithm returns an $ε$-optimal policy with probability $1 - δ$, by making $\tilde{O}\left(\frac{S A \log(1/δ)}{(1 - γ)^3 ε^2}\right)$ queries to the generative model, thus matching the sample-complexity for unconstrained MDPs. For (ii), we show that the algorithm's sample complexity is upper-bounded by $\tilde{O} \left(\frac{S A \, \log(1/δ)}{(1 - γ)^5 \, ε^2 ζ^2} \right)$ where $ζ$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Finally, we prove a matching lower-bound for the strict feasibility setting, thus obtaining the first near minimax optimal bounds for discounted CMDPs. Our results show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
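The two sample-complexity bounds in the abstract can be compared directly: the strict feasibility bound differs from the relaxed one by an extra $\frac{1}{(1-\gamma)^2 \zeta^2}$ factor. The sketch below (hedged: it evaluates only the dominant terms of the $\tilde{O}(\cdot)$ bounds, ignoring constants and logarithmic factors, with illustrative parameter values) makes that gap concrete.

```python
import math

def relaxed_feasibility_bound(S, A, gamma, eps, delta):
    # Dominant term of Õ(SA log(1/δ) / ((1-γ)^3 ε²)):
    # matches the unconstrained-MDP sample complexity.
    return S * A * math.log(1 / delta) / ((1 - gamma) ** 3 * eps ** 2)

def strict_feasibility_bound(S, A, gamma, eps, delta, zeta):
    # Dominant term of Õ(SA log(1/δ) / ((1-γ)^5 ε² ζ²)),
    # where ζ is the Slater constant (size of the feasible region).
    return S * A * math.log(1 / delta) / ((1 - gamma) ** 5 * eps ** 2 * zeta ** 2)

# Illustrative (assumed) problem parameters:
S, A, gamma, eps, delta, zeta = 10, 4, 0.9, 0.1, 0.05, 0.1

# Strict feasibility pays an extra 1 / ((1-γ)² ζ²) factor
# relative to relaxed feasibility.
ratio = (strict_feasibility_bound(S, A, gamma, eps, delta, zeta)
         / relaxed_feasibility_bound(S, A, gamma, eps, delta))
print(ratio)  # equals 1 / ((1 - gamma)**2 * zeta**2)
```

With $\gamma = 0.9$ and $\zeta = 0.1$ the ratio is $10^4$, illustrating the abstract's point that demanding zero constraint violation is inherently harder than allowing small violations.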
