Paper Title

Queueing Network Controls via Deep Reinforcement Learning

Authors

Dai, J. G., Gluzman, Mark

Abstract

Novel advanced policy gradient (APG) methods, such as trust region policy optimization and proximal policy optimization (PPO), have become the dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for the notoriously difficult queueing network control problem is a Markov decision problem (MDP) with three features: an infinite state space, unbounded costs, and a long-run average cost objective. We extend the theoretical framework of these APG methods to such MDP problems. The resulting PPO algorithm is tested on a parallel-server system and on large-size multiclass queueing networks. The algorithm consistently generates control policies that outperform state-of-the-art heuristics in the literature under a variety of load conditions, from light to heavy traffic. These policies are demonstrated to be near-optimal when the optimal policy can be computed. A key to the success of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function via sampling. First, we use a discounted relative value function as an approximation of the relative value function. Second, we propose regenerative simulation to estimate the discounted relative value function. Finally, we incorporate the approximating martingale-process method into the regenerative estimator.
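For context on the first two variance reduction techniques mentioned in the abstract, the sketch below records the standard objects involved, written in generic notation. The symbols (the controlled chain $X_t$ under a fixed policy, one-step cost $c(x)$, regeneration state $x^*$, hitting time $\tau$) are assumptions made here for illustration and are not taken verbatim from the paper.

```latex
% Generic notation (assumed for illustration): X_t is the controlled chain under a
% fixed policy, c(x) is the one-step cost, x^* is a fixed regeneration state
% (e.g., the empty-network state), and tau is the first hitting time of x^*.
\begin{align*}
  % long-run average cost objective
  \eta &= \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=0}^{T-1} c(X_t)\Big],\\[4pt]
  % relative value function h: a solution of the Poisson equation
  h(x) + \eta &= c(x) + \mathbb{E}\big[h(X_{t+1}) \mid X_t = x\big],\\[4pt]
  % technique 1: discounted value function as a surrogate for relative values
  V_\gamma(x) &= \mathbb{E}_x\Big[\sum_{t=0}^{\infty} \gamma^t\, c(X_t)\Big],
  \qquad 0 < \gamma < 1,\\[4pt]
  % technique 2: regenerative decomposition at the first hitting time tau of x^*
  % (strong Markov property), enabling estimation from independent simulated cycles
  V_\gamma(x) &= \mathbb{E}_x\Big[\sum_{t=0}^{\tau-1} \gamma^t\, c(X_t)\Big]
  + \mathbb{E}_x\big[\gamma^{\tau}\big]\, V_\gamma(x^*).
\end{align*}
```

Under suitable stability conditions, differences $V_\gamma(x) - V_\gamma(x^*)$ approximate $h(x) - h(x^*)$ as $\gamma \uparrow 1$, which is what makes the discounted function a usable surrogate, and the regenerative identity lets $V_\gamma$ be estimated from independent simulated cycles that start and end at $x^*$. The third technique, the approximating martingale-process method, further reduces the variance of the regenerative estimator, roughly by subtracting a martingale control variate built from an approximate value function; the precise construction is given in the paper.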
