Paper Title

Pareto Deterministic Policy Gradients and Its Application in 5G Massive MIMO Networks

Authors

Zhou Zhou, Yan Xin, Hao Chen, Charlie Zhang, Lingjia Liu

Abstract

In this paper, we consider jointly optimizing cell load balance and network throughput via a reinforcement learning (RL) approach, where inter-cell handover (i.e., user association assignment) and massive MIMO antenna tilting are configured as the RL policy to learn. Our rationale behind using RL is to circumvent the challenges of analytically modeling user mobility and network dynamics. To accomplish this joint optimization, we integrate vector rewards into the RL value network and conduct RL actions via a separate policy network. We name this method Pareto deterministic policy gradients (PDPG). It is an actor-critic, model-free, deterministic policy algorithm that can handle the coupled objectives with the following two merits: 1) it solves the optimization by leveraging the degrees of freedom of the vector reward, as opposed to choosing a handcrafted scalar reward; 2) cross-validation over multiple policies can be significantly reduced. Accordingly, the RL-enabled network behaves in a self-organized way: it learns the underlying user mobility through measurement history to proactively operate handover and antenna tilt without environment assumptions. Our numerical evaluation demonstrates that the introduced RL method outperforms scalar-reward based approaches. Meanwhile, to be self-contained, an ideal static-optimization-based brute-force search solver is included as a benchmark. The comparison shows that the RL approach performs as well as this ideal strategy, even though the former is constrained by limited environment observations and a lower action frequency, whereas the latter has full access to the user mobility. The convergence of the introduced approach is also tested under different user mobility environments based on measurement data from a real scenario.
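To make the idea concrete, below is a minimal, hypothetical sketch of the kind of actor-critic update PDPG builds on: the critic outputs one Q-value per objective (e.g., load balance and throughput) instead of a single scalar, and a separate deterministic policy network is trained against those vector Q-values. All network sizes, the preference weights `w`, and hyperparameters are illustrative assumptions, not the authors' implementation; the abstract does not specify how PDPG combines the per-objective values, so a simple weighted scalarization is used here only for the actor step.

```python
# Hypothetical sketch of an actor-critic step with a vector-valued critic,
# in the spirit of the PDPG description above. Not the authors' code.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_OBJECTIVES = 16, 4, 2   # e.g., load balance and throughput
GAMMA = 0.99

# Deterministic policy network: state -> action (handover / tilt configuration).
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
# Vector critic: (state, action) -> one Q-value per objective.
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, N_OBJECTIVES))

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch, w):
    """One gradient step on a batch of transitions; `w` is an illustrative
    preference vector used only to scalarize the actor's objective."""
    s, a, r, s_next = batch                       # r has shape (batch, N_OBJECTIVES)
    # Critic regression target: vector Bellman backup (no target networks, for brevity).
    with torch.no_grad():
        a_next = actor(s_next)
        target = r + GAMMA * critic(torch.cat([s_next, a_next], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor ascends a weighted combination of the per-objective Q-values.
    # (Gradients also flow into the critic here but are discarded: only actor_opt steps.)
    q_pi = critic(torch.cat([s, actor(s)], dim=-1))
    actor_loss = -(q_pi * w).sum(dim=-1).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example usage with random data:
batch = (torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM),
         torch.randn(32, N_OBJECTIVES), torch.randn(32, STATE_DIM))
update(batch, w=torch.tensor([0.5, 0.5]))
```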
