Paper Title
Evolutionary Multi-Objective Reinforcement Learning Based Trajectory Control and Task Offloading in UAV-Assisted Mobile Edge Computing
Paper Authors
Paper Abstract
This paper studies the trajectory control and task offloading (TCTO) problem in an unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) system, where a UAV flies along a planned trajectory to collect computation tasks from smart devices (SDs). We consider a scenario in which the SDs are not directly connected to the base station (BS) and the UAV plays one of two roles: MEC server or wireless relay. The UAV makes task offloading decisions online, whereby the collected tasks can either be executed locally on the UAV or offloaded to the BS for remote processing. The TCTO problem involves multi-objective optimization, as its objectives are to simultaneously minimize the task delay and the UAV's energy consumption and maximize the number of tasks collected by the UAV. The problem is challenging because the three objectives conflict with one another. Existing reinforcement learning (RL) algorithms, whether single-objective RL or single-policy multi-objective RL, cannot address the problem well because they cannot output multiple policies for various preferences (i.e., weights) across objectives in a single run. This paper adapts evolutionary multi-objective RL (EMORL), a multi-policy multi-objective RL method, to the TCTO problem. The algorithm can output multiple optimal policies in a single run, each optimizing for a certain preference. Simulation results show that, compared with two evolutionary and two multi-policy RL algorithms, the proposed algorithm obtains superior nondominated policies by striking a balance among the three objectives in terms of policy quality.
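To make the notion of "nondominated policies" over the abstract's three objectives concrete, below is a minimal illustrative sketch in Python. It is not the paper's implementation; the names (Policy, dominates, nondominated_filter) and the sample numbers are hypothetical. It only shows how candidate policies could be filtered to a Pareto-nondominated set when task delay and UAV energy are minimized while the number of collected tasks is maximized.

```python
# Hypothetical sketch: Pareto nondominance over the three TCTO objectives
# (task delay: minimize, UAV energy: minimize, collected tasks: maximize).
from dataclasses import dataclass
from typing import List


@dataclass
class Policy:
    name: str
    delay: float   # average task delay (lower is better)
    energy: float  # UAV energy consumption (lower is better)
    tasks: int     # tasks collected by the UAV (higher is better)


def dominates(a: Policy, b: Policy) -> bool:
    """True if a is at least as good as b on every objective and strictly better on at least one."""
    at_least_as_good = a.delay <= b.delay and a.energy <= b.energy and a.tasks >= b.tasks
    strictly_better = a.delay < b.delay or a.energy < b.energy or a.tasks > b.tasks
    return at_least_as_good and strictly_better


def nondominated_filter(policies: List[Policy]) -> List[Policy]:
    """Keep only the policies that no other policy dominates."""
    return [p for p in policies
            if not any(dominates(q, p) for q in policies if q is not p)]


if __name__ == "__main__":
    # Illustrative (made-up) candidate policies, e.g. one per preference weighting.
    candidates = [
        Policy("low-delay", delay=0.8, energy=5.0, tasks=40),
        Policy("low-energy", delay=1.5, energy=3.2, tasks=35),
        Policy("dominated", delay=1.6, energy=5.5, tasks=30),
    ]
    for p in nondominated_filter(candidates):
        print(p.name)  # prints "low-delay" and "low-energy"; "dominated" is filtered out
```

In a multi-policy method such as the EMORL adaptation described in the abstract, a filter of this kind is what distinguishes the returned set of policies: each surviving policy represents a different trade-off among the three conflicting objectives rather than a single scalarized optimum.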