Paper Title


GitFL: Adaptive Asynchronous Federated Learning using Version Control

Paper Authors

Ming Hu, Zeke Xia, Zhihao Yue, Jun Xia, Yihao Huang, Yang Liu, Mingsong Chen

Paper Abstract


As a promising distributed machine learning paradigm that enables collaborative training without compromising data privacy, Federated Learning (FL) has been increasingly used in AIoT (Artificial Intelligence of Things) design. However, due to the lack of efficient management of straggling devices, existing FL methods greatly suffer from the problems of low inference accuracy and long training time. Things become even worse when taking various uncertain factors (e.g., network delays, performance variances caused by process variation) existing in AIoT scenarios into account. To address this issue, this paper proposes a novel asynchronous FL framework named GitFL, whose implementation is inspired by the famous version control system Git. Unlike traditional FL, the cloud server of GitFL maintains a master model (i.e., the global model) together with a set of branch models indicating the trained local models committed by selected devices, where the master model is updated based on both all the pushed branch models and their version information, and only the branch models after the pull operation are dispatched to devices. By using our proposed Reinforcement Learning (RL)-based device selection mechanism, a pulled branch model with an older version is more likely to be dispatched to a faster and less frequently selected device for the next round of local training. In this way, GitFL enables both effective control of model staleness and adaptive load balancing of versioned models among straggling devices, thus avoiding performance deterioration. Comprehensive experimental results on well-known models and datasets show that, compared with state-of-the-art asynchronous FL methods, GitFL can achieve up to 2.64X training acceleration and 7.88% inference accuracy improvement in various uncertain scenarios.
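The abstract outlines a concrete mechanism: each branch model carries a version number, the master model is merged from all pushed branches using that version information, and the RL-based selection routes stale branches to fast, rarely selected devices. The Python sketch below illustrates this cycle under stated assumptions only: the version-weighted averaging and the greedy selection score are illustrative stand-ins (the abstract does not give GitFL's actual update rule or RL policy), and all function and variable names here are hypothetical.

```python
import numpy as np

def merge_master(branch_models, versions):
    """Master update as a version-weighted average of all pushed branch
    models: branches with more accumulated rounds (higher version) weigh
    more. Illustrative assumption, not the paper's exact rule."""
    w = np.asarray(versions, dtype=float)
    w = w / w.sum()
    return sum(wi * m for wi, m in zip(w, branch_models))

def dispatch(versions, device_speed, pick_count):
    """Greedy stand-in for the RL-based device selection: staler
    (lower-version) branches are paired with faster and less frequently
    selected devices. Assumes at least as many idle devices as branches."""
    branch_order = np.argsort(versions)        # stalest branch first
    device_score = device_speed - pick_count   # prefer fast, rarely picked
    device_order = np.argsort(-device_score)   # best device first
    assignment = {}
    for b, d in zip(branch_order, device_order):
        assignment[int(b)] = int(d)
        pick_count[d] += 1                     # update selection frequency
    return assignment

# Toy usage: three branch models (as flat weight vectors) at versions 2, 5, 3.
branches = [np.ones(4) * v for v in (2, 5, 3)]
master = merge_master(branches, [2, 5, 3])
picks = dispatch([2, 5, 3],
                 device_speed=np.array([1.0, 3.0, 2.0, 0.5]),
                 pick_count=np.array([4.0, 1.0, 6.0, 0.0]))
# The version-2 branch (stalest) is assigned the device with the best
# speed-minus-frequency score, so it can catch up in version.
```

Routing the stalest branch to the fastest, least-recently-selected device is what keeps branch versions close together; this is how GitFL bounds model staleness and balances load across stragglers without blocking on slow devices the way synchronous FL does.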
