Paper Title
Sky Computing: Accelerating Geo-distributed Computing in Federated Learning
Paper Authors
Paper Abstract
Federated learning was proposed by Google to safeguard data privacy by training models locally on users' devices. However, as deep learning models grow in size to achieve better results, it becomes increasingly difficult to fit an entire model on a single device. Model parallelism is therefore used to divide the model weights among several devices. The approach currently in use allocates weights evenly among devices, but in practice a computation bottleneck can arise from the varying computing power of different users' devices. To address this problem, load balancing is needed to allocate the model weights according to each device's computational capability. In this paper, we propose Sky Computing, a load-balanced model-parallelism framework that adaptively allocates weights to devices. Sky Computing outperforms the baseline method by 55% in training time when training a 160-layer BERT model with 64 nodes. The source code can be found at https://github.com/hpcaitech/SkyComputing.
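To illustrate the core idea of capability-aware allocation, below is a minimal Python sketch, not the authors' implementation. The names `allocate_layers`, `layer_costs`, and `device_speeds` are hypothetical, and the input values are made up; a real system such as Sky Computing would benchmark both layer workloads and device throughput before partitioning.

```python
# A minimal sketch (hypothetical, not the SkyComputing codebase) of
# capability-proportional layer allocation for model parallelism:
# faster devices receive proportionally more layers.

def allocate_layers(layer_costs, device_speeds):
    """Assign contiguous blocks of layers to devices so that each
    device's share of total work is roughly proportional to its speed."""
    total_cost = sum(layer_costs)
    total_speed = sum(device_speeds)
    # Work quota for each device, proportional to its measured speed.
    quotas = [total_cost * s / total_speed for s in device_speeds]

    allocation, start = [], 0
    for quota in quotas[:-1]:
        acc, end = 0.0, start
        # Extend this device's block until its work quota is filled.
        while end < len(layer_costs) and acc + layer_costs[end] <= quota:
            acc += layer_costs[end]
            end += 1
        allocation.append((start, end))
        start = end
    allocation.append((start, len(layer_costs)))  # last device takes the rest
    return allocation

if __name__ == "__main__":
    layers = [1.0] * 160           # e.g. a 160-layer model, uniform layer cost
    speeds = [4.0, 2.0, 1.0, 1.0]  # hypothetical relative device speeds
    print(allocate_layers(layers, speeds))
    # -> [(0, 80), (80, 120), (120, 140), (140, 160)]
```

Under these assumptions, the 4x-faster device receives 80 of the 160 layers while the slowest devices receive 20 each, which is the load-balancing behavior the abstract contrasts with even allocation.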