Paper Title

Domain-specific Communication Optimization for Distributed DNN Training

Paper Authors

Hao Wang, Jingrong Chen, Xinchen Wan, Han Tian, Jiacheng Xia, Gaoxiong Zeng, Weiyan Wang, Kai Chen, Wei Bai, Junchen Jiang

Paper Abstract

Communication overhead poses an important obstacle to distributed DNN training and has drawn increasing attention in recent years. Despite continuous efforts, prior solutions such as gradient compression/reduction, compute/communication overlapping, and layer-wise flow scheduling remain coarse-grained and insufficient for efficient distributed training, especially when the network is under pressure. We present DLCP, a novel solution that exploits the domain-specific properties of deep learning to optimize the communication overhead of DNN training in a fine-grained manner. At its heart, DLCP comprises several key innovations beyond prior work: for example, it exploits the {\em bounded loss tolerance} of SGD-based training to improve tail communication latency, which cannot be avoided purely through gradient compression. It then performs fine-grained packet-level prioritization and dropping, as opposed to flow-level scheduling, based on the layers and magnitudes of gradients to further speed up model convergence without affecting accuracy. In addition, it leverages inter-packet order independence to perform per-packet load balancing without causing classical re-ordering issues. DLCP works with both Parameter Server and collective communication routines. We have implemented DLCP with commodity switches, integrated it with various training frameworks including TensorFlow, MXNet, and PyTorch, and deployed it in our small-scale testbed with 10 NVIDIA V100 GPUs. Our testbed experiments and large-scale simulations show that DLCP delivers up to $84.3\%$ additional training acceleration over the best existing solutions.
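To make the packet-level ideas above more concrete, below is a minimal Python sketch of the layer-aware prioritization and magnitude-based dropping described in the abstract. It is only an illustration of the intuition under stated assumptions: the function names (`assign_priority`, `should_drop`), the number of priority queues, and the magnitude threshold are all hypothetical choices for this sketch, not DLCP's actual mechanism, which enforces such decisions on commodity switches.

```python
# Hypothetical sketch of layer-aware packet prioritization and
# magnitude-based dropping, in the spirit of the abstract. All names,
# constants, and heuristics here are assumptions for illustration only.
import numpy as np

NUM_PRIORITY_LEVELS = 8            # assumed number of switch priority queues
DROP_MAGNITUDE_THRESHOLD = 1e-4    # assumed cutoff for "negligible" gradients


def assign_priority(layer_index: int, num_layers: int) -> int:
    """Map a gradient packet's layer to a priority queue (0 = highest).

    Gradients of layers closer to the input are needed earlier in the next
    iteration's forward pass, so they receive a higher priority -- the
    layer-aware intuition behind packet-level scheduling.
    """
    return min(NUM_PRIORITY_LEVELS - 1,
               layer_index * NUM_PRIORITY_LEVELS // max(num_layers, 1))


def should_drop(gradient_chunk: np.ndarray, network_under_pressure: bool) -> bool:
    """Decide whether a packet carrying this gradient chunk may be dropped.

    Illustrates bounded loss tolerance: when the network is congested,
    packets whose gradient values are uniformly tiny contribute little to
    the SGD update and can be sacrificed to cut tail latency.
    """
    if not network_under_pressure:
        return False
    return float(np.max(np.abs(gradient_chunk))) < DROP_MAGNITUDE_THRESHOLD


if __name__ == "__main__":
    # Toy usage: early layers get high priority; a near-zero gradient chunk
    # becomes droppable when the network is under pressure.
    rng = np.random.default_rng(0)
    num_layers = 50
    for layer in (0, 10, 49):
        scale = 1e-5 if layer == 49 else 1e-2
        chunk = rng.normal(scale=scale, size=256)
        prio = assign_priority(layer, num_layers)
        drop = should_drop(chunk, network_under_pressure=True)
        print(f"layer {layer:2d}: priority queue {prio}, droppable={drop}")
```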
