Paper Title
Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
Paper Authors
Paper Abstract
Driven by ever-larger models and datasets, distributed deep learning (DL) has become prevalent in recent years as a means of reducing training time by leveraging multiple computing devices (e.g., GPUs/TPUs). However, system scalability is limited by communication, which becomes the performance bottleneck. Addressing this communication issue has become a prominent research topic. In this paper, we provide a comprehensive survey of communication-efficient distributed training algorithms, focusing on both system-level and algorithmic-level optimizations. We first propose a taxonomy of data-parallel distributed training algorithms that incorporates four primary dimensions: communication synchronization, system architecture, compression techniques, and the parallelism of communication and computing tasks. We then investigate state-of-the-art studies that address problems in these four dimensions. We also compare the convergence rates of different algorithms to understand their convergence speed. Additionally, we conduct extensive experiments to empirically compare the convergence performance of various mainstream distributed training algorithms. Based on our system-level communication cost analysis and the theoretical and experimental convergence speed comparisons, we provide readers with an understanding of which algorithms are more efficient under specific distributed environments. Our research also suggests potential directions for further optimization.
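For readers unfamiliar with the compression dimension of the taxonomy, the sketch below illustrates one representative technique among those surveyed: top-k gradient sparsification with error feedback. It is a minimal, hypothetical single-worker example (the function names, the value of k, and the use of NumPy are illustrative assumptions, not from the paper); in a real system only the compressed (index, value) payload would be sent to the parameter server or all-reduced.

```python
# Minimal sketch (illustrative, not the paper's implementation) of top-k
# gradient sparsification with error feedback: only the k largest-magnitude
# gradient entries are communicated; dropped entries are accumulated locally
# as a residual and added back in later rounds.
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Return (indices, values) of the k largest-magnitude entries of grad."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    return idx, flat[idx]

def decompress(idx: np.ndarray, vals: np.ndarray, shape):
    """Rebuild a dense gradient that is zero except at the top-k entries."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

# Hypothetical single training round on one worker.
residual = np.zeros(1000)                 # error accumulated from earlier rounds
grad = np.random.randn(1000)              # stand-in for a locally computed gradient
corrected = grad + residual               # error feedback: add back dropped values
idx, vals = topk_compress(corrected, k=10)
residual = corrected - decompress(idx, vals, corrected.shape)  # keep what was dropped
# (idx, vals) is the only payload that would be communicated.
```

The design choice this sketch highlights is the trade-off the survey analyzes: communicating k entries instead of the full gradient reduces traffic per iteration, while error feedback is what allows such aggressive compression to retain a usable convergence rate.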