Paper Title


DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Paper Authors

Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang, Runsheng Wang, Ru Huang

Paper Abstract


The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation, wait for gradients aggregated from all workers, and receive weight updates before starting the next batch of tasks. This synchronous execution model exposes the overhead of gradient/weight communication among a large number of workers in a distributed training system. We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD with forward/back propagation to hide 100% of the communication overhead. By adjusting the gradient update scheme, the algorithm uses hardware resources more efficiently and reduces the reliance on low-latency, high-throughput interconnects. Theoretical analysis and experimental results show a convergence rate of O(1/sqrt(K)), the same as SGD. The performance evaluation demonstrates that DaSGD enables linear performance scaling with cluster size.
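To make the delayed-averaging idea in the abstract concrete, below is a minimal, single-process Python sketch of local SGD with delayed averaging. It is an illustration under simplifying assumptions, not the paper's implementation: the toy quadratic loss, the constants NUM_WORKERS, SYNC_PERIOD, DELAY, and LR, and the way local progress is blended back into the stale average are all hypothetical choices made for readability.

```python
# Minimal, single-process simulation of local SGD with delayed averaging.
# Illustrative only: the quadratic loss and the constants below are
# assumptions for this sketch, not the paper's configuration or notation.
import numpy as np

np.random.seed(0)
NUM_WORKERS = 4     # simulated workers
SYNC_PERIOD = 4     # local steps between launching an average (local SGD part)
DELAY = 2           # steps before the averaged model is applied (hidden comms)
LR = 0.1
STEPS = 200

target = np.array([3.0, -2.0])                   # optimum of the toy loss
w = [np.zeros(2) for _ in range(NUM_WORKERS)]    # per-worker model replicas
pending = []                                     # (apply_at_step, avg, snapshots)

def noisy_grad(wk):
    # gradient of 0.5 * ||w - target||^2 plus noise, mimicking minibatch sampling
    return (wk - target) + 0.1 * np.random.randn(2)

for step in range(1, STEPS + 1):
    # 1) workers never stall: local SGD updates continue every step
    for k in range(NUM_WORKERS):
        w[k] -= LR * noisy_grad(w[k])

    # 2) periodically launch an average; in a real cluster this all-reduce
    #    would run in the background while step 1 keeps executing
    if step % SYNC_PERIOD == 0:
        snapshots = [wk.copy() for wk in w]
        avg = np.mean(snapshots, axis=0)
        pending.append((step + DELAY, avg, snapshots))

    # 3) apply the (now stale) average once it "arrives" DELAY steps later,
    #    keeping the local progress each worker made during the delay
    while pending and pending[0][0] <= step:
        _, avg, snapshots = pending.pop(0)
        for k in range(NUM_WORKERS):
            w[k] = avg + (w[k] - snapshots[k])

print("final mean model:", np.mean(w, axis=0), "target:", target)
```

In a real distributed setting, step 2 would launch a non-blocking all-reduce and step 3 would consume its result once it completes, so the communication overlaps with the forward/back propagation of subsequent batches instead of stalling them.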
