Paper Title

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

Paper Authors

Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He

Paper Abstract

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question for Adam-based large model pre-training (e.g. BERT and GPT). In this paper, we demonstrate that the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are applied individually. To alleviate this limitation, we propose 0/1 Adam, which linearizes each Adam step by approximating its optimizer states using their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve the adaptivity, while its linearity allows utilizing 1-bit compression and local steps simultaneously for wall-clock time speedup. We provide a convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks such as BERT-Base, BERT-Large, GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam is able to reduce data volume by up to 87% and communication rounds by 54%, and achieve up to 2$\times$ higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam, while enjoying the same statistical convergence speed and end-task model accuracy on the GLUE dataset and the ImageNet validation set.
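To make the mechanism described in the abstract concrete, below is a minimal NumPy sketch of the core idea. The helper names (`one_bit_compress`, `adam_like_step_frozen_v`) are illustrative, not the paper's API, and the sketch omits the variance-refresh and local-step/allreduce schedules that the full algorithm specifies. The point it shows: once the second-moment state `v` is frozen at a stale estimate, the Adam-like step becomes linear in the gradient, which is what allows combining 1-bit compression (with error feedback) and local steps.

```python
import numpy as np

def one_bit_compress(update, error):
    # 1-bit compression with error feedback (illustrative helper):
    # transmit only the sign, rescaled to preserve average magnitude,
    # and carry the residual into the next communication round.
    corrected = update + error
    scale = np.abs(corrected).mean()
    compressed = scale * np.sign(corrected)
    return compressed, corrected - compressed

def adam_like_step_frozen_v(x, m, v_stale, grad,
                            lr=1e-3, beta1=0.9, eps=1e-8):
    # One Adam-like step with the second-moment state held at a stale
    # estimate v_stale. Because the preconditioner no longer depends on
    # the current gradient, the step is linear in grad -- the property
    # that lets 1-bit compression and local steps be applied together.
    m = beta1 * m + (1 - beta1) * grad
    x = x - lr * m / (np.sqrt(v_stale) + eps)
    return x, m
```

A conceptual sketch under stated assumptions, not the paper's exact implementation: in the actual method the variance state is refreshed on a prescribed schedule rather than frozen forever, and compression/local steps follow the paper's communication schedule.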
