Paper Title
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
Paper Authors
Paper Abstract
In data-parallel synchronous training of deep neural networks, different devices (replicas) run the same program with different partitions of the training batch, but the weight update computation is repeated on all replicas because the weights have no batch dimension to partition. This can be a performance and scalability bottleneck for typical language models with large weights, and for models with small per-replica batch sizes, which are typical in large-scale training. This paper presents an approach that automatically shards the weight update computation across replicas with efficient communication primitives and data formatting, using static analysis and transformations on the training computation graph. We show that this technique achieves substantial speedups on typical image and language models on Cloud TPUs, requiring no change to model code. It helps close the gap between traditionally expensive (Adam) and cheap (SGD) optimizers, since both then take only a small part of the training step time and have similar peak memory usage. It helped us achieve state-of-the-art training performance in Google's MLPerf 0.6 submission.
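The paper applies this transformation automatically at the graph/compiler level with no model-code changes. For intuition only, below is a minimal hand-written JAX sketch of the underlying communication pattern, not the paper's implementation: gradients are reduce-scattered so each replica updates only its own shard of the weights, and the updated shards are all-gathered back into full replicated weights. Names and choices such as `loss_fn`, `train_step`, `LR`, and the plain SGD update are illustrative assumptions.

```python
from functools import partial

import jax
import jax.numpy as jnp

LR = 0.1  # illustrative learning rate, not from the paper

def loss_fn(w, x, y):
    # Toy linear model with a squared-error loss, standing in for a real model.
    return jnp.mean((x @ w - y) ** 2)

@partial(jax.pmap, axis_name="replica")
def train_step(w, x, y):
    # Each replica computes gradients on its own partition of the batch.
    grads = jax.grad(loss_fn)(w, x, y)

    # Reduce-scatter: each replica receives the cross-replica gradient sum
    # for only its own shard of the (flattened) weights.
    grad_shard = jax.lax.psum_scatter(grads.reshape(-1), "replica", tiled=True)
    shard_size = grad_shard.shape[0]

    # Slice out this replica's shard of the current weights and apply the
    # optimizer update (plain SGD here; Adam state could be sharded the same way).
    start = jax.lax.axis_index("replica") * shard_size
    w_shard = jax.lax.dynamic_slice(w.reshape(-1), (start,), (shard_size,))
    new_shard = w_shard - LR * grad_shard

    # All-gather the updated shards so every replica again holds the full,
    # identical weights for the next step.
    return jax.lax.all_gather(new_shard, "replica", tiled=True).reshape(w.shape)
```

Calling `train_step` on weights replicated across `jax.local_device_count()` devices performs one data-parallel step in which the update math for each weight runs on exactly one replica. This sketch assumes the flattened weight size is divisible by the replica count; the paper's automatic approach additionally handles the data formatting needed for general shapes.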