Paper Title
Parallel Training of Deep Networks with Local Updates
Paper Authors
Paper Abstract
Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count, so have the compute budgets and times required to train them, increasing the need for compute-efficient methods that parallelize training. Two common approaches to parallelize the training of deep networks have been data and model parallelism. While useful, data and model parallelism suffer from diminishing returns in terms of compute efficiency for large batch sizes. In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
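To make the core idea concrete, below is a minimal sketch (not the paper's implementation) of local, layer-wise training in JAX: each block has its own auxiliary linear head and is updated only from its local loss, while `jax.lax.stop_gradient` cuts the backward pass at block boundaries so no global backpropagation crosses between blocks. The toy network, data shapes, cross-entropy loss, and plain SGD step are illustrative assumptions; in a real setup each block's update could run on its own device asynchronously.

```python
# Minimal sketch of local (layer-wise) parallelism, assuming a toy 2-block MLP
# on random data. All sizes, the auxiliary heads, and the optimizer are
# illustrative choices, not the paper's configuration.
import jax
import jax.numpy as jnp


def init_block(key, d_in, d_hidden, n_classes):
    k1, k2 = jax.random.split(key)
    return {
        "w": jax.random.normal(k1, (d_in, d_hidden)) * 0.1,          # block weights
        "head": jax.random.normal(k2, (d_hidden, n_classes)) * 0.1,  # local auxiliary head
    }


def block_forward(params, x):
    return jax.nn.relu(x @ params["w"])


def local_loss(params, x, y):
    # Truncated backprop: the loss is computed from this block's activations
    # through its own auxiliary head only.
    h = block_forward(params, x)
    logits = h @ params["head"]
    return -jnp.mean(jnp.sum(jax.nn.log_softmax(logits) * y, axis=-1))


@jax.jit
def local_update(params, x, y, lr=0.1):
    # Forward with the current parameters; the activation handed to the next
    # block is wrapped in stop_gradient, so downstream blocks treat it as a
    # constant input and their gradients never reach this block.
    out = jax.lax.stop_gradient(block_forward(params, x))
    loss, grads = jax.value_and_grad(local_loss)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, out, loss


key = jax.random.PRNGKey(0)
kx, k1, k2 = jax.random.split(key, 3)
x = jax.random.normal(kx, (32, 16))           # toy batch of inputs
y = jax.nn.one_hot(jnp.arange(32) % 10, 10)   # toy one-hot labels
blocks = [init_block(k1, 16, 64, 10), init_block(k2, 64, 64, 10)]

for step in range(3):
    h = x
    for i, params in enumerate(blocks):
        # Each block updates from its local loss only; here the blocks run
        # sequentially for clarity rather than asynchronously on separate devices.
        blocks[i], h, loss = local_update(params, h, y)
```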