Paper Title
rTop-k: A Statistical Estimation Approach to Distributed SGD
Paper Authors
Paper Abstract
The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two research lines and demonstrate how statistical estimation models and their analysis can lead to new insights in the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD, which concatenates random-k and top-k, considered separately in the prior literature. We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone.
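As a hedged illustration of the sparsification scheme the abstract describes, the sketch below concatenates random-k and top-k selection: it first samples a pool of candidate coordinates uniformly at random, then keeps the k largest-magnitude entries within that pool before communication. The function name rtopk_sparsify, the ratio parameter r (pool size r*k), and the use of NumPy are assumptions made for illustration, not the paper's reference implementation.

```python
import numpy as np

def rtopk_sparsify(grad: np.ndarray, k: int, r: int, rng=None):
    """Sketch of a concatenated random-k / top-k sparsifier.

    Assumption: a pool of r*k coordinates is drawn uniformly at random
    (random-k stage), and the k largest-magnitude entries within that pool
    are kept (top-k stage). Returns (indices, values) of the k retained
    coordinates of the flattened gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = grad.ravel()
    d = grad.size
    # Stage 1: random subsampling of r*k candidate coordinates.
    pool_size = min(r * k, d)
    candidates = rng.choice(d, size=pool_size, replace=False)
    # Stage 2: top-k by magnitude, restricted to the sampled candidates.
    top = np.argsort(np.abs(grad[candidates]))[-k:]
    idx = candidates[top]
    return idx, grad[idx]

# Usage example: sparsify a toy gradient vector before exchanging it.
g = np.random.randn(10_000)
idx, vals = rtopk_sparsify(g, k=100, r=4)
```

In this reading, r interpolates between the two prior methods: r = 1 reduces to pure random-k, while a pool covering all coordinates reduces to pure top-k.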