Paper Title
L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning
Paper Authors
Paper Abstract
Data-parallel distributed training of deep neural networks (DNNs) has gained widespread adoption, but can still experience communication bottlenecks. To address this issue, entire families of compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, although layers are heterogeneous in terms of parameter count and their impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model's layers dynamically during training, improving the overall compression while leading to substantial speedups, without sacrificing accuracy. Our framework, called L-GreCo, is based on an adaptive algorithm that automatically picks, for each model layer, the compression parameters yielding the best compression ratio while satisfying an error constraint. Extensive experiments over image classification and language modeling tasks show that L-GreCo is effective across all existing families of compression methods, and achieves up to 2.5$\times$ training speedup and up to 5$\times$ compression improvement over efficient implementations of existing approaches, while recovering full accuracy. Moreover, L-GreCo is complementary to existing adaptive algorithms, improving their compression ratio by 50% and practical throughput by 66%.
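The abstract describes the core of L-GreCo as a constrained selection problem: choose one compression parameter per layer so that total communicated size is minimized while the accumulated compression error stays within a budget. The abstract does not spell out the selection procedure, so the sketch below is an illustrative reconstruction, not the authors' implementation: it solves the per-layer selection with a knapsack-style dynamic program over a discretized error budget. The candidate (size, error) pairs, the error metric, and the budget resolution are all hypothetical.

```python
# A minimal sketch of layer-wise compression-parameter selection under an
# error budget, assuming each layer comes with a list of candidate settings,
# each described by a (communicated size, induced error) pair. The dynamic
# program below is one plausible way to solve this constrained selection;
# it is NOT taken from the paper.

from typing import List, Tuple


def select_layer_params(
    candidates: List[List[Tuple[float, float]]],  # per layer: [(size, error), ...]
    error_budget: float,
    resolution: int = 100,
) -> Tuple[List[int], float]:
    """Pick one candidate index per layer, minimizing the total size subject
    to sum(error) <= error_budget, via a DP over a discretized budget."""
    INF = float("inf")
    step = error_budget / resolution
    # dp[b] = (smallest total size reachable with discretized error b,
    #          list of per-layer candidate indices achieving it)
    dp: List[Tuple[float, List[int]]] = [(INF, [])] * (resolution + 1)
    dp[0] = (0.0, [])

    for layer_candidates in candidates:
        new_dp: List[Tuple[float, List[int]]] = [(INF, [])] * (resolution + 1)
        for b in range(resolution + 1):
            size_so_far, choices = dp[b]
            if size_so_far == INF:
                continue
            for idx, (size, err) in enumerate(layer_candidates):
                nb = b + int(round(err / step)) if step > 0 else b
                if nb > resolution:
                    continue  # this choice would exceed the error budget
                if size_so_far + size < new_dp[nb][0]:
                    new_dp[nb] = (size_so_far + size, choices + [idx])
        dp = new_dp

    best_size, best_choices = min(dp, key=lambda t: t[0])
    return best_choices, best_size


if __name__ == "__main__":
    # Toy example: 3 layers, each with 3 hypothetical compression settings,
    # given as (communicated size in MB, induced error in arbitrary units).
    layers = [
        [(10.0, 0.0), (4.0, 0.3), (1.0, 0.9)],
        [(20.0, 0.0), (8.0, 0.5), (2.0, 1.5)],
        [(5.0, 0.0), (2.0, 0.2), (0.5, 0.6)],
    ]
    choices, total_size = select_layer_params(layers, error_budget=1.0)
    print("per-layer choices:", choices, "total size:", total_size)
```

In this toy run, the DP selects the middle setting for every layer (total size 14 MB at exactly the error budget), illustrating the trade-off the abstract describes: heavier compression is spent where it costs the least error. In practice, the per-layer error would be measured on actual gradients during training, and the selection would be re-run periodically as gradient statistics drift.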