Paper Title

Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction

Paper Authors

Novikov, Georgii, Bershatsky, Daniel, Gusak, Julia, Shonenkov, Alex, Dimitrov, Denis, Oseledets, Ivan

Paper Abstract

Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operations induce additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per element. We show that such approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks.
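To make the idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: the forward pass of a pointwise nonlinearity stays exact, but instead of saving the full-precision input for backward, only a few-bit code per element is retained, indexing a piecewise-constant approximation of the activation's derivative. The class name `FewBitGELU` and the specific `boundaries`/`levels` values are illustrative assumptions, not the paper's implementation; the paper computes the optimal partition and levels via dynamic programming.

```python
# Sketch: few-bit backward for GELU. Forward is exact; only the tensor kept
# for backward is compressed to a small integer code per element.
# NOTE: boundaries/levels below are placeholder values for illustration,
# not the optimal piecewise-constant approximation from the paper.

import torch

# Hypothetical 2-bit (4-level) approximation of the GELU derivative:
# 3 cut points split the real line into 4 intervals, one slope per interval.
boundaries = torch.tensor([-1.0, 0.0, 1.0])
levels = torch.tensor([-0.05, 0.15, 0.85, 1.05])


class FewBitGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Store only the interval index of each element (here kept in uint8;
        # a real implementation would bit-pack several codes per byte).
        codes = torch.bucketize(x, boundaries).to(torch.uint8)
        ctx.save_for_backward(codes)
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (codes,) = ctx.saved_tensors
        # Look up the piecewise-constant derivative and apply the chain rule.
        approx_deriv = levels.to(grad_output.dtype)[codes.long()]
        return grad_output * approx_deriv


# Usage: drop-in replacement for gelu inside a model's forward pass.
x = torch.randn(8, requires_grad=True)
y = FewBitGELU.apply(x)
y.sum().backward()
print(x.grad)
```

The point of the sketch is the memory trade-off: the saved tensor shrinks from 32 bits per element to a few bits, while the backward pass only incurs the error of the piecewise-constant derivative approximation.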
