Paper Title

FP8 Quantization: The Power of the Exponent

Paper Authors

Andrey Kuzmin, Mart van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

Paper Abstract

When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper investigates this benefit of the floating point format for neural network inference in depth. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. Then we show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm that enables learning both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training, where the difference in formats disappears as the network is trained to reduce the effect of outliers.
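To make the mantissa/exponent trade-off concrete, below is a minimal NumPy sketch of simulated FP8 rounding with a configurable bit split. The function name fp8_quantize, the fixed IEEE-like exponent bias of 2**(n_exp - 1), and the absence of reserved Inf/NaN codes are illustrative assumptions, not the paper's implementation, which additionally learns the scale parameter and the number of exponent bits.

```python
import numpy as np

def fp8_quantize(x, n_exp=4, n_man=3):
    """Round x to the nearest value on a simulated FP8 grid
    (1 sign bit + n_exp exponent bits + n_man mantissa bits = 8).

    Illustrative assumptions: an IEEE-like exponent bias of
    2**(n_exp - 1) and no codes reserved for Inf/NaN.
    """
    bias = 2 ** (n_exp - 1)
    # Largest representable magnitude: (2 - 2**-n_man) * 2**e_max.
    e_max = 2 ** n_exp - 1 - bias
    max_val = (2.0 - 2.0 ** -n_man) * 2.0 ** e_max
    x = np.clip(x, -max_val, max_val)
    # Per-element exponent; clamping at 1 - bias yields the fixed
    # step size of the subnormal range (and handles x == 0).
    e = np.floor(np.log2(np.maximum(np.abs(x), 1e-45)))
    e = np.maximum(e, 1 - bias)
    # Grid spacing within a binade is 2**(e - n_man).
    step = 2.0 ** (e - n_man)
    return np.round(x / step) * step

# Compare two of the bit splits the paper analyzes: more exponent
# bits buy dynamic range at the cost of coarser mantissa steps.
w = np.random.randn(5) * 3
print(fp8_quantize(w, n_exp=4, n_man=3))  # wide range, coarse steps
print(fp8_quantize(w, n_exp=2, n_man=5))  # narrow range, fine steps
```

This illustrates the abstract's central point: with more exponent bits the grid covers a wider dynamic range, which helps with severe outliers, while more mantissa bits give finer resolution around the bulk of the distribution.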
