Paper Title

Accelerating Neural Network Inference by Overflow Aware Quantization

Authors

Hongwei Xie, Shuo Zhang, Huanghao Ding, Yafei Song, Baitao Shao, Conggang Hu, Ling Cai, Mingyang Li

Abstract

The inherent heavy computation of deep neural networks prevents their widespread application. A widely used method for accelerating model inference is quantization, which replaces the input operands of a network with fixed-point values, so that the majority of the computation cost is concentrated in integer matrix multiply-accumulate operations. In practice, a high-bit accumulator leads to partially wasted computation, while a low-bit one typically suffers from numerical overflow. To address this problem, we propose an overflow-aware quantization method that designs a trainable adaptive fixed-point representation to optimize the number of bits for each input tensor while prohibiting numeric overflow during the computation. With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance. To verify the effectiveness of our method, we conduct image classification, object detection, and semantic segmentation tasks on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that the proposed method achieves performance comparable to state-of-the-art quantization methods while accelerating the inference process by about 2 times.
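The overflow argument in the abstract can be made concrete with a small numeric sketch: for an integer dot product, the operand bit-widths plus the accumulation length bound the accumulator width needed to avoid overflow. The snippet below is illustrative only and is not the paper's implementation; it assumes symmetric per-tensor quantization, a signed 16-bit accumulator, and a reduction length of 512, and the helper names `max_safe_bits` and `quantize_symmetric` are hypothetical (the paper learns the per-tensor bit allocation during training rather than using a static bound like this).

```python
import numpy as np


def max_safe_bits(k_accum: int, acc_bits: int = 16) -> int:
    """Upper bound on (activation_bits + weight_bits) so that summing
    k_accum signed fixed-point products never overflows a signed
    acc_bits-wide accumulator.

    Under symmetric quantization, a signed a-bit by signed w-bit product
    has magnitude below 2**(a + w - 2); summing k_accum of them adds at
    most ceil(log2(k_accum)) bits of headroom.
    """
    headroom = int(np.ceil(np.log2(k_accum)))
    return acc_bits + 1 - headroom  # constraint: a + w <= this value


def quantize_symmetric(x: np.ndarray, bits: int):
    """Symmetric per-tensor fixed-point quantization of x to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int64)
    return q, scale


# Example: one output element of a layer whose reduction length is K = 512.
K = 512
budget = max_safe_bits(K, acc_bits=16)   # 16 + 1 - 9 = 8 bits to split
act_bits, wgt_bits = 4, 4                # one split that respects the budget
assert act_bits + wgt_bits <= budget

rng = np.random.default_rng(0)
a, sa = quantize_symmetric(rng.standard_normal(K), act_bits)
w, sw = quantize_symmetric(rng.standard_normal(K), wgt_bits)

acc = int(np.sum(a * w))                 # integer dot product
assert -(2 ** 15) <= acc < 2 ** 15       # fits the 16-bit accumulator by construction
print("int accumulator:", acc, "dequantized:", acc * sa * sw)
```

The sketch shows the trade-off the abstract describes: a wider accumulator wastes compute, while a narrower one forces the operand bit-widths down; choosing how to spend the remaining bit budget per input tensor is what the paper's trainable adaptive fixed-point representation optimizes.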
