Paper Title
Least squares binary quantization of neural networks
Paper Authors
Paper Abstract
Quantizing the weights and activations of deep neural networks results in significant improvements in inference efficiency at the cost of lower accuracy. A source of the accuracy gap between full-precision and quantized models is the quantization error. In this work, we focus on binary quantization, in which values are mapped to -1 and 1. We provide a unified framework to analyze different scaling strategies. Inspired by the Pareto optimality of 2-bit versus 1-bit quantization, we introduce a novel 2-bit quantization with provably minimal least-squares error. Our quantization algorithms can be implemented efficiently in hardware using bitwise operations. We present proofs showing that our proposed methods are optimal, and we also provide an empirical error analysis. We conduct experiments on the ImageNet dataset and show a reduced accuracy gap when using the proposed least-squares quantization algorithms.
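To make the idea of least-squares binary quantization concrete, the sketch below shows the standard result that, for a 1-bit code b = sign(x), the scale minimizing ||x - v*b||^2 is the mean absolute value of x, and extends it to a greedy residual 2-bit scheme. This is an illustrative sketch under those assumptions (the function names and the greedy 2-bit construction are ours), not the paper's provably optimal 2-bit algorithm.

```python
import numpy as np

def binary_quantize_1bit(x):
    """1-bit quantization x ≈ v * b with b in {-1, +1}^n.

    For b = sign(x), the scale v minimizing ||x - v*b||^2 is mean(|x|)
    (a standard least-squares result; illustrative, not the paper's exact
    formulation).
    """
    b = np.sign(x)
    b[b == 0] = 1              # map zeros to +1 so b is strictly in {-1, +1}
    v = np.mean(np.abs(x))     # least-squares optimal scale for b = sign(x)
    return v, b

def binary_quantize_2bit(x):
    """Greedy 2-bit quantization x ≈ v1*b1 + v2*b2.

    Quantizes the residual of the 1-bit step with a second 1-bit code
    (a simple greedy baseline, not the provably optimal method).
    """
    v1, b1 = binary_quantize_1bit(x)
    v2, b2 = binary_quantize_1bit(x - v1 * b1)
    return (v1, b1), (v2, b2)

if __name__ == "__main__":
    x = np.random.randn(1000).astype(np.float32)
    v, b = binary_quantize_1bit(x)
    err1 = np.mean((x - v * b) ** 2)
    (v1, b1), (v2, b2) = binary_quantize_2bit(x)
    err2 = np.mean((x - v1 * b1 - v2 * b2) ** 2)
    print(f"1-bit MSE: {err1:.4f}, 2-bit MSE: {err2:.4f}")
```

Because the codes b1 and b2 take only the values -1 and +1, the inner products needed at inference time can be computed with XNOR and popcount instructions, which is what makes such quantization efficient in hardware.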