Paper Title

LSQ+: Improving low-bit quantization through learnable offsets and better initialization

Paper Authors

Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, Nojun Kwak

Paper Abstract

Unlike ReLU, newer activation functions (like Swish, H-swish, Mish) that are frequently employed in popular efficient architectures can also result in negative activation values, with skewed positive and negative ranges. Typical learnable quantization schemes [PACT, LSQ] assume unsigned quantization for activations and quantize all negative activations to zero, which leads to significant loss in performance. Naively using signed quantization to accommodate these negative values requires an extra sign bit, which is expensive for low-bit (2-, 3-, 4-bit) quantization. To solve this problem, we propose LSQ+, a natural extension of LSQ, wherein we introduce a general asymmetric quantization scheme with trainable scale and offset parameters that can learn to accommodate the negative activations. Gradient-based learnable quantization schemes also commonly suffer from high instability or variance in the final training performance, hence requiring a great deal of hyper-parameter tuning to reach satisfactory performance. LSQ+ alleviates this problem by using an MSE-based initialization scheme for the quantization parameters. We show that this initialization leads to significantly lower variance in final performance across multiple training runs. Overall, LSQ+ shows state-of-the-art results for EfficientNet and MixNet and also significantly outperforms LSQ for low-bit quantization of neural nets with Swish activations (e.g., a 1.8% gain with W4A4 quantization and up to a 5.6% gain with W2A2 quantization of EfficientNet-B0 on the ImageNet dataset). To the best of our knowledge, ours is the first work to quantize such architectures to extremely low bit-widths.
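To make the asymmetric scheme described in the abstract concrete, below is a minimal PyTorch-style sketch of an activation quantizer with a learnable scale and offset, plus an MSE-based initialization over a calibration batch. The class and method names (AsymmetricQuantizer, mse_init), the straight-through gradient estimator, and the grid-search initialization are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class AsymmetricQuantizer(nn.Module):
    """Sketch of an LSQ+-style activation quantizer.

    Quantizes x to b bits with a learnable scale s and offset beta:
        x_q   = clamp(round((x - beta) / s), q_min, q_max)
        x_hat = x_q * s + beta
    Gradients reach s and beta through a straight-through estimator.
    """

    def __init__(self, num_bits: int = 4):
        super().__init__()
        self.q_min = 0
        self.q_max = 2 ** num_bits - 1
        self.scale = nn.Parameter(torch.tensor(1.0))
        self.offset = nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Straight-through rounding: forward uses round(), backward is identity.
        q = (x - self.offset) / self.scale
        q = torch.clamp(q, self.q_min, self.q_max)
        q = q + (torch.round(q) - q).detach()
        return q * self.scale + self.offset

    @torch.no_grad()
    def mse_init(self, x: torch.Tensor, num_steps: int = 100) -> None:
        """Grid-search the clipping range of a calibration batch and keep the
        (scale, offset) pair with the lowest quantization MSE.  The search
        granularity and the shrinking of [min, max] are assumptions; the
        paper's exact initialization procedure may differ."""
        best_err = float("inf")
        x_min, x_max = x.min(), x.max()
        for i in range(1, num_steps + 1):
            alpha = i / num_steps                      # shrink factor for the clipping range
            lo, hi = alpha * x_min, alpha * x_max
            s = (hi - lo) / (self.q_max - self.q_min)
            s = torch.clamp(s, min=1e-8)               # guard against a degenerate range
            beta = lo
            q = torch.clamp(torch.round((x - beta) / s), self.q_min, self.q_max)
            err = ((q * s + beta - x) ** 2).mean()
            if err < best_err:
                best_err = err
                self.scale.copy_(s)
                self.offset.copy_(beta)
```

In a quantization-aware training loop, one would call mse_init once on a calibration batch of activations and then let the scale and offset train jointly with the network weights.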
