Paper Title

Empirical Evaluation of Post-Training Quantization Methods for Language Tasks

Authors

Ting Hu, Christoph Meinel, Haojin Yang

Abstract

Transformer-based architectures like BERT have achieved great success in a wide range of natural language tasks. Despite their decent performance, these models still have numerous parameters and high computational complexity, impeding their deployment in resource-constrained environments. Post-Training Quantization (PTQ), which enables low-bit computation without extra training, could be a promising tool. In this work, we conduct an empirical evaluation of three PTQ methods on BERT-Base and BERT-Large: Linear Quantization (LQ), Analytical Clipping for Integer Quantization (ACIQ), and Outlier Channel Splitting (OCS). OCS theoretically surpasses the others in minimizing the mean squared quantization error while avoiding distortion of the weights' outliers. This is consistent with the evaluation results on most language tasks of the GLUE benchmark and on a reading comprehension task, SQuAD. Moreover, low-bit quantized BERT models can outperform the corresponding 32-bit baselines on several small language tasks, which we attribute to the alleviation of over-parameterization. We further explore the limit of the quantization bit-width and show that OCS can quantize BERT-Base and BERT-Large to 3 bits while retaining 98% and 96% of the performance on the GLUE benchmark, respectively. Furthermore, we quantize the whole BERT family, i.e., BERT models in different configurations, and comprehensively evaluate their performance on the GLUE benchmark and SQuAD, hoping to provide valuable guidelines for their deployment in various computation environments.
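
For intuition only (not the paper's implementation), the sketch below shows simulated symmetric per-tensor linear quantization of a weight tensor in PyTorch. The function name fake_quantize, the per-tensor scale, and the clip argument are illustrative assumptions: clip=None corresponds to plain Linear Quantization (LQ), a smaller analytically chosen threshold captures the idea behind ACIQ, and OCS would additionally split the outlier channels before quantizing.

    import torch

    def fake_quantize(w, n_bits=8, clip=None):
        # Simulated ("fake") symmetric linear quantization of a weight tensor.
        # clip=None: plain LQ, with the scale set from the largest absolute weight.
        # A smaller clip (e.g. a threshold derived analytically from the weight
        # distribution, in the spirit of ACIQ) trades clipping error on outliers
        # for finer resolution on the bulk of the weights.
        qmax = 2 ** (n_bits - 1) - 1
        threshold = w.abs().max() if clip is None else torch.tensor(float(clip))
        scale = threshold / qmax
        q = torch.round(w / scale).clamp_(-qmax, qmax)
        return q * scale  # de-quantized weights, same shape as the input

    # Usage sketch: simulate 4-bit weights for one linear layer (shapes are illustrative).
    layer = torch.nn.Linear(768, 768)
    with torch.no_grad():
        layer.weight.copy_(fake_quantize(layer.weight, n_bits=4))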
