Paper Title

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Paper Authors

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

Paper Abstract

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, Llama-1/2, Falcon, Mistral, and Mixtral models. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.
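
The core mechanism described in the abstract, migrating quantization difficulty from activations to weights through a mathematically equivalent per-channel scaling, can be sketched in a few lines. The code below is a minimal illustration rather than the repository's actual API: the function name smooth_linear, its calibration interface, and the default alpha = 0.5 are assumptions, and the scaling formula follows the paper's description of the smoothing factor.

import torch

def smooth_linear(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Sketch of the offline smoothing transform: Y = X W = (X diag(s)^-1)(diag(s) W),
    so the product is unchanged while activation outliers are flattened before
    W8A8 (INT8 weight, INT8 activation) quantization.

    act_absmax: per-input-channel max |X|, collected offline from calibration data
    weight:     linear-layer weight of shape (out_features, in_features)
    alpha:      migration strength that balances difficulty between X and W
    """
    # Per-input-channel max |W| (max over the output dimension).
    w_absmax = weight.abs().amax(dim=0)

    # Alpha-balanced smoothing factor; illustrative, following the paper's
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) formulation.
    scales = act_absmax.clamp(min=1e-5).pow(alpha) / w_absmax.clamp(min=1e-5).pow(1 - alpha)

    # Fold diag(s) into the weights offline; at runtime the activation is divided
    # by s (typically fused into the preceding LayerNorm), so no extra kernels run.
    smoothed_weight = weight * scales.unsqueeze(0)
    return smoothed_weight, scales

With alpha = 0.5 the quantization difficulty is split roughly evenly between activations and weights; the paper reports this as a reasonable default for most evaluated models, with some models preferring a larger value.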
