Paper Title

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Paper Authors

Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee

Paper Abstract

Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving full precision for activations, resulting in practical speed-ups during inference on a single GPU. However, these improvements primarily stem from reduced memory movement, which necessitates a resource-intensive dequantization process rather than actual computational reduction. In this paper, we introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication, which not only eliminates the resource-intensive dequantization process but also reduces computational costs compared to previous kernels for weight-only quantization. Furthermore, we propose group-wise quantization to offer a flexible trade-off between compression ratio and accuracy. The impact of LUT-GEMM is facilitated by implementing high compression ratios through low-bit quantization and efficient LUT-based operations. We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency, achieving a remarkable 2.1$\times$ improvement on a single GPU when compared to OPTQ, which relies on the costly dequantization process.
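To make the abstract's idea concrete, below is a minimal NumPy sketch of the two ingredients it describes: a binary-coding-style weight representation (weights approximated as a scaled sum of {-1, +1} vectors) and a lookup table of precomputed partial sums that replaces the multiplications in the matrix product. The function names, the greedy quantizer, the sub-vector width `mu = 8`, the bit-width `q = 3`, and the single-column matvec are illustrative assumptions for exposition; this is not the paper's CUDA kernel or its exact group-wise formulation.

```python
import numpy as np

def bcq_quantize(w, q=3):
    """Greedy binary-coding quantization of one weight column:
    w ~= sum_i alphas[i] * bits[i], with bits[i] in {-1, +1}."""
    residual = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign plane
        a = np.abs(residual).mean()              # its scale factor
        alphas.append(a)
        bits.append(b)
        residual -= a * b
    return np.array(alphas), np.stack(bits)      # shapes (q,), (q, n)

def build_lut(x, mu=8):
    """For every length-mu sub-vector of the activation x, precompute
    x_sub . s for all 2^mu sign patterns s in {-1, +1}^mu."""
    n = len(x)
    assert n % mu == 0
    num_sub = n // mu
    patterns = np.array([[1.0 if (p >> k) & 1 else -1.0 for k in range(mu)]
                         for p in range(1 << mu)])   # (2^mu, mu)
    x_sub = x.reshape(num_sub, mu)                   # (num_sub, mu)
    return patterns @ x_sub.T                        # (2^mu, num_sub)

def lut_matvec(x, alphas, bits, mu=8):
    """Compute x . w_hat, where w_hat = sum_i alphas[i] * bits[i],
    using table lookups instead of per-element multiplications."""
    lut = build_lut(x, mu)                           # (2^mu, num_sub)
    num_sub = len(x) // mu
    acc = 0.0
    for a, b in zip(alphas, bits):
        b_sub = b.reshape(num_sub, mu)
        # pack each length-mu sign sub-vector into an integer LUT index
        idx = ((b_sub > 0).astype(np.int64)
               * (1 << np.arange(mu))).sum(axis=1)   # (num_sub,)
        acc += a * lut[idx, np.arange(num_sub)].sum()
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 64
    x = rng.standard_normal(n)
    w = rng.standard_normal(n)
    alphas, bits = bcq_quantize(w, q=3)
    approx = lut_matvec(x, alphas, bits)
    exact = x @ (alphas[:, None] * bits).sum(axis=0)
    print(approx, exact)   # the two agree; both approximate x @ w
```

The point of the table is reuse: each length-`mu` sub-vector of the activation can only produce one of `2^mu` possible partial sums, so those sums are computed once and then shared across all output columns and all bit planes, which is where the savings over a dequantize-then-GEMM pipeline come from.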
