Paper Title

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Paper Authors

Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee

Paper Abstract

Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving full precision for activations, resulting in practical speed-ups during inference on a single GPU. However, these improvements primarily stem from reduced memory movement, which necessitates a resource-intensive dequantization process rather than actual computational reduction. In this paper, we introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication, which not only eliminates the resource-intensive dequantization process but also reduces computational costs compared to previous kernels for weight-only quantization. Furthermore, we propose group-wise quantization to offer a flexible trade-off between compression ratio and accuracy. The impact of LUT-GEMM is facilitated by implementing high compression ratios through low-bit quantization and efficient LUT-based operations. We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency, achieving a remarkable 2.1$\times$ improvement on a single GPU when compared to OPTQ, which relies on the costly dequantization process.
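To make the abstract's idea concrete, below is a minimal NumPy sketch of the two ingredients it describes: a binary-coding-style weight representation (weights approximated as a scaled sum of {-1, +1} vectors) and a lookup table of precomputed partial sums that replaces the multiplications in the matrix product. The function names, the greedy quantizer, the sub-vector width `mu = 8`, the bit-width `q = 3`, and the single-column matvec are illustrative assumptions for exposition; this is not the paper's CUDA kernel or its exact group-wise formulation.

```python
import numpy as np

def bcq_quantize(w, q=3):
    """Greedy binary-coding quantization of one weight column:
    w ~= sum_i alphas[i] * bits[i], with bits[i] in {-1, +1}."""
    residual = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign plane
        a = np.abs(residual).mean()              # its scale factor
        alphas.append(a)
        bits.append(b)
        residual -= a * b
    return np.array(alphas), np.stack(bits)      # shapes (q,), (q, n)

def build_lut(x, mu=8):
    """For every length-mu sub-vector of the activation x, precompute
    x_sub . s for all 2^mu sign patterns s in {-1, +1}^mu."""
    n = len(x)
    assert n % mu == 0
    num_sub = n // mu
    patterns = np.array([[1.0 if (p >> k) & 1 else -1.0 for k in range(mu)]
                         for p in range(1 << mu)])   # (2^mu, mu)
    x_sub = x.reshape(num_sub, mu)                   # (num_sub, mu)
    return patterns @ x_sub.T                        # (2^mu, num_sub)

def lut_matvec(x, alphas, bits, mu=8):
    """Compute x . w_hat, where w_hat = sum_i alphas[i] * bits[i],
    using table lookups instead of per-element multiplications."""
    lut = build_lut(x, mu)                           # (2^mu, num_sub)
    num_sub = len(x) // mu
    acc = 0.0
    for a, b in zip(alphas, bits):
        b_sub = b.reshape(num_sub, mu)
        # pack each length-mu sign sub-vector into an integer LUT index
        idx = ((b_sub > 0).astype(np.int64)
               * (1 << np.arange(mu))).sum(axis=1)   # (num_sub,)
        acc += a * lut[idx, np.arange(num_sub)].sum()
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 64
    x = rng.standard_normal(n)
    w = rng.standard_normal(n)
    alphas, bits = bcq_quantize(w, q=3)
    approx = lut_matvec(x, alphas, bits)
    exact = x @ (alphas[:, None] * bits).sum(axis=0)
    print(approx, exact)   # the two agree; both approximate x @ w
```

The point of the table is reuse: each length-`mu` sub-vector of the activation can only produce one of `2^mu` possible partial sums, so those sums are computed once and then shared across all output columns and all bit planes, which is where the savings over a dequantize-then-GEMM pipeline come from.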
