Paper Title

Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization

Paper Authors

Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan

Paper Abstract

We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance while limiting the computational overhead of QAT. Density ratio Language Model fusion has shown remarkable accuracy gains on RNN-T workloads but it severely increases the computational cost of inference. We show that our quantization strategies enable using large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full model compression ratio of 7.6$\times$ compared to the full precision model. Via hardware simulations, we estimate a 3.4$\times$ acceleration from FP16 to INT4 for the end-to-end quantized RNN-T inclusive of LM fusion, resulting in a Real Time Factor (RTF) of 0.06. On the NIST Hub5 2000, Hub5 2001, and RT-03 test sets, we retain most of the gains associated with LM fusion, improving the average WER by $>$1.5%.
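As a rough illustration of the 4-bit weight-and-activation quantization described above, the sketch below shows symmetric INT4 fake quantization (quantize-dequantize) of the kind typically inserted into a network during QAT. This is a minimal sketch assuming PyTorch; the function name, the per-tensor vs. per-channel choice, and the clipping rule are illustrative placeholders, not the customized, locally tailored schemes the paper actually uses.

```python
import torch

def fake_quant_int4(x: torch.Tensor, per_channel: bool = False) -> torch.Tensor:
    """Symmetric INT4 fake quantization (quantize-dequantize).

    Illustrative only: the paper tailors the quantization scheme to the
    local properties of each part of the RNN-T; this shows the generic idea.
    """
    levels = 2 ** 3 - 1  # signed 4-bit integers span [-8, 7]
    if per_channel:
        # one scale per output channel (dim 0), as is common for weight tensors
        max_abs = x.abs().amax(dim=tuple(range(1, x.dim())), keepdim=True)
    else:
        # a single scale for the whole tensor, as is common for activations
        max_abs = x.abs().amax()
    scale = max_abs.clamp(min=1e-8) / levels
    q = torch.clamp(torch.round(x / scale), -levels - 1, levels)  # INT4 grid
    return q * scale  # dequantize so the rest of the network runs as usual

# During QAT, the rounding step is usually bypassed in the backward pass with a
# straight-through estimator, e.g.:
#   x_q = x + (fake_quant_int4(x) - x).detach()
```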

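The density ratio LM fusion mentioned in the abstract combines three log-probabilities per hypothesis during beam search. A standard form of the score, following the original density-ratio formulation (the interpolation weights are tuned on held-out data and are not taken from this paper), is:

```latex
% Density-ratio LM fusion: score for hypothesis y given acoustic input x.
% p_{RNN-T} is the transducer, p_{ext} an external LM trained on target-domain
% text, and p_{src} a source-domain LM approximating the transducer's implicit
% LM; lambda_ext and lambda_src are tuned interpolation weights.
\log S(y \mid x) \;=\; \log p_{\text{RNN-T}}(y \mid x)
  \;+\; \lambda_{\text{ext}} \log p_{\text{ext}}(y)
  \;-\; \lambda_{\text{src}} \log p_{\text{src}}(y)
```

Because the external and source LMs must be evaluated for every hypothesis in the beam, fusion multiplies the cost of inference, which is why quantizing the language models as well as the acoustic encoder matters for streaming-compatible runtimes.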