Paper title
Sub-8-bit quantization for on-device speech recognition: a regularization-free approach
Paper authors
Paper abstract
For on-device automatic speech recognition (ASR), quantization-aware training (QAT) is widely used to trade off model predictive performance against efficiency. A major drawback of existing QAT methods is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a mu-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. We observe a 30.73% memory footprint saving and 31.75% user-perceived latency reduction compared to 8-bit QAT via physical device benchmarking.
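The abstract does not spell out GQ's formulation, but the ingredients it names (a mu-Law constrained space, self-adjustable centroids, and a "soft-to-hard" assignment without a regularizer) can be illustrated with a minimal sketch. The code below is an assumption of how such a quantizer might look in PyTorch; the `SoftQuantizer` class, the softmax-based soft assignment, and the temperature annealing are illustrative choices, not the paper's actual implementation.

```python
# Hypothetical sketch of a "soft-to-hard" quantizer with self-adjustable
# centroids in a mu-Law-compressed space, loosely following the abstract.
# Centroid initialization, distance metric, and temperature schedule are
# assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn


def mu_law(x: torch.Tensor, mu: float = 255.0) -> torch.Tensor:
    """Map weights into the mu-Law-compressed space [-1, 1]."""
    return torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(torch.tensor(mu))


def mu_law_inverse(y: torch.Tensor, mu: float = 255.0) -> torch.Tensor:
    """Map compressed values back to the original weight space."""
    return torch.sign(y) * ((1.0 + mu) ** y.abs() - 1.0) / mu


class SoftQuantizer(nn.Module):
    """Soft assignment of weights to learnable centroids (no regularizer).

    num_bits=3 gives 8 centroids, i.e. sub-8-bit quantization; lowering
    the softmax temperature over training pushes the soft assignment
    toward a hard one (the "soft-to-hard" transition).
    """

    def __init__(self, num_bits: int = 3, mu: float = 255.0):
        super().__init__()
        self.mu = mu
        # Centroids live in the mu-Law space and are updated by backprop,
        # so they are self-adjusting rather than predetermined and fixed.
        init = torch.linspace(-1.0, 1.0, 2 ** num_bits)
        self.centroids = nn.Parameter(init)

    def forward(self, w: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        w_c = mu_law(w, self.mu)                                  # compress
        dist = (w_c.reshape(-1, 1) - self.centroids.reshape(1, -1)) ** 2
        soft_assign = torch.softmax(-dist / temperature, dim=-1)  # soft one-hot
        w_q = soft_assign @ self.centroids                        # soft snap
        return mu_law_inverse(w_q, self.mu).reshape(w.shape)      # expand back


# Usage: quantize a layer's weights during a QAT forward pass.
layer = nn.Linear(256, 256)
quantizer = SoftQuantizer(num_bits=3)
w_quantized = quantizer(layer.weight, temperature=0.1)
```

Under these assumptions, the quantizer stays differentiable throughout training, so the centroids can move with the weights; only at (or near) zero temperature does it collapse to a conventional hard lookup suitable for sub-8-bit inference.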