Paper Title

Compressing Language Models using Doped Kronecker Products

Authors

Thakker, Urmish, Whatmough, Paul N., Liu, Zhi-Gang, Mattina, Matthew, Beu, Jesse

Abstract

Kronecker Products (KP) have been used to compress IoT RNN applications by 15-38x compression factors, achieving better results than traditional compression methods. However, when KP is applied to large Natural Language Processing tasks, it leads to significant accuracy loss (approximately 26%). This paper proposes a way to recover accuracy otherwise lost when applying KP to large NLP tasks, by allowing additional degrees of freedom in the KP matrix. More formally, we propose doping, a process of adding an extremely sparse overlay matrix on top of the pre-defined KP structure. We call this compression method doped Kronecker product compression. To train these models, we present a new solution to the phenomenon of co-matrix adaptation (CMA), which uses a new regularization scheme called co-matrix dropout regularization (CMR). We present experimental results that demonstrate compression of a large language model with LSTM layers of size 25 MB by 25x with 1.4% loss in perplexity score. At 25x compression, an equivalent pruned network leads to 7.9% loss in perplexity score, while HMD and LMF lead to 15% and 27% loss in perplexity score, respectively.
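To make the doping idea concrete, the sketch below is a rough illustration only (the matrix sizes, 1% sparsity, and random values are arbitrary assumptions, not the paper's actual configuration or training procedure): a weight matrix is formed as the Kronecker product of two small factor matrices plus an extremely sparse overlay, and a back-of-the-envelope compression factor is computed.

import torch

# Illustrative sizes only (not taken from the paper): approximate a 256 x 256
# weight matrix W as kron(A, B) + S, where A and B are small dense factors and
# S is an extremely sparse "doping" overlay.
m = n = 256
A = torch.randn(16, 16)
B = torch.randn(16, 16)

# Sparse overlay: keep a small, randomly chosen fraction of entries (assumed 1% here).
density = 0.01
mask = (torch.rand(m, n) < density).float()
S = torch.randn(m, n) * mask

# Doped Kronecker product reconstruction of the weight matrix.
W = torch.kron(A, B) + S

# Rough parameter count: the two small factors plus the nonzeros of S
# (index storage for S is ignored in this estimate).
dense_params = m * n
doped_params = A.numel() + B.numel() + int(mask.sum().item())
print(f"compression factor ~ {dense_params / doped_params:.1f}x")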
