Paper Title

Compressing Language Models using Doped Kronecker Products

Authors

Thakker, Urmish, Whatmough, Paul N., Liu, Zhi-Gang, Mattina, Matthew, Beu, Jesse

Abstract

Kronecker Products (KP) have been used to compress IoT RNN applications by 15-38x compression factors, achieving better results than traditional compression methods. However, when KP is applied to large Natural Language Processing tasks, it leads to significant accuracy loss (approximately 26%). This paper proposes a way to recover accuracy otherwise lost when applying KP to large NLP tasks, by allowing additional degrees of freedom in the KP matrix. More formally, we propose doping, a process of adding an extremely sparse overlay matrix on top of the pre-defined KP structure. We call this compression method doped Kronecker product compression. To train these models, we present a new solution to the phenomenon of co-matrix adaptation (CMA), which uses a new regularization scheme called co-matrix dropout regularization (CMR). We present experimental results that demonstrate compression of a large language model with LSTM layers of size 25 MB by 25x with 1.4% loss in perplexity score. At 25x compression, an equivalent pruned network leads to 7.9% loss in perplexity score, while HMD and LMF lead to 15% and 27% loss in perplexity score, respectively.
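To make the doping idea concrete, the sketch below is a rough illustration only (the matrix sizes, 1% sparsity, and random values are arbitrary assumptions, not the paper's actual configuration or training procedure): a weight matrix is formed as the Kronecker product of two small factor matrices plus an extremely sparse overlay, and a back-of-the-envelope compression factor is computed.

import torch

# Illustrative sizes only (not taken from the paper): approximate a 256 x 256
# weight matrix W as kron(A, B) + S, where A and B are small dense factors and
# S is an extremely sparse "doping" overlay.
m = n = 256
A = torch.randn(16, 16)
B = torch.randn(16, 16)

# Sparse overlay: keep a small, randomly chosen fraction of entries (assumed 1% here).
density = 0.01
mask = (torch.rand(m, n) < density).float()
S = torch.randn(m, n) * mask

# Doped Kronecker product reconstruction of the weight matrix.
W = torch.kron(A, B) + S

# Rough parameter count: the two small factors plus the nonzeros of S
# (index storage for S is ignored in this estimate).
dense_params = m * n
doped_params = A.numel() + B.numel() + int(mask.sum().item())
print(f"compression factor ~ {dense_params / doped_params:.1f}x")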
