Title
Rank and run-time aware compression of NLP Applications
Authors
Abstract
Sequence-model-based NLP applications can be large. Yet many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time or task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HMF) that achieves this dual objective. HMF improves on low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it achieves faster inference run-time than pruning or structured-matrix-based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (translation, intent detection, language modeling) and show that, for similar accuracy values and compression factors, HMF can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
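To make the hybrid idea in the abstract concrete, the following is a minimal numpy sketch, not the paper's exact construction: instead of factorizing the whole weight matrix at a higher rank (plain LMF), a slice of rows is kept fully dense and only the remainder is low-rank factorized. The split point (`n_dense_rows`) and rank used here are illustrative assumptions.

```python
import numpy as np

def lmf(W, rank):
    """Plain low-rank matrix factorization (LMF) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # shape (m, rank)
    B = Vt[:rank, :]             # shape (rank, n)
    return A, B

def hmf(W, n_dense_rows, rank):
    """Hybrid split: first rows stay dense, the rest are rank-`rank` factorized."""
    dense = W[:n_dense_rows, :]          # preserved exactly, no approximation
    A, B = lmf(W[n_dense_rows:, :], rank)
    return dense, A, B

def hmf_reconstruct(dense, A, B):
    # Reassemble the full matrix: exact dense slice stacked on the low-rank part.
    return np.vstack([dense, A @ B])

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))

dense, A, B = hmf(W, n_dense_rows=8, rank=8)
W_hat = hmf_reconstruct(dense, A, B)

# The dense slice is reproduced exactly; only the factorized rows are approximate.
assert np.allclose(W_hat[:8], W[:8])

# Stored parameters for the hybrid form: dense slice + two low-rank factors.
params_hmf = dense.size + A.size + B.size
print(params_hmf)
```

Because inference over both the dense slice and the two factor matrices is still ordinary dense matrix multiplication, this layout avoids the irregular sparsity of pruning, which is where the run-time advantage claimed in the abstract comes from.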