Paper Title

COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models

Paper Authors

Bowen Shen, Zheng Lin, Yuanxin Liu, Zhengxiao Liu, Lei Wang, Weiping Wang

Paper Abstract

Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity. For resource-constrained devices, there is an urgent need for a spatially and temporally efficient model that retains the major capacity of PLMs. However, existing statically compressed models are unaware of the diverse complexities between input instances, potentially resulting in redundancy for simple inputs and inadequacy for complex ones. Also, miniature models with early exiting encounter challenges in the trade-off between making predictions and serving the deeper layers. Motivated by such considerations, we propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration. Specifically, the PLM is slenderized in width while the depth remains intact, complementing layer-wise early exiting to speed up inference dynamically. To address the trade-off of early exiting, we propose a joint training approach that calibrates slenderization and preserves contributive structures to each exit instead of only the final layer. Experiments are conducted on the GLUE benchmark, and the results verify the Pareto optimality of our approach at high compression and acceleration rates, with 1/8 the parameters and 1/19 the FLOPs of BERT.
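To make the multi-exit idea concrete, below is a minimal PyTorch sketch of an encoder with one exit classifier per layer, trained with a joint loss that supervises every exit (not only the final layer) and served with confidence-based early exiting. The names (MultiExitEncoder, joint_exit_loss, early_exit_predict), the entropy criterion, and the threshold value are illustrative assumptions, not details from the paper; in particular, COST-EFF's joint training additionally calibrates the width slenderization, which this sketch omits.

```python
# A minimal sketch, assuming an entropy threshold as the exit criterion.
# All class/function names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiExitEncoder(nn.Module):
    def __init__(self, hidden: int, num_layers: int, num_classes: int):
        super().__init__()
        # Narrow ("slenderized") Transformer layers; depth is kept intact
        # so that every layer can host an exit.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=hidden, nhead=4,
                dim_feedforward=2 * hidden, batch_first=True,
            )
            for _ in range(num_layers)
        )
        # One lightweight classifier ("exit") attached to each layer.
        self.exits = nn.ModuleList(
            nn.Linear(hidden, num_classes) for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # Training pass: collect logits from every exit so all of them
        # receive supervision.
        all_logits = []
        for layer, exit_head in zip(self.layers, self.exits):
            x = layer(x)
            all_logits.append(exit_head(x.mean(dim=1)))  # mean-pool tokens
        return all_logits

def joint_exit_loss(all_logits: list[torch.Tensor],
                    labels: torch.Tensor) -> torch.Tensor:
    # Joint training objective: sum the task loss over every exit,
    # instead of supervising only the final layer.
    return sum(F.cross_entropy(logits, labels) for logits in all_logits)

@torch.no_grad()
def early_exit_predict(model: MultiExitEncoder, x: torch.Tensor,
                       threshold: float = 0.3) -> torch.Tensor:
    # Inference pass: stop at the first exit whose prediction entropy
    # is below the threshold; easy inputs leave early, hard ones go deep.
    for layer, exit_head in zip(model.layers, model.exits):
        x = layer(x)
        probs = F.softmax(exit_head(x.mean(dim=1)), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        if entropy.max().item() < threshold:  # whole batch is confident
            return probs.argmax(dim=-1)
    return probs.argmax(dim=-1)  # fall back to the last exit
```

Supervising every exit during training is what makes the inference-time shortcut viable: the early layers must remain predictive on their own rather than merely serving as feature extractors for the deeper layers, which is the trade-off the abstract refers to.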
