Paper Title
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
Paper Authors
Paper Abstract
Pre-trained universal feature extractors, such as BERT for natural language processing and VGG for computer vision, have become effective methods for improving deep learning models without requiring more labeled data. While effective, feature extractors like BERT may be prohibitively large for some deployment scenarios. We explore weight pruning for BERT and ask: how does compression during pre-training affect transfer learning? We find that pruning affects transfer learning in three broad regimes. Low levels of pruning (30-40%) do not affect pre-training loss or transfer to downstream tasks at all. Medium levels of pruning increase the pre-training loss and prevent useful pre-training information from being transferred to downstream tasks. High levels of pruning additionally prevent models from fitting downstream datasets, leading to further degradation. Finally, we observe that fine-tuning BERT on a specific task does not improve its prunability. We conclude that BERT can be pruned once during pre-training rather than separately for each task without affecting performance.
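To make the kind of compression discussed above concrete, the sketch below applies one-shot magnitude (L1-unstructured) weight pruning to the linear layers of a pre-trained BERT encoder. It is a minimal illustration, assuming PyTorch and the Hugging Face transformers library, and is not the paper's exact procedure (the paper prunes during pre-training); the 30% sparsity target simply mirrors the "low" pruning regime described in the abstract.

    # Minimal sketch: one-shot magnitude pruning of BERT's encoder weights.
    # Assumes PyTorch and the Hugging Face `transformers` library are installed;
    # this is an illustration, not the paper's pre-training-time pruning schedule.
    import torch
    import torch.nn.utils.prune as prune
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")

    SPARSITY = 0.30  # the 30-40% "low" regime from the abstract

    # Zero out the lowest-magnitude 30% of weights in each linear layer of the encoder.
    for module in model.encoder.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=SPARSITY)
            prune.remove(module, "weight")  # make the pruning mask permanent

    # Report the resulting sparsity of the encoder's weight matrices.
    total, zeros = 0, 0
    for name, param in model.encoder.named_parameters():
        if name.endswith("weight"):
            total += param.numel()
            zeros += (param == 0).sum().item()
    print(f"Encoder weight sparsity: {zeros / total:.1%}")

In the paper's setting, pruning like this would be followed by fine-tuning the pruned model on a downstream task to measure how much useful pre-training information survives compression.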