Paper Title

Do we need Label Regularization to Fine-tune Pre-trained Language Models?

Paper Authors

Ivan Kobyzev, Aref Jafari, Mehdi Rezagholizadeh, Tianda Li, Alan Do-Omri, Peng Lu, Pascal Poupart, Ali Ghodsi

Paper Abstract

Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work concerns studying different label regularization techniques and whether we actually need them to improve the fine-tuning of smaller PLM networks on downstream tasks. In this regard, we did a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials and ran each configuration five times. This investigation led to a surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, and additional label regularization is unnecessary.

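For readers unfamiliar with the two objectives the abstract compares, the sketch below shows the standard formulations of the knowledge-distillation loss and the teacher-free label-smoothing loss. This is an illustrative PyTorch sketch, not code from the paper; the temperature `T`, mixing weight `alpha`, and smoothing factor `eps` are assumed hyperparameters chosen only for demonstration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hinton-style KD: hard-label cross-entropy mixed with a KL term
    # against the teacher's temperature-softened distribution.
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kl

def label_smoothing_loss(student_logits, labels, eps=0.1):
    # Teacher-free alternative: cross-entropy against uniformly smoothed targets.
    num_classes = student_logits.size(-1)
    log_probs = F.log_softmax(student_logits, dim=-1)
    one_hot = F.one_hot(labels, num_classes).float()
    smooth_targets = (1 - eps) * one_hot + eps / num_classes
    return -(smooth_targets * log_probs).sum(dim=-1).mean()

# Toy usage with random logits for a 3-class task.
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(kd_loss(student, teacher, labels), label_smoothing_loss(student, labels))
```

The paper's central finding is that, when the student is itself a pre-trained model, fine-tuning with plain cross-entropy performs on par with either of these label-regularized objectives.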