Paper Title
Hierarchical Multitask Learning Approach for BERT
Paper Authors
Paper Abstract
Recent works show that learning contextualized embeddings for words is beneficial for downstream tasks. BERT is one successful example of this approach. It learns embeddings by solving two tasks: masked language modeling (masked LM) and next sentence prediction (NSP). The pre-training of BERT can also be framed as a multitask learning problem. In this work, we adopt hierarchical multitask learning approaches for BERT pre-training. The pre-training tasks are solved at different layers instead of only the last layer, so that information from the NSP task is transferred to the masked LM task. We also propose a new pre-training task, bigram shift, to encode word-order information. We choose two downstream tasks, one requiring sentence-level embeddings (textual entailment) and the other requiring contextualized word embeddings (question answering). Due to computational restrictions, we pre-train on the downstream task data rather than a large corpus, which also shows how the proposed models perform when only a restricted dataset is available. We further evaluate the learned embeddings on several probing tasks. Our results show that imposing a task hierarchy in pre-training improves the quality of the embeddings.
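To make the hierarchical setup concrete, the following is a minimal PyTorch-style sketch of the idea described in the abstract: the NSP head is attached to an intermediate encoder layer and the masked LM head to the final layer, so the lower-level sentence task supervises the lower layers and its representations feed the token-level task above. The class name `HierarchicalBertHeads`, the split point `nsp_layer`, and the use of a generic `nn.TransformerEncoder` in place of a full BERT encoder are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch only: hierarchical multitask heads at different depths.
import torch
import torch.nn as nn


class HierarchicalBertHeads(nn.Module):
    def __init__(self, hidden_size=768, vocab_size=30522,
                 num_layers=12, nsp_layer=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=12, batch_first=True)
        # Split the encoder stack so hidden states are exposed at nsp_layer.
        self.lower = nn.TransformerEncoder(layer, num_layers=nsp_layer)
        self.upper = nn.TransformerEncoder(layer,
                                           num_layers=num_layers - nsp_layer)
        self.nsp_head = nn.Linear(hidden_size, 2)           # intermediate layer
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # final layer

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, hidden_size) token embeddings
        h_mid = self.lower(embeddings)
        nsp_logits = self.nsp_head(h_mid[:, 0])  # [CLS]-style sentence vector
        h_top = self.upper(h_mid)
        mlm_logits = self.mlm_head(h_top)        # per-token vocabulary logits
        return nsp_logits, mlm_logits


# Joint loss: NSP supervises the lower layers, masked LM the full stack.
model = HierarchicalBertHeads()
x = torch.randn(2, 16, 768)
nsp_logits, mlm_logits = model(x)
loss = (nn.functional.cross_entropy(nsp_logits, torch.tensor([0, 1])) +
        nn.functional.cross_entropy(mlm_logits.view(-1, 30522),
                                    torch.randint(0, 30522, (2 * 16,))))
```

The proposed bigram shift task (predicting whether two adjacent tokens have been swapped) could be attached in the same way as an additional head at a chosen layer; it is omitted here to keep the sketch focused on the layer hierarchy.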