Paper Title

AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Authors

Xinsong Zhang, Pengshuai Li, Hang Li

Abstract

Pre-trained language models such as BERT have exhibited remarkable performance in many natural language understanding (NLU) tasks. The tokens in these models are usually fine-grained, in the sense that for languages like English they are words or sub-words and for languages like Chinese they are characters. In English, for example, there are multi-word expressions that form natural lexical units, so the use of coarse-grained tokenization also appears reasonable. In fact, both fine-grained and coarse-grained tokenizations have advantages and disadvantages for learning pre-trained language models. In this paper, we propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), based on both fine-grained and coarse-grained tokenizations. For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization, employs one encoder to process the sequence of words and another encoder to process the sequence of phrases, shares parameters between the two encoders, and finally produces a sequence of contextualized representations of the words and a sequence of contextualized representations of the phrases. Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE. The results show that AMBERT outperforms BERT in all cases, and the improvements are particularly significant for Chinese. We also develop a method to improve the efficiency of AMBERT in inference; the resulting model still performs better than BERT at the same computational cost.
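The dual-view, shared-parameter design described in the abstract can be illustrated with a short PyTorch sketch. This is a minimal sketch under assumptions: the class name AMBERTSketch, the vocabulary sizes, the model dimensions, and the use of a single nn.TransformerEncoder applied to both views are illustrative choices, not the authors' released implementation or hyperparameters.

```python
import torch
import torch.nn as nn

class AMBERTSketch(nn.Module):
    """Illustrative sketch of the AMBERT idea: one Transformer encoder whose
    parameters are shared across a fine-grained (word) view and a
    coarse-grained (phrase) view of the same text. Sizes are placeholders."""

    def __init__(self, fine_vocab=30000, coarse_vocab=50000,
                 d_model=256, nhead=4, num_layers=4):
        super().__init__()
        # Separate embedding tables: the fine- and coarse-grained vocabularies differ.
        self.fine_embed = nn.Embedding(fine_vocab, d_model)
        self.coarse_embed = nn.Embedding(coarse_vocab, d_model)
        # A single Transformer encoder; running both views through it realizes
        # the "shared parameters between the two encoders" described above.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, fine_ids, coarse_ids):
        # Contextualized representations of words (fine-grained tokens).
        fine_repr = self.shared_encoder(self.fine_embed(fine_ids))
        # Contextualized representations of phrases (coarse-grained tokens).
        coarse_repr = self.shared_encoder(self.coarse_embed(coarse_ids))
        return fine_repr, coarse_repr

# Hypothetical usage: token ids from the two tokenizations of one sentence.
fine_ids = torch.randint(0, 30000, (1, 16))    # e.g. word/sub-word ids
coarse_ids = torch.randint(0, 50000, (1, 10))  # e.g. phrase ids
fine_repr, coarse_repr = AMBERTSketch()(fine_ids, coarse_ids)
print(fine_repr.shape, coarse_repr.shape)      # (1, 16, 256) and (1, 10, 256)
```

Sharing one encoder across both views keeps the parameter count close to that of a single-grained model while still yielding two sequences of contextualized representations, one per granularity, as the abstract describes.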
