Paper Title

Will it Unblend?

Authors

Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

Abstract

Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify the difficulty of interpreting the meanings of blends by large-scale contextual language models such as BERT. We first show that BERT's processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find this is mostly due to the loss of characters resulting from blend formation. Then, we assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory.
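To see why character loss makes blends hard for subword models, consider how a greedy longest-match-first (WordPiece-style) tokenizer might segment "innoventor". The sketch below uses a toy vocabulary chosen for illustration — it is not BERT's actual vocabulary, and the resulting segmentation is an assumption, not the paper's reported output:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece-style.

    Returns a list of subword pieces (continuation pieces carry a
    leading '##'), or None if the word cannot be covered by vocab.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return None  # no vocabulary entry covers this span
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (an illustrative assumption, not BERT's real one).
vocab = {"in", "inn", "##no", "##ove", "##vent", "##ntor", "##or"}

print(wordpiece_tokenize("inventor", vocab))    # ['in', '##vent', '##or']
print(wordpiece_tokenize("innoventor", vocab))  # ['inn', '##ove', '##ntor']
```

Under this toy vocabulary, the base word "inventor" splits into recognizable morpheme-like pieces, while the blend "innoventor" splits into pieces that align with neither "innovator" nor "inventor" — a small concrete instance of the semantic impoverishment the abstract describes.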
