Paper title
General-to-Specific Transfer Labeling for Domain Adaptable Keyphrase Generation
Paper authors
Paper abstract
Training keyphrase generation (KPG) models requires a large amount of annotated data, which can be prohibitively expensive and is often limited to specific domains. In this study, we first demonstrate that large distribution shifts among different domains severely hinder the transferability of KPG models. We then propose a three-stage pipeline that gradually shifts KPG models' learning focus from general syntactic features to domain-related semantics in a data-efficient manner. In the first stage, Domain-general Phrase pre-training, we pre-train sequence-to-sequence models with generic phrase annotations that are widely available on the web, which enables the models to generate phrases in a wide range of domains. The resulting model is then applied in the Transfer Labeling stage to produce domain-specific pseudo keyphrases, which help adapt models to a new domain. Finally, we fine-tune the model on a limited amount of data with true labels to fully adapt it to the target domain. Our experimental results show that the proposed process can produce good-quality keyphrases in new domains and achieves consistent improvements after adaptation with limited in-domain annotated data. All code and datasets are available at https://github.com/memray/OpenNMT-kpg-release.
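The control flow of the three stages described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `ToyPhraseModel`, `adapt`, and the fine-tuning weight of 5.0 are hypothetical names and values introduced here, and the "model" merely memorizes weighted phrase counts instead of training a sequence-to-sequence network. It only shows how pre-training, pseudo-labeling of unlabeled in-domain text, and fine-tuning on a few true labels could be chained.

```python
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, List[str]]  # (document text, keyphrases)


class ToyPhraseModel:
    """Hypothetical stand-in for a sequence-to-sequence KPG model.

    It memorizes weighted phrase counts and "generates" the known
    phrases that occur in a document, highest weight first.
    """

    def __init__(self) -> None:
        self.weights: Counter = Counter()

    def train(self, examples: List[Example], weight: float = 1.0) -> None:
        # Each target phrase seen during training gains `weight`.
        for _, phrases in examples:
            for phrase in phrases:
                self.weights[phrase.lower()] += weight

    def predict(self, document: str, top_k: int = 3) -> List[str]:
        text = document.lower()
        found = [p for p in self.weights if p in text]
        # Descending weight, then alphabetical order for determinism.
        found.sort(key=lambda p: (-self.weights[p], p))
        return found[:top_k]


def adapt(
    general: List[Example],
    unlabeled_target: List[str],
    labeled_target: List[Example],
) -> ToyPhraseModel:
    model = ToyPhraseModel()
    # Stage 1: pre-train on generic, domain-general phrase annotations.
    model.train(general)
    # Stage 2: transfer labeling. The pre-trained model labels unlabeled
    # in-domain documents, and the pseudo keyphrases become training data.
    pseudo = [(doc, model.predict(doc)) for doc in unlabeled_target]
    model.train(pseudo)
    # Stage 3: fine-tune on the small set of true in-domain labels.
    # The weight 5.0 is an arbitrary illustrative choice.
    model.train(labeled_target, weight=5.0)
    return model


general = [
    ("Neural networks learn representations.", ["neural networks"]),
    ("Protein folding is hard.", ["protein folding"]),
]
unlabeled = ["We study protein folding with neural networks in biology."]
labeled = [("Gene expression and protein folding.", ["gene expression"])]

model = adapt(general, unlabeled, labeled)
print(model.predict("Gene expression controls protein folding."))
# prints ['gene expression', 'protein folding']
```

In a real setting each `train` call would be a gradient-based training run and `predict` would be decoding from the model; the toy version keeps only the ordering of the stages and the reuse of one model across them.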