Paper title
A simple method for domain adaptation of sentence embeddings
Authors
Abstract
Pre-trained sentence embeddings have proven useful for a wide range of NLP tasks. Because training such embeddings requires a large amount of data, they are commonly trained on a wide variety of text data. Adapting them to a specific domain can improve results in many cases, but such fine-tuning is usually problem-dependent and risks over-adapting to the data used for adaptation. In this paper, we present a simple, universal method for fine-tuning Google's Universal Sentence Encoder (USE) using a Siamese architecture. We demonstrate how to apply this approach to a variety of data sets and present results on different data sets representing similar problems. We also compare the approach with traditional fine-tuning on these data sets. As a further advantage, the approach can be used to combine data sets with different annotations. Finally, we present an embedding fine-tuned on all data sets in parallel.
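To make the idea of Siamese fine-tuning concrete, the following is a minimal sketch and not the paper's implementation. It assumes a frozen sentence encoder (here a hypothetical toy function, `pretrained_encode`, standing in for USE), a single trainable linear adaptation layer `W` shared by both branches, and a squared-error loss between the cosine similarity of the two adapted embeddings and a similarity label. The sentence pairs, labels, dimensionality, and learning rate are all illustrative assumptions.

```python
import numpy as np

DIM = 16  # toy embedding size (assumption); USE itself produces 512-dimensional vectors


def pretrained_encode(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for a frozen pre-trained encoder such as USE.

    Each token gets a deterministic pseudo-random vector; the sentence
    embedding is the normalized sum of its token vectors.
    """
    vec = np.zeros(DIM)
    for token in sentence.lower().split():
        seed = sum(ord(c) for c in token)
        vec += np.random.default_rng(seed).standard_normal(DIM)
    return vec / max(np.linalg.norm(vec), 1e-12)


def loss_and_grad(W, a, b, y):
    """Squared error between cos(W a, W b) and label y, with its gradient in W."""
    u, v = W @ a, W @ b  # Siamese step: the same weights W encode both inputs
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    s = (u @ v) / (nu * nv)  # cosine similarity of the adapted embeddings
    ds_du = v / (nu * nv) - s * u / nu**2
    ds_dv = u / (nu * nv) - s * v / nv**2
    grad = 2.0 * (s - y) * (np.outer(ds_du, a) + np.outer(ds_dv, b))
    return (s - y) ** 2, grad


# Illustrative sentence pairs with similarity labels (1.0 = similar, 0.0 = dissimilar).
pairs = [
    ("the battery drains quickly", "battery life is short", 1.0),
    ("the battery drains quickly", "the screen is very bright", 0.0),
    ("shipping was fast", "delivery arrived early", 1.0),
    ("shipping was fast", "battery life is short", 0.0),
]
data = [(pretrained_encode(s1), pretrained_encode(s2), y) for s1, s2, y in pairs]


def mean_loss(W):
    return float(np.mean([loss_and_grad(W, a, b, y)[0] for a, b, y in data]))


W = np.eye(DIM)  # start from the unmodified pre-trained embedding space
initial_loss = mean_loss(W)
for _ in range(300):
    grad = np.mean([loss_and_grad(W, a, b, y)[1] for a, b, y in data], axis=0)
    W -= 0.05 * grad  # plain gradient descent on the shared adaptation layer
final_loss = mean_loss(W)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Because only pairwise similarity labels are needed, data sets with different original annotations could in principle be converted into such pairs and trained together, which is one way to read the abstract's remark about combining data sets.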