Paper Title
Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives
Paper Authors
Paper Abstract
Following SimCSE, contrastive-learning-based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this to the quality of the positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence. This counteracts the intrinsic bias of pre-trained token embeddings toward frequency, word case, and subwords. For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model. Combining these two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
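For illustration only (not code from the paper), below is a minimal Python sketch of switch-case augmentation as the abstract describes it. The per-word flip probability `p`, the whitespace tokenization, and the function name are assumptions, not details taken from the paper.

```python
import random

def switch_case_augment(sentence: str, p: float = 0.15, seed=None) -> str:
    """Flip the case of the first letter of randomly selected words.

    NOTE: `p` is a hypothetical per-word flip probability; the abstract
    does not specify how words are selected, so this is an assumption.
    """
    rng = random.Random(seed)
    augmented = []
    for w in sentence.split():
        if w and w[0].isalpha() and rng.random() < p:
            # Swap upper <-> lower case on the first character only.
            first = w[0].lower() if w[0].isupper() else w[0].upper()
            w = first + w[1:]
        augmented.append(w)
    return " ".join(augmented)

# Example: randomly flips leading-letter case, e.g. "The" -> "the", "fox" -> "Fox".
print(switch_case_augment("The quick brown fox jumps over the lazy dog", p=0.3, seed=0))
```

Similarly, a minimal sketch of hard-negative retrieval via nearest-neighbor search over sentence embeddings. The abstract only says hard negatives are sampled from the whole dataset based on a pre-trained language model; the use of cosine similarity over precomputed vectors is an assumption here.

```python
import numpy as np

def retrieve_hard_negatives(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """For each sentence, return indices of its k most similar *other*
    sentences, to serve as hard negatives.

    NOTE: `embeddings` is assumed to be an (N, d) array of sentence vectors
    from a pre-trained language model; the paper's exact retrieval and
    sampling procedure may differ.
    """
    # Cosine similarity via L2-normalized dot products.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude the sentence itself
    # Indices of the k highest-similarity neighbors per row.
    return np.argsort(-sim, axis=1)[:, :k]

# Example with random vectors standing in for real sentence embeddings.
rng = np.random.default_rng(0)
fake_emb = rng.normal(size=(5, 8))
print(retrieve_hard_negatives(fake_emb, k=2))
```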