Paper Title

DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations

Paper Authors

John Giorgi, Osvald Nitski, Bo Wang, Gary Bader

Paper Abstract

Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant. In this paper, we present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Inspired by recent advances in deep metric learning (DML), we carefully design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Importantly, our experiments suggest that the quality of the learned embeddings scale with both the number of trainable parameters and the amount of unlabelled training data. Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.
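To make the contrastive objective concrete, the short Python sketch below implements an InfoNCE/NT-Xent-style loss over paired span embeddings, with the other positives in a batch serving as negatives. It is only an illustration of the general technique named in the abstract: the pooling, span-sampling strategy, temperature value, and function names are assumptions, not the authors' exact implementation (see their public code for that).

```python
# Minimal sketch (not the authors' implementation): an NT-Xent / InfoNCE-style
# contrastive loss over paired span embeddings, using in-batch negatives.
# The temperature and toy inputs below are illustrative assumptions.

import torch
import torch.nn.functional as F


def contrastive_loss(anchors: torch.Tensor,
                     positives: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Each anchor embedding should be most similar to its own positive;
    every other positive in the batch acts as a negative.

    anchors, positives: (batch_size, embedding_dim) pooled span embeddings.
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    # Cosine-similarity logits between every anchor and every positive.
    logits = anchors @ positives.t() / temperature  # (batch, batch)
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage: random vectors stand in for pooled transformer outputs of
    # anchor spans and nearby/overlapping positive spans from the same document.
    torch.manual_seed(0)
    anchor_emb = torch.randn(8, 768)
    positive_emb = anchor_emb + 0.1 * torch.randn(8, 768)
    print(float(contrastive_loss(anchor_emb, positive_emb)))
```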
