Paper Title

An Efficient Active Learning Pipeline for Legal Text Classification

Paper Authors

Sepideh Mamooler, Rémi Lebret, Stéphane Massonnet, Karl Aberer

Abstract

Active Learning (AL) is a powerful tool for learning with less labeled data, in particular for specialized domains such as legal documents, where unlabeled data is abundant but annotation requires domain expertise and is thus expensive. Recent works have shown the effectiveness of AL strategies for pre-trained language models. However, most AL strategies require an initial set of labeled samples, which is expensive to acquire. In addition, pre-trained language models have been shown to be unstable during fine-tuning on small datasets, and their embeddings are not semantically meaningful. In this work, we propose a pipeline for effectively using active learning with pre-trained language models in the legal domain. To this end, we leverage the available unlabeled data in three phases. First, we continue pre-training the model to adapt it to the downstream task. Second, we use knowledge distillation to guide the model's embeddings to a semantically meaningful space. Finally, we propose a simple yet effective strategy for finding the initial set of labeled samples with fewer actions than existing methods. Our experiments on Contract-NLI, adapted to the classification task, and the LEDGAR benchmark show that our approach outperforms standard AL strategies and is more efficient. Furthermore, our pipeline reaches results comparable to the fully supervised approach, with a small performance gap and dramatically reduced annotation cost. Code and the adapted data will be made available.
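To make the pool-based setting concrete, here is a minimal, self-contained sketch of an AL loop of the kind the abstract describes: a label-free cold-start selection of the initial set, followed by uncertainty-driven annotation rounds. Everything in it is a simplified stand-in, not the paper's method: the encoder is the identity, the classifier is nearest-centroid instead of a fine-tuned language model, the cold-start step uses a generic greedy k-center heuristic (the paper proposes its own initial-set strategy), and `POOL`/`ORACLE`/`run_al` are hypothetical names for illustration only.

```python
import math

# Toy pool: two well-separated 2-D clusters. True labels are known only
# to the "oracle", i.e. the annotator we pay one query at a time.
POOL = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.8),
        (5.0, 5.0), (5.3, 4.8), (4.7, 5.2)]
ORACLE = [0, 0, 0, 1, 1, 1]

def kcenter_init(pool, k):
    """Cold-start selection without any labels: greedy k-center over the
    (embedded) pool, so the initial batch is diverse. A common heuristic,
    standing in for the paper's own initial-set strategy."""
    chosen = [0]
    while len(chosen) < k:
        far = max((i for i in range(len(pool)) if i not in chosen),
                  key=lambda i: min(math.dist(pool[i], pool[j]) for j in chosen))
        chosen.append(far)
    return chosen

def centroids(pool, labeled):
    """Nearest-centroid 'model' as a stand-in for the fine-tuned LM."""
    groups = {}
    for i, y in labeled.items():
        groups.setdefault(y, []).append(pool[i])
    return {y: tuple(sum(c) / len(pts) for c in zip(*pts))
            for y, pts in groups.items()}

def margin(x, cents):
    """Uncertainty = negative margin between the two nearest centroids
    (assumes both classes are already represented in the labeled set)."""
    d = sorted(math.dist(x, c) for c in cents.values())
    return -(d[1] - d[0])

def run_al(budget=2):
    # Phase 3 of the pipeline: pick the initial set, then query the oracle.
    labeled = {i: ORACLE[i] for i in kcenter_init(POOL, 2)}
    for _ in range(budget):  # AL rounds: fit, score, annotate most uncertain
        cents = centroids(POOL, labeled)
        q = max((i for i in range(len(POOL)) if i not in labeled),
                key=lambda i: margin(POOL[i], cents))
        labeled[q] = ORACLE[q]
    cents = centroids(POOL, labeled)
    preds = [min(cents, key=lambda y: math.dist(x, cents[y])) for x in POOL]
    return sum(p == y for p, y in zip(preds, ORACLE)) / len(POOL)

print(run_al())  # accuracy on the toy pool after 2 + 2 annotations -> 1.0
```

On this toy pool, four annotations suffice for perfect accuracy; the point of the sketch is only the control flow (cold start, then iterative query-and-label), not the components, which the paper replaces with adaptive pre-training, distillation, and its own selection strategy.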
