Paper Title
LeiBi@COLIEE 2022: Aggregating Tuned Lexical Models with a Cluster-driven BERT-based Model for Case Law Retrieval
Paper Authors
Paper Abstract
This paper summarizes our approaches submitted to the case law retrieval task in the Competition on Legal Information Extraction/Entailment (COLIEE) 2022. Our methodology consists of four steps. In detail, given a legal case as a query, we reformulate it by extracting various meaningful sentences or n-grams. Then, we utilize the pre-processed query case to retrieve an initial set of possibly relevant legal cases, which we further re-rank. Lastly, we aggregate the relevance scores obtained by the first-stage and the re-ranking models to improve retrieval effectiveness. In each step of our methodology, we explore various well-known and novel methods. In particular, to reformulate the query cases, aiming to make them shorter, we extract unigrams using three different statistical methods (KLI, PLM, and IDF-r) as well as models that leverage embeddings (e.g., KeyBERT). Moreover, we investigate whether automatic summarization using the Longformer-Encoder-Decoder (LED) can produce an effective query representation for this retrieval task. Furthermore, we propose a novel cluster-driven re-ranking approach, which leverages Sentence-BERT models that are pre-tuned on large amounts of data to embed sentences from the query and candidate documents. Finally, we employ a linear aggregation method to combine the relevance scores obtained by traditional IR models and neural models, aiming to incorporate both the semantic understanding of neural models and the statistically measured topical relevance. We show that aggregating these relevance scores can improve the overall retrieval effectiveness.
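
The abstract outlines four technical steps (query reformulation, first-stage retrieval, cluster-driven re-ranking, and score aggregation). The sketches below illustrate how three of these could look in code; they are hedged reconstructions based only on the abstract, so model names, parameters, and scoring rules are assumptions rather than the authors' exact configuration.

A minimal sketch of query reformulation via embedding-based keyphrase extraction, using the KeyBERT library named in the abstract; the backbone model "all-MiniLM-L6-v2" and the top_n value are illustrative choices, not reported settings.

```python
from keybert import KeyBERT

def reformulate_query(case_text: str, top_n: int = 30) -> str:
    """Shorten a long legal case into a unigram-based query."""
    kw_model = KeyBERT(model="all-MiniLM-L6-v2")  # assumed backbone, not from the paper
    keywords = kw_model.extract_keywords(
        case_text,
        keyphrase_ngram_range=(1, 1),  # unigrams, mirroring the statistical extractors (KLI, PLM, IDF-r)
        stop_words="english",
        top_n=top_n,
    )
    # extract_keywords returns (term, score) pairs sorted by relevance
    return " ".join(term for term, _ in keywords)
```

One possible reading of the cluster-driven re-ranking step: embed the sentences of the query case and of a candidate case with a pre-tuned Sentence-BERT model, cluster the candidate's sentence embeddings, and score the candidate by the similarity between query sentences and cluster centroids. The number of clusters and the max-then-mean aggregation are assumptions made for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed pre-tuned sentence encoder

def cluster_rerank_score(query_sents, cand_sents, n_clusters=5):
    q_emb = model.encode(query_sents, normalize_embeddings=True)
    c_emb = model.encode(cand_sents, normalize_embeddings=True)
    k = min(n_clusters, len(cand_sents))
    centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(c_emb).cluster_centers_
    # cosine similarity between every query sentence and every candidate cluster centroid
    sims = (q_emb @ centroids.T) / (np.linalg.norm(centroids, axis=1) + 1e-12)
    # each query sentence is matched to its best cluster; the mean gives the document score
    return float(sims.max(axis=1).mean())
```

Finally, a sketch of linearly aggregating lexical and neural relevance scores; min-max normalization and an interpolation weight lam are common choices, assumed here because the abstract does not report the exact scheme.

```python
def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def aggregate(lexical_scores, neural_scores, lam=0.5):
    """Interpolate per-document scores from a lexical model (e.g., BM25) and a neural re-ranker."""
    lex, neu = min_max(lexical_scores), min_max(neural_scores)
    docs = set(lex) | set(neu)
    return {d: lam * lex.get(d, 0.0) + (1 - lam) * neu.get(d, 0.0) for d in docs}

# Example: fuse first-stage and re-ranking scores for three candidate cases
fused = aggregate({"c1": 12.3, "c2": 8.1, "c3": 10.0},
                  {"c1": 0.62, "c2": 0.91, "c3": 0.40})
```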