An Evaluation Dataset for Legal Word Embedding: A Case Study On Chinese Codex

Paper title

An Evaluation Dataset for Legal Word Embedding: A Case Study On Chinese Codex

Authors

Lin, Chun-Hsien; Cheng, Pu-Jen

Abstract

Word embedding is a modern distributed word representation approach widely used in many natural language processing tasks. Converting the vocabulary of legal documents into a word embedding model makes it possible to apply machine learning, deep learning, and other algorithms to those documents and then perform downstream natural language processing tasks such as document classification, contract review, and machine translation. The most common and practical way to evaluate the accuracy of a word embedding model uses a benchmark set, built from linguistic rules or relationships between words, to perform analogical reasoning via algebraic calculation. This paper proposes a set of 1,134 Legal Analogical Reasoning Questions (LARQS), built from the 2,388 Chinese Codex corpus using five kinds of legal relations, which is then used to evaluate the accuracy of Chinese word embedding models. Moreover, we found that legal relations might be ubiquitous in the word embedding model.
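
The abstract's "analogy reasoning via algebraic calculation" can be illustrated with a minimal sketch. It assumes the widely used vector-offset scheme (answer "a is to b as c is to ?" with the word nearest to b - a + c by cosine similarity); the abstract does not say whether the paper uses exactly this variant. The function names, the toy vectors, and the example word pairs below are hypothetical and are not taken from LARQS.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c.

    The three question words are excluded from the candidates, as is
    customary in analogy benchmarks.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, expected) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped.
    """
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)


# Hypothetical 3-dimensional toy embeddings, for illustration only.
TOY_EMBEDDINGS = {
    "plaintiff": [1.0, 0.0, 1.0],
    "defendant": [1.0, 0.0, -1.0],
    "creditor": [0.0, 1.0, 1.0],
    "debtor": [0.0, 1.0, -1.0],
    "court": [1.0, 1.0, 0.0],
}

# Hypothetical analogy questions: (a, b, c, expected answer).
TOY_QUESTIONS = [
    ("plaintiff", "defendant", "creditor", "debtor"),
    ("creditor", "debtor", "plaintiff", "defendant"),
]

if __name__ == "__main__":
    print(analogy_accuracy(TOY_EMBEDDINGS, TOY_QUESTIONS))
```

In practice the embeddings would come from a trained model and the questions from a benchmark file; the accuracy is then the share of questions for which the nearest word matches the expected answer.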
