The Construction and Evaluation of the LEAFTOP Dataset of Automatically Extracted Nouns in 1480 Languages

Paper title

The Construction and Evaluation of the LEAFTOP Dataset of Automatically Extracted Nouns in 1480 Languages

Authors

Baker, Greg; Molla-Aliod, Diego

Abstract

The LEAFTOP (language extracted automatically from thousands of passages) dataset consists of nouns that appear in multiple places in the four gospels of the New Testament. We use a naive approach, probabilistic inference, to identify likely translations in 1480 other languages. We evaluate this process and find that it produces lexicons with accuracy ranging from 42% (Korafe) to 99% (Runyankole), averaging 72% correct across the evaluated languages. The process translates up to 161 distinct lemmas from Koine Greek (average 159). We identify nouns that appear to be easy or hard to translate, language families where this technique works, and possible future improvements and extensions. The claims to novelty are: the use of a Koine Greek New Testament as the source; the use of a fully annotated, manually created grammatical parse of the source text; a custom scraper for texts in the target languages; a new metric for language similarity; and a novel strategy for evaluation on low-resource languages.
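
The abstract does not spell out the inference procedure, so the following is only a minimal sketch of one naive co-occurrence scheme that fits the description, not the authors' implementation. It assumes verse-aligned text: for a given source noun, it scores each target-language word by how much more often it appears in verses containing the noun than in other verses. The function name `likely_translation`, the scoring rule, and the toy verses are all hypothetical.

```python
from collections import Counter


def likely_translation(target_verses, verses_with_noun):
    """Guess the target-language word that best matches a source noun.

    target_verses: dict mapping verse id -> list of target-language tokens.
    verses_with_noun: set of verse ids in which the source noun occurs.
    Returns (word, score), or (None, 0.0) if no word scores above zero.
    """
    inside = Counter()   # verses with the noun that contain each word
    outside = Counter()  # verses without the noun that contain each word
    n_in = 0
    for verse_id, tokens in target_verses.items():
        if verse_id in verses_with_noun:
            inside.update(set(tokens))
            n_in += 1
        else:
            outside.update(set(tokens))
    n_out = len(target_verses) - n_in

    best_word, best_score = None, 0.0
    for word, count in inside.items():
        # Assumed score: P(word | noun verse) - P(word | other verse)
        p_out = outside[word] / n_out if n_out else 0.0
        score = count / n_in - p_out
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Toy data (made up): the source noun occurs in verses v1 and v2.
verses = {
    "v1": ["the", "fish", "swam"],
    "v2": ["he", "ate", "the", "fish"],
    "v3": ["the", "boat", "sank"],
    "v4": ["he", "saw", "the", "boat"],
}
word, score = likely_translation(verses, {"v1", "v2"})
print(word, score)
```

Subtracting the rate in the remaining verses keeps frequent function words such as "the" from winning merely because they appear everywhere.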
