Paper title
ParaNames: A Massively Multilingual Entity Name Corpus
Paper authors
Paper abstract
We introduce ParaNames, a multilingual parallel name resource consisting of 118 million names spanning 400 languages. Names are provided for 13.6 million entities, which are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released under a Creative Commons license (CC BY 4.0) at https://github.com/bltlab/paranames.