Paper Title
Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia
Paper Authors
Paper Abstract
Corpora that contain tabular data, such as WebTables, are a vital resource for the academic community. Essentially, they are the backbone of any modern research in information management. They are used for a variety of tasks, including data extraction, knowledge base construction, question answering, column semantic type detection, and many others. Such corpora are useful not only as a source of data, but also as a base for building test datasets. Until now, there have been no such corpora for the Russian language, which has seriously hindered research in the aforementioned areas. In this paper, we present the first corpus of Web tables created specifically from Russian-language material. It was built with a toolkit that we developed to crawl the Russian Wikipedia. Both the corpus and the toolkit are open-source and publicly available. Finally, we present a short study that describes Russian Wikipedia tables and their statistics.