Paper title
Know thy corpus! Robust methods for digital curation of Web corpora
Paper authors
Paper abstract
This paper proposes a novel framework for the digital curation of Web corpora that provides robust estimation of their parameters, such as their composition and lexicon. In recent years, language models pre-trained on large corpora have emerged as clear winners in numerous NLP tasks, but no proper analysis of the corpora behind their success has been conducted. The paper presents a procedure for robust frequency estimation, which helps establish the core lexicon of a given corpus, as well as a procedure for estimating corpus composition via unsupervised topic models and via supervised genre classification of Web pages. The results of the digital curation study, applied to several Web-derived corpora, demonstrate their considerable differences. First, this concerns the different frequency bursts that affect the core lexicon obtained from each corpus. Second, this concerns the kinds of texts they contain: for example, OpenWebText contains considerably more topical news and political argumentation than ukWac or Wikipedia. The tools and the results of the analysis have been released.
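The abstract does not spell out the robust frequency estimation procedure, but a common way to guard word frequencies against bursts (a word repeated heavily in a handful of pages) is to split the corpus into chunks and aggregate per-chunk rates with a median instead of taking one global count. The sketch below illustrates that general idea only; `robust_frequencies`, its parameters, and the chunking scheme are illustrative assumptions, not the paper's released tooling.

```python
from collections import Counter
import statistics

def robust_frequencies(documents, n_chunks=100):
    """Burst-resistant word frequency estimates (illustrative sketch).

    Splits the token stream into roughly n_chunks equal chunks, counts
    each word per chunk, and aggregates the per-chunk relative
    frequencies with the median. A word inflated by a burst in a few
    documents is absent from most chunks, so its median collapses,
    while genuinely common words keep a high rate in most chunks.
    """
    tokens = [t.lower() for doc in documents for t in doc.split()]
    chunk_size = max(1, len(tokens) // n_chunks)
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    vocab = set(tokens)
    rates = {w: [] for w in vocab}
    for chunk in chunks:
        counts = Counter(chunk)
        total = len(chunk)
        for w in vocab:
            rates[w].append(counts[w] / total)  # 0.0 if w absent from chunk
    return {w: statistics.median(r) for w, r in rates.items()}

# Toy usage: 'spam' is bursty (confined to one document), so its median
# per-chunk rate collapses toward zero, while dispersed words survive.
docs = ["the cat sat on the mat and the dog slept",
        "spam spam spam spam spam spam spam spam",
        "the dog chased the cat around the garden",
        "a quiet morning with the cat and the dog"]
freqs = robust_frequencies(docs, n_chunks=8)
print(sorted(freqs.items(), key=lambda kv: -kv[1])[:5])
```

Choosing the median over the mean is what makes the estimate robust: a mean is dragged up by a single bursty chunk, whereas the median reflects how a word behaves across most of the corpus, which is the property a core lexicon is meant to capture.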