Mapping Languages and Demographics with Georeferenced Corpora

Paper Title

Mapping Languages and Demographics with Georeferenced Corpora

Authors

Jonathan Dunn, Ben Adams

Abstract

This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii) how to weight the datasets to provide more accurate representations of underlying populations. The paper finds that the two datasets represent very different populations and that they correlate with actual populations with values of r=0.60 (social media) and r=0.49 (web-crawled). Further, Twitter data makes better predictions about the inventory of languages used in each country.
