Paper Title
Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies
Paper Authors
Paper Abstract
Organisations disclose their privacy practices by posting privacy policies on their websites. Even though users often care about their digital privacy, they usually do not read privacy policies, since doing so requires a significant investment of time and effort. Although natural language processing can help in privacy policy understanding, there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. Thus, we create PrivaSeer, a corpus of over one million English language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline which consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus, show results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
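To make the duplicate and near-duplicate removal stage of the pipeline more concrete, here is a minimal Python sketch. The abstract does not say which technique the authors use, so everything below is an assumption made for illustration: exact duplicates are caught by hashing normalized text, and near-duplicates by Jaccard similarity over word shingles. The function names (`normalize`, `shingles`, `jaccard`, `deduplicate`), the shingle size `k=3`, and the `threshold=0.8` cutoff are all hypothetical choices, not values from the paper.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        # Very short documents become a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8, k: int = 3):
    """Keep the first occurrence of each document.

    A document is dropped if it is an exact duplicate (after normalization)
    of a kept document, or if its shingle set has Jaccard similarity of at
    least `threshold` with a kept document. Both parameters are illustrative.
    """
    seen_hashes = set()
    kept = []  # list of (document, shingle set) pairs
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        shingle_set = shingles(doc, k)
        if any(jaccard(shingle_set, other) >= threshold for _, other in kept):
            continue  # near-duplicate
        seen_hashes.add(digest)
        kept.append((doc, shingle_set))
    return [doc for doc, _ in kept]
```

The pairwise comparison in this sketch is quadratic in the number of kept documents, so it only suits small samples. At the scale of a million policies, an approximate method such as MinHash with locality-sensitive hashing would be the more plausible choice, though the abstract does not state what was used.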