Paper Title

Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Authors

Mukund Srinath, Shomir Wilson, C. Lee Giles

Abstract

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though users often care about their digital privacy, they often don't read privacy policies since they require a significant investment in time and effort. Although natural language processing can help in privacy policy understanding, there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. Thus, we create PrivaSeer, a corpus of over one million English language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline which consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus, show results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
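To make the duplicate and near-duplicate removal step of the pipeline more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which technique PrivaSeer uses, so the choice of SHA-256 hashing for exact duplicates, word shingles with Jaccard similarity for near-duplicates, the 0.8 threshold, and all function names are assumptions made purely for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold: float = 0.8, k: int = 3) -> list:
    """Keep the first occurrence of each document.

    Exact duplicates are dropped by hashing the normalized text.
    Near-duplicates are dropped when their shingle sets have a Jaccard
    similarity of at least `threshold` with an already kept document.
    The threshold value is an illustrative assumption.
    """
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(doc, k)
        if any(jaccard(doc_shingles, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a kept document
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept


policies = [
    "we collect your email address and name",
    "We  collect your EMAIL address and name",
    "we collect your email address and name today",
    "cookies are used to measure site traffic",
]
unique_policies = deduplicate(policies)
```

This sketch compares every new document against all kept documents, which is quadratic in the number of documents. At the scale of a million policies, an approximate scheme such as MinHash with locality-sensitive hashing would likely be needed instead of pairwise comparison.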
