Examining the Feasibility of Off-the-Shelf Algorithms for Masking Directly Identifiable Information in Social Media Data

Paper Title

Examining the Feasibility of Off-the-Shelf Algorithms for Masking Directly Identifiable Information in Social Media Data

Authors

Rachel Dorn, Alicia L. Nobles, Masoud Rouhizadeh, Mark Dredze

Abstract

The identification and removal/replacement of protected information from social media data is an understudied problem, despite being desirable from an ethical and legal perspective. This paper identifies types of potentially directly identifiable information (inspired by protected health information in clinical texts) contained in tweets that may be readily removed using off-the-shelf algorithms, introduces an English dataset of tweets annotated for identifiable information, and compiles these off-the-shelf algorithms into a tool (Nightjar) to evaluate the feasibility of using Nightjar to remove directly identifiable information from the tweets. Nightjar as well as the annotated data can be retrieved from https://bitbucket.org/mdredze/nightjar.
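To make the idea of masking directly identifiable information concrete, here is a minimal sketch in Python. It is not Nightjar's implementation, and the abstract does not list the categories or algorithms the tool uses. The four categories (URLs, email addresses, @-mentions, US-style phone numbers), the regular expressions, and the function name `mask_tweet` are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns for illustration only; they are NOT taken from Nightjar.
# Order matters: URLs and emails are masked before @-mentions so that the "@"
# inside an email address is not mistaken for a user handle.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("USER", re.compile(r"(?<!\w)@\w{1,15}")),
    ("PHONE", re.compile(
        r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
    )),
]


def mask_tweet(text: str) -> str:
    """Replace each matched span with a bracketed category placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    example = "Thanks @jane_doe! Call 555-123-4567 or see https://example.com/x"
    print(mask_tweet(example))
```

Pattern matching of this kind can only catch identifiers with a regular surface form. Names, locations, and other free-text identifiers would likely need a named-entity recognizer or a similar model, which this sketch does not attempt.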
