Multivariate Microaggregation of Set-Valued Data

Paper Title

Multivariate Microaggregation of Set-Valued Data

Authors

Imran-Daud, Malik; Shaheen, Muhammad; Ahmed, Abbas

Abstract

Data controllers manage immense volumes of data and occasionally release it publicly to help researchers conduct their studies. However, publicly shared data may contain personally identifiable information (PII) that can be collected to re-identify a person. An effective anonymization mechanism is therefore required before such data is released. Microaggregation is a widely used Statistical Disclosure Control (SDC) method. It adapts the k-anonymity principle by generating clusters of k indistinguishable records to preserve the privacy of individuals. However, in existing methods the cluster size is fixed (i.e., k records), and the resulting clusters may hold non-homogeneous records. To address these issues, we propose an adaptive-size clustering technique that aggregates homogeneous records into the same cluster and determines the size of each cluster after a semantic analysis of the records. To achieve this, we extend the MDAV microaggregation algorithm to semantically analyze unstructured records by relying on a taxonomic database (i.e., WordNet) and then aggregate them into homogeneous clusters. Furthermore, we propose a distance measure that determines the extent to which records differ from each other; homogeneous adaptive clusters are constructed on this basis. In the experiments, we measure the cohesiveness of the clusters to gauge the homogeneity of their records. In addition, we propose a method to measure the information loss caused by the redaction method. The results show that the proposed mechanism outperforms existing state-of-the-art solutions.
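For readers unfamiliar with the baseline that the abstract extends, the following is a minimal sketch of classic fixed-size MDAV applied to set-valued records. It is not the authors' adaptive algorithm. Two assumptions are swapped in to keep the example self-contained: plain Jaccard distance stands in for the paper's WordNet-based semantic distance, and a medoid (the record with the smallest total distance to the others) stands in for the centroid, which is not directly defined for sets. The names `jaccard_distance`, `mdav`, and `sample` are illustrative only.

```python
from typing import Callable, FrozenSet, List

Record = FrozenSet[str]


def jaccard_distance(a: Record, b: Record) -> float:
    """Stand-in distance (assumption): 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mdav(
    records: List[Record],
    k: int,
    dist: Callable[[Record, Record], float] = jaccard_distance,
) -> List[List[int]]:
    """Fixed-size MDAV sketch; returns clusters as lists of record indices."""
    remaining = list(range(len(records)))
    clusters: List[List[int]] = []

    def medoid() -> Record:
        # Assumption: the medoid replaces the centroid for set-valued data.
        best = min(
            remaining,
            key=lambda i: sum(dist(records[i], records[j]) for j in remaining),
        )
        return records[best]

    def farthest_from(target: Record) -> int:
        return max(remaining, key=lambda i: dist(records[i], target))

    def pop_cluster(seed: int) -> None:
        # Group the seed with its k - 1 nearest remaining neighbours.
        others = sorted(
            (i for i in remaining if i != seed),
            key=lambda i: dist(records[i], records[seed]),
        )
        cluster = [seed] + others[: k - 1]
        clusters.append(cluster)
        for i in cluster:
            remaining.remove(i)

    # Main loop: build two clusters per pass around two mutually distant seeds.
    while len(remaining) >= 3 * k:
        r = farthest_from(medoid())
        pop_cluster(r)
        s = farthest_from(records[r])
        pop_cluster(s)

    # Between 2k and 3k - 1 records left: build one more cluster of size k.
    if len(remaining) >= 2 * k:
        pop_cluster(farthest_from(medoid()))

    # Whatever is left forms the final cluster.
    if remaining:
        clusters.append(list(remaining))
        remaining.clear()
    return clusters


# Hypothetical set-valued records (e.g., query terms per user).
sample: List[Record] = [
    frozenset({"flu", "fever"}),
    frozenset({"flu", "cough"}),
    frozenset({"loan", "mortgage"}),
    frozenset({"loan", "bank"}),
    frozenset({"fever", "cough"}),
    frozenset({"bank", "mortgage"}),
    frozenset({"hotel", "flight"}),
]
clusters = mdav(sample, k=2)
print(clusters)
```

Every cluster in this sketch holds at least k records as long as the input has at least k records, which is the fixed-size behaviour the abstract identifies as a limitation. An adaptive variant in the spirit of the paper would replace the fixed `k - 1` cut in `pop_cluster` with a rule driven by the semantic distance, but the exact rule is not given in the abstract and is not reproduced here.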
