Paper Title

Data Provenance via Differential Auditing

Authors

Xin Mu, Ming Pang, Feida Zhu

Abstract

Auditing Data Provenance (ADP), i.e., auditing whether a certain piece of data has been used to train a machine learning model, is an important problem in data provenance. Existing auditing techniques, e.g., shadow auditing methods, have demonstrated the feasibility of the task under certain conditions, such as the availability of label information and knowledge of the target model's training protocol. Unfortunately, both of these conditions are often unavailable in real applications. In this paper, we introduce Data Provenance via Differential Auditing (DPDA), a practical framework for auditing data provenance based on statistically significant differentials: after a carefully designed transformation, perturbed inputs drawn from the target model's training set produce far more drastic changes in the model's output than perturbed inputs from outside the training set. This framework allows auditors to distinguish training data from non-training data without needing to train any shadow models on labeled output data. Furthermore, we propose two effective implementations of the auditing function, an additive one and a multiplicative one. We report evaluations on real-world datasets demonstrating the effectiveness of our proposed auditing technique.
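The differential-auditing idea described above can be sketched in a few lines: perturb an input several times, measure how much the model's output moves, and flag inputs whose movement exceeds a calibrated threshold. This is a minimal illustration under stated assumptions, not the authors' implementation; `model_fn`, the Gaussian `noise_scale`, and the L1-based additive-style score are all assumptions for the sketch.

```python
import numpy as np

def differential_audit_score(model_fn, x, noise_scale=0.05, n_trials=20, rng=None):
    """Score how much the model's output shifts under small input perturbations.

    model_fn: callable mapping a batch of inputs (n, d) to output scores (n, k).
    x: a single input vector of shape (d,).
    Returns a non-negative scalar; per the differential-auditing intuition,
    training-set members are expected to score higher than non-members.
    (Hypothetical additive-style auditing function, not the paper's exact form.)
    """
    rng = np.random.default_rng(rng)
    base = model_fn(x[None, :])                      # unperturbed output, shape (1, k)
    noise = rng.normal(0.0, noise_scale, size=(n_trials,) + x.shape)
    perturbed = model_fn(x[None, :] + noise)         # outputs for n_trials perturbed copies
    # Additive differential: mean absolute change between perturbed and base outputs.
    return float(np.abs(perturbed - base).mean())

def audit(model_fn, samples, threshold):
    """Flag samples whose differential score exceeds a calibrated threshold."""
    return [differential_audit_score(model_fn, x) > threshold for x in samples]
```

In practice the threshold would be calibrated on data known to be outside the training set; the multiplicative variant mentioned in the abstract would instead scale the input rather than add noise to it.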
