Paper Title

Differentially Private Counterfactuals via Functional Mechanism

Paper Authors

Fan Yang, Qizhang Feng, Kaixiong Zhou, Jiahao Chen, Xia Hu

Abstract

Counterfactuals, an emerging type of model explanation, have recently attracted substantial attention from both industry and academia. Unlike conventional feature-based explanations (e.g., attributions), counterfactuals are hypothetical samples that flip model decisions with minimal perturbations to the query. Given valid counterfactuals, humans can reason under ``what-if'' circumstances and thus better understand the model's decision boundaries. However, releasing counterfactuals can be detrimental, since they may unintentionally leak sensitive information to adversaries, raising risks to both model security and data privacy. To bridge this gap, in this paper we propose a novel framework that generates differentially private counterfactuals (DPC) without touching the deployed model or the explanation set, where noise is injected for protection while the explanatory role of the counterfactuals is maintained. In particular, we train an autoencoder with the functional mechanism to construct noisy class prototypes, and then derive the DPC from the latent prototypes based on the post-processing immunity of differential privacy. Further evaluations demonstrate the effectiveness of the proposed framework, showing that DPC can successfully mitigate the risks of both extraction and inference attacks.
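The pipeline the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: here simple Laplace output perturbation on latent class means stands in for the paper's functional-mechanism training of the autoencoder, and all function names and parameters (`noisy_class_prototype`, `counterfactual_from_prototype`, `step`) are hypothetical. The key property it shows is that, once the prototype is privatized, deriving a counterfactual from it is pure post-processing and consumes no additional privacy budget.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_class_prototype(latents, epsilon, sensitivity=1.0):
    """Average the latent codes of one class and add Laplace noise.
    (A simple output-perturbation stand-in for the paper's
    functional-mechanism training; `sensitivity` is assumed known.)"""
    proto = latents.mean(axis=0)
    noise = rng.laplace(scale=sensitivity / epsilon, size=proto.shape)
    return proto + noise

def counterfactual_from_prototype(z_query, z_proto, step=0.5):
    """Move the query's latent code toward the (noisy) target-class
    prototype; decoding this point would yield the DPC. Because the
    prototype is already differentially private, this derivation is
    post-processing and is covered by DP's immunity property."""
    return z_query + step * (z_proto - z_query)

# Toy latent codes for a hypothetical target class.
latents = rng.normal(loc=2.0, size=(100, 8))
z_proto = noisy_class_prototype(latents, epsilon=1.0)
z_query = np.zeros(8)           # latent code of the query sample
z_cf = counterfactual_from_prototype(z_query, z_proto)
```

In the actual framework, the encoder/decoder pair is trained privately and the counterfactual is decoded back to input space; this sketch only mirrors the latent-space interpolation step.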
