Paper title
Private Data Valuation and Fair Payment in Data Marketplaces
Paper authors
Paper abstract
Data valuation is an essential task in a data marketplace: it aims to compensate data owners fairly for their contributions. The machine learning community increasingly recognizes that the Shapley value, a foundational profit-sharing scheme in cooperative game theory, has major potential for valuing data, because it uniquely satisfies basic properties of fair credit allocation and has been shown to identify data sources that are useful or harmful to model performance. However, calculating the Shapley value requires access to the original data sources. It remains an open question how to design a real-world data marketplace that uses Shapley value-based data pricing while protecting privacy and allowing fair payments. In this paper, we propose the first prototype of a data marketplace that values data sources based on the Shapley value in a privacy-preserving manner while ensuring fair payments. Our approach is enabled by a suite of innovations in both algorithm and system design. We first propose a Shapley value calculation algorithm that can be efficiently implemented via multiparty computation (MPC) circuits. The key idea is to learn a performance predictor that directly predicts the model performance corresponding to an input dataset without performing actual training. We then optimize the MPC circuit design based on the structure of the performance predictor, and incorporate fair payment into the MPC circuit to guarantee that the data the buyer pays for is exactly the data that has been valued. Our experimental results show that the proposed data valuation algorithm is as effective as the original expensive one, and that the customized MPC protocol is efficient and scalable.
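To make the role of the performance predictor concrete, the following is a minimal plaintext sketch of an exact Shapley value computation in which the utility of a coalition of data sources comes from a predictor rather than from training a model. It is not the paper's algorithm and involves no MPC: the function names, the `QUALITY` table, and the additive toy predictor are all hypothetical stand-ins chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(sources, utility):
    """Exact Shapley value of each source.

    `utility` maps a frozenset of sources to a score. Here it stands in
    for a learned performance predictor (an assumption of this sketch).
    """
    n = len(sources)
    values = {}
    for s in sources:
        others = [o for o in sources if o != s]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain s.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                # Marginal contribution of s to this coalition.
                total += weight * (utility(coalition | {s}) - utility(coalition))
        values[s] = total
    return values


# Hypothetical per-source effect on predicted accuracy (toy numbers).
QUALITY = {"A": 0.30, "B": 0.20, "C": -0.05}


def predicted_performance(coalition):
    """Toy additive predictor: base accuracy plus each source's effect."""
    return 0.5 + sum(QUALITY[s] for s in coalition)


values = shapley_values(list(QUALITY), predicted_performance)
for name, value in values.items():
    print(name, round(value, 4))
```

Because the toy predictor is additive, each source's Shapley value equals its own entry in `QUALITY`, and the negative value for source `C` marks it as harmful. The exact enumeration is exponential in the number of sources, so it is only practical for small examples like this one.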