Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

Paper Title

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

Authors

Wang, Jiachen T., Jia, Ruoxi

Abstract

Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
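The abstract names the Maximum Sample Reuse (MSR) principle without giving the estimator itself. The following is a minimal sketch, not the authors' implementation. It assumes the standard game-theoretic definition of the Banzhaf value (the average marginal contribution of a point over all subsets of the remaining points), and it assumes that MSR means every sampled subset's utility is reused for all data points: a point's value is estimated as the mean utility of the sampled subsets that contain it minus the mean utility of those that do not. The function names `exact_banzhaf` and `msr_banzhaf`, the `utility` callback, and the toy additive utility are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def exact_banzhaf(n, utility):
    """Exact Banzhaf values by enumerating all subsets (only feasible for tiny n)."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of point i to subset s.
                total += utility(s | {i}) - utility(s)
        values.append(total / 2 ** (n - 1))
    return values


def msr_banzhaf(n, utility, num_samples, seed=0):
    """Monte Carlo estimate that reuses each sampled subset for every point.

    Assumption: subsets are drawn uniformly (each point kept with prob. 0.5).
    """
    rng = random.Random(seed)
    sum_in, cnt_in = [0.0] * n, [0] * n
    sum_out, cnt_out = [0.0] * n, [0] * n
    for _ in range(num_samples):
        s = frozenset(i for i in range(n) if rng.random() < 0.5)
        u = utility(s)  # one utility evaluation, reused for all n points
        for i in range(n):
            if i in s:
                sum_in[i] += u
                cnt_in[i] += 1
            else:
                sum_out[i] += u
                cnt_out[i] += 1
    estimates = []
    for i in range(n):
        mean_in = sum_in[i] / cnt_in[i] if cnt_in[i] else 0.0
        mean_out = sum_out[i] / cnt_out[i] if cnt_out[i] else 0.0
        estimates.append(mean_in - mean_out)
    return estimates


if __name__ == "__main__":
    # Toy additive utility (hypothetical): each point adds a fixed amount.
    weights = [1.0, 2.0, -0.5, 0.0]
    toy_utility = lambda s: sum(weights[j] for j in s)
    print(exact_banzhaf(4, toy_utility))
    print(msr_banzhaf(4, toy_utility, num_samples=20000))
```

In a real data valuation setting, `utility` would train a model on the selected subset and return its validation score, which is where the noise from stochastic training discussed in the abstract would enter. With the additive toy utility, the exact Banzhaf value of each point equals its weight, which gives a simple sanity check for the sampled estimate.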

Paper Title