Paper title
Data Origin Inference in Machine Learning
Paper authors
Paper abstract
Exploiting unintended memorization in ML models to benefit real-world applications is a growing research direction, with recent efforts such as user auditing, dataset ownership inference, and forgotten data measurement. From the perspective of ML model development, we introduce a process named data origin inference, which helps ML developers locate missing or faulty data origins in the training set without maintaining burdensome metadata. We formally define the data origin and the data origin inference task in the development of ML models (mainly neural networks). We then propose a novel inference strategy that combines embedded-space multiple instance classification with shadow training. Our use cases cover language, visual, and structured data, with various kinds of data origin (e.g., business, county, movie, mobile user, text author). A comprehensive performance analysis of the proposed strategy covers the target model layers that are referenced, the testing data available for each origin, and, within shadow training, the implementations of feature extraction and of the shadow models. Our best inference accuracy reaches 98.96% in the language use case when the target model is a transformer-based deep neural network. Furthermore, we give a statistical analysis of different kinds of data origin to investigate which kinds of origin are likely to be inferred correctly.
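The abstract names embedded-space multiple instance classification and shadow training without giving implementation details. The following is a minimal toy sketch of the general idea, not the authors' implementation. It assumes that all samples from one origin form a "bag", that each sample is represented by a feature vector taken from some model layer, and that bags with known membership (as shadow training would provide) are available to fit a bag-level classifier. The names `embed_bag`, `fit_centroids`, `predict`, `toy_layer_outputs`, and `run_demo` are hypothetical, the layer outputs are synthetic Gaussian data, and the nearest-centroid classifier is a stand-in chosen for brevity.

```python
import numpy as np


def embed_bag(instance_features):
    """Embedded-space MIL: map a bag of shape (n_instances, n_features)
    to one fixed-length vector by concatenating mean and max pooling."""
    x = np.asarray(instance_features, dtype=float)
    return np.concatenate([x.mean(axis=0), x.max(axis=0)])


def fit_centroids(bag_embeddings, labels):
    """Fit a nearest-centroid bag classifier (a stand-in, assumed here).
    Label 1 = origin was in the training set, label 0 = it was not."""
    e = np.asarray(bag_embeddings, dtype=float)
    y = np.asarray(labels)
    return {int(c): e[y == c].mean(axis=0) for c in np.unique(y)}


def predict(centroids, bag_embedding):
    """Return the label whose centroid is closest to the bag embedding."""
    return min(
        centroids,
        key=lambda c: np.linalg.norm(bag_embedding - centroids[c]),
    )


def toy_layer_outputs(rng, is_member, n_instances=20, n_features=8):
    """Synthetic placeholder for layer outputs of a model on one origin's
    samples. ASSUMPTION: member origins are shifted by +1 so the toy
    problem is separable; real activations would come from a model."""
    shift = 1.0 if is_member else 0.0
    return rng.normal(loc=shift, scale=1.0, size=(n_instances, n_features))


def run_demo(seed=0, n_bags=100):
    """Fit on bags with known membership, then infer membership of new bags."""
    rng = np.random.default_rng(seed)
    labels = [i % 2 for i in range(n_bags)]

    # Phase with known membership (the role shadow models would play).
    train = [embed_bag(toy_layer_outputs(rng, y)) for y in labels]
    centroids = fit_centroids(train, labels)

    # Inference phase: classify fresh bags and measure accuracy.
    hits = [
        predict(centroids, embed_bag(toy_layer_outputs(rng, y))) == y
        for y in labels
    ]
    return float(np.mean(hits))


if __name__ == "__main__":
    print("toy bag-level accuracy:", run_demo())
```

In this sketch the classifier sees only one pooled vector per origin, never individual samples, which is what distinguishes the embedded-space variant of multiple instance classification from instance-level approaches. The accuracy it reports reflects the synthetic separation only and says nothing about the results in the paper.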