Paper Title
Picket: Guarding Against Corrupted Data in Tabular Data during Learning and Inference
Authors
Abstract
Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a simple framework to safeguard against data corruption during both training and deployment of machine learning models over tabular data. For the training stage, Picket identifies and removes corrupted data points from the training data to avoid obtaining a biased model. For the deployment stage, Picket flags, in an online manner, corrupted query points sent to a trained machine learning model that, due to noise, would result in incorrect predictions. To detect corrupted data, Picket uses a self-supervised deep learning model for mixed-type tabular data, which we call PicketNet. To minimize the burden of deployment, learning a PicketNet model does not require any human-labeled data. Picket is designed as a plugin that can increase the robustness of any machine learning pipeline. We evaluate Picket on a diverse array of real-world data sets, considering different corruption models that include systematic and adversarial noise during both training and testing. We show that Picket consistently safeguards against corrupted data during both training and deployment of various models ranging from SVMs to neural networks, beating a diverse array of competing methods that span from data quality validation models to robust outlier-detection models.
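The two-stage plugin design described above can be sketched as a small wrapper with a training-time filter and a deployment-time flag. This is a hypothetical illustration, not the paper's implementation: the class name `PicketFilter` and its methods are invented for exposition, and a simple per-feature z-score reconstruction error stands in for the self-supervised PicketNet scoring model.

```python
import numpy as np

class PicketFilter:
    """Hypothetical sketch of a Picket-style plugin. A z-score-based
    row score stands in for the self-supervised PicketNet model."""

    def __init__(self, contamination=0.05):
        # Assumed fraction of corrupted rows in the training data.
        self.contamination = contamination

    def fit(self, X):
        # Learn per-feature statistics from the (possibly corrupted) training set.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9
        scores = self._score(X)
        # Threshold so that roughly `contamination` of training rows are flagged.
        self.threshold_ = np.quantile(scores, 1.0 - self.contamination)
        return self

    def _score(self, X):
        # Row score: mean absolute z-score across features (higher = more suspicious).
        return np.abs((X - self.mean_) / self.std_).mean(axis=1)

    def filter_train(self, X):
        # Training stage: drop rows whose score exceeds the threshold.
        keep = self._score(X) <= self.threshold_
        return X[keep], keep

    def flag_query(self, x):
        # Deployment stage: flag a single query point, in an online manner.
        return bool(self._score(x.reshape(1, -1))[0] > self.threshold_)

# Demo on synthetic tabular data with injected corruption.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
X[:10] += 10.0  # corrupt the first 10 rows with a large systematic shift

f = PicketFilter(contamination=0.05).fit(X)
X_clean, mask = f.filter_train(X)
```

Any downstream model (an SVM, a neural network, etc.) would then be trained on `X_clean`, and `flag_query` would be called on each incoming point before prediction, which is what makes the design pipeline-agnostic.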