Paper Title
Dataset Pruning: Reducing Training Data by Examining Generalization Influence
Paper Authors
Paper Abstract
The great success of deep learning heavily relies on increasingly larger training data, which comes at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample or sub-training-set affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these questions, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% decrease in test accuracy, outperforming previous score-based sample selection methods.
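For intuition only, below is a minimal sketch of generic influence-based sample selection using a first-order approximation (in the spirit of influence functions): each training sample is scored by the dot product of its loss gradient with a validation-loss gradient, and the lowest-scoring fraction is pruned. This is an assumed, simplified baseline for illustration, not the paper's method; the paper's optimization-based pruning selects the subset jointly under a generalization-gap constraint rather than scoring samples independently. All function names and the gradient-matrix setup here are hypothetical.

```python
import numpy as np

def influence_scores(train_grads: np.ndarray, val_grad: np.ndarray) -> np.ndarray:
    # First-order influence proxy (an assumption, not the paper's estimator):
    # a sample whose training-loss gradient aligns with the validation-loss
    # gradient is credited with higher influence on generalization.
    return train_grads @ val_grad

def prune_dataset(train_grads: np.ndarray, val_grad: np.ndarray,
                  keep_ratio: float = 0.6) -> np.ndarray:
    # Keep the `keep_ratio` fraction of samples with the highest scores;
    # pruning 40% corresponds to keep_ratio = 0.6.
    scores = influence_scores(train_grads, val_grad)
    n_keep = int(len(scores) * keep_ratio)
    keep_idx = np.argsort(scores)[::-1][:n_keep]  # highest-influence first
    return np.sort(keep_idx)

# Toy usage with random "gradients": 1000 samples, 50-dimensional.
rng = np.random.default_rng(0)
train_grads = rng.normal(size=(1000, 50))  # one gradient row per sample
val_grad = rng.normal(size=50)             # aggregate validation gradient
kept = prune_dataset(train_grads, val_grad, keep_ratio=0.6)
print(f"{len(kept)} of 1000 samples kept")  # 600 of 1000 samples kept
```

Independent per-sample scoring like this ignores interactions between removed samples, which is precisely the gap the paper's jointly optimized subset selection is designed to address.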