Paper Title
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Paper Authors
Paper Abstract
Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps---a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example---the model's confidence in the true class, and the variability of this confidence across epochs---obtained in a single run of training. Experiments across four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of "ambiguous" regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are "easy to learn" for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds "hard to learn"; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
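The two training-dynamics measures the abstract describes, confidence and variability, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the gold-label probabilities for each example have already been recorded at the end of every epoch (the `probs` array and its values are hypothetical).

```python
import numpy as np

# Hypothetical input: probs[e, i] = model's probability assigned to the
# gold label of example i at the end of epoch e (3 epochs, 3 examples).
probs = np.array([
    [0.90, 0.40, 0.10],  # epoch 1
    [0.95, 0.70, 0.05],  # epoch 2
    [0.99, 0.30, 0.10],  # epoch 3
])

# Confidence: mean gold-label probability across epochs.
confidence = probs.mean(axis=0)

# Variability: standard deviation of the gold-label probability across epochs.
variability = probs.std(axis=0)

# Reading the map: high confidence + low variability -> "easy to learn";
# low confidence + low variability -> "hard to learn" (possible label error);
# high variability -> "ambiguous".
for i, (c, v) in enumerate(zip(confidence, variability)):
    print(f"example {i}: confidence={c:.2f}, variability={v:.2f}")
```

Plotting variability on the x-axis against confidence on the y-axis for every training example yields the data map; the three regions named in the abstract appear as clusters in that plane.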