Paper Title

Label Noise Types and Their Effects on Deep Learning

Authors

Görkem Algan, İlkay Ulusoy

Abstract

The recent success of deep learning is mostly due to the availability of big datasets with clean annotations. However, gathering a cleanly annotated dataset is not always feasible due to practical challenges. As a result, label noise is a common problem in datasets, and numerous methods to train deep neural networks in the presence of noisy labels are proposed in the literature. These methods commonly use benchmark datasets with synthetic label noise on the training set. However, there are multiple types of label noise, and each of them has its own characteristic impact on learning. Since each work generates a different kind of label noise, it is problematic to test and compare those algorithms in the literature fairly. In this work, we provide a detailed analysis of the effects of different kinds of label noise on learning. Moreover, we propose a generic framework to generate feature-dependent label noise, which we show to be the most challenging case for learning. Our proposed method aims to emphasize similarities among data instances by sparsely distributing them in the feature domain. By this approach, samples that are more likely to be mislabeled are detected from their softmax probabilities, and their labels are flipped to the corresponding class. The proposed method can be applied to any clean dataset to synthesize feature-dependent noisy labels. For the ease of other researchers to test their algorithms with noisy labels, we share corrupted labels for the most commonly used benchmark datasets. Our code and generated noisy synthetic labels are available online.
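The flipping step described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: it assumes the softmax probabilities have already been produced by some trained model, and the function name `flip_labels_by_softmax`, the `noise_ratio` parameter, and the rule "flip the samples with the largest probability on a wrong class" are all hypothetical choices made for illustration. The step that spreads instances sparsely in the feature domain is not reproduced here.

```python
import numpy as np


def flip_labels_by_softmax(probs, labels, noise_ratio):
    """Flip the labels of the samples a model finds most confusable.

    probs:       (n_samples, n_classes) softmax outputs, assumed to be given.
    labels:      (n_samples,) clean integer labels.
    noise_ratio: fraction of samples to mislabel (hypothetical parameter).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n = labels.shape[0]

    # Mask out the true class so argmax selects the most likely wrong class.
    wrong = probs.copy()
    wrong[np.arange(n), labels] = -np.inf
    candidate = wrong.argmax(axis=1)   # class each sample would be flipped to
    confusion = wrong.max(axis=1)      # probability assigned to that class

    # Assumed selection rule: flip the samples with the highest wrong-class
    # probability until the requested noise ratio is reached.
    n_flip = int(round(noise_ratio * n))
    flip_idx = np.argsort(-confusion, kind="stable")[:n_flip]

    noisy = labels.copy()
    noisy[flip_idx] = candidate[flip_idx]
    return noisy


# Toy example: 4 samples, 3 classes, 50% noise.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.30, 0.25, 0.45],
])
labels = np.array([0, 0, 1, 2])
noisy = flip_labels_by_softmax(probs, labels, noise_ratio=0.5)
print(noisy.tolist())  # [0, 1, 1, 0]
```

In the toy example, samples 1 and 3 carry the most probability on a wrong class (0.35 and 0.30), so they are the ones relabeled. The resulting noise depends on each instance's features through the model's predictions, rather than being drawn uniformly at random.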
