Paper Title

On the Generalization Effects of Linear Transformations in Data Augmentation

Paper Authors

Sen Wu, Hongyang R. Zhang, Gregory Valiant, Christopher Ré

Paper Abstract

Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations that preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations that mix data can improve estimation by playing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on the insights, we propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data. We validate our proposed scheme on image and text datasets. For example, our method outperforms random sampling methods by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve comparable accuracy to the SoTA Adversarial AutoAugment on CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.
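The following is a minimal, hypothetical sketch (not the authors' released code) of the setting the abstract describes: over-parametrized linear regression with dimension larger than the sample size, a label-preserving linear transformation used to augment the training set, and the ridge estimator fit on the enlarged data. The data-generating process, the particular transformation F (a Householder reflection that fixes the true parameter), and the constants sigma and lam are illustrative assumptions chosen only to make the example runnable.

```python
# Hypothetical illustration of label-preserving augmentation for the ridge estimator
# in an over-parametrized linear regression. All concrete choices below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, n = 50, 20                        # d > n: over-parametrized regime
beta = rng.normal(size=d)            # ground-truth parameter (unknown in practice)
sigma, lam = 0.1, 1.0                # noise level and ridge penalty (assumed values)

X = rng.normal(size=(n, d))
y = X @ beta + sigma * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator: (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# A label-preserving linear transformation F satisfies F^T beta = beta, so that
# (F x)^T beta = x^T beta and the original label can be reused for the transformed
# input. Here we build one as a Householder reflection across a hyperplane containing
# beta; in this toy the true beta is used only to construct a valid F for illustration.
w = rng.normal(size=d)
w -= (w @ beta) / (beta @ beta) * beta            # make w orthogonal to beta
F = np.eye(d) - 2.0 * np.outer(w, w) / (w @ w)    # reflection that fixes beta

# Augment: append the transformed inputs with the original labels. The new rows can
# enlarge the span of the training data, which is the effect the abstract highlights.
X_aug = np.vstack([X, X @ F.T])
y_aug = np.concatenate([y, y])

err_base = np.linalg.norm(ridge(X, y, lam) - beta)
err_aug = np.linalg.norm(ridge(X_aug, y_aug, lam) - beta)
print(f"ridge error without augmentation: {err_base:.3f}")
print(f"ridge error with augmentation:    {err_aug:.3f}")
```

In the same spirit, the uncertainty-driven search mentioned in the abstract could be sketched as scoring candidate transformations by the model's loss or predictive uncertainty on the transformed samples and favoring the most uncertain ones; the exact scoring and search procedure are part of the paper's contribution and are not reproduced here.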
