Paper Title

Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data

Paper Authors

David Lowell, Brian E. Howard, Zachary C. Lipton, Byron C. Wallace

Paper Abstract

Unsupervised Data Augmentation (UDA) is a semi-supervised technique that applies a consistency loss to penalize differences between a model's predictions on (a) observed (unlabeled) examples and (b) corresponding 'noised' examples produced via data augmentation. While UDA has gained popularity for text classification, open questions linger over which design decisions are necessary and over how to extend the method to sequence labeling tasks. In this paper, we re-examine UDA and demonstrate its efficacy on several sequential tasks. Our main contribution is an empirical study of UDA to establish which components of the algorithm confer benefits in NLP. Notably, although prior work has emphasized the use of clever augmentation techniques such as back-translation, we find that enforcing consistency between predictions assigned to observed and randomly substituted words often yields comparable (or greater) benefits than these complex perturbation models. Furthermore, we find that applying the consistency loss affords meaningful gains without any unlabeled data at all, i.e., in a standard supervised setting. In short: UDA need not be unsupervised, and does not require complex data augmentation to be effective.
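
To make the consistency loss described in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation. It assumes a classifier `model(token_ids)` that returns class logits; the function name `random_word_substitution`, the substitution probability `sub_prob`, and the uniform sampling scheme are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch of UDA-style consistency training with naive augmentation
# (random word substitution). `model`, `sub_prob`, and the uniform sampling
# scheme are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F

def random_word_substitution(token_ids, vocab_size, sub_prob=0.1):
    """Replace each token with a uniformly sampled vocabulary id with probability sub_prob."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < sub_prob
    random_ids = torch.randint(0, vocab_size, token_ids.shape, device=token_ids.device)
    return torch.where(mask, random_ids, token_ids)

def consistency_loss(model, token_ids, vocab_size):
    """KL divergence between class predictions on the original and the noised input.

    As in the usual UDA setup, the prediction on the clean example is treated as a
    fixed target (no gradient), and gradients flow only through the noised branch.
    """
    with torch.no_grad():
        clean_probs = F.softmax(model(token_ids), dim=-1)  # (batch, num_classes)
    noised_logits = model(random_word_substitution(token_ids, vocab_size))
    return F.kl_div(F.log_softmax(noised_logits, dim=-1), clean_probs,
                    reduction="batchmean")
```

In the supervised-only variant the abstract highlights, this same loss can be computed on the labeled training examples themselves and added to the standard cross-entropy objective, with no unlabeled data involved.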
