Paper Title
Elephant in the Room: An Evaluation Framework for Assessing Adversarial Examples in NLP
Paper Authors
Paper Abstract
An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify. While a number of methods have been proposed to generate adversarial examples for text data, it is not trivial to assess their quality, as minor perturbations (such as changing a word in a sentence) can lead to a significant shift in meaning, readability, and classification label. In this paper, we propose an evaluation framework consisting of a set of automatic evaluation metrics and human evaluation guidelines to rigorously assess the quality of adversarial examples based on the aforementioned properties. We experiment with six benchmark attack methods and find that some of them generate adversarial examples with poor readability and content preservation. We also find that multiple factors can influence attack performance, such as the length of the text inputs and the architecture of the classifiers.
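
To make the abstract's notion of automatic evaluation concrete, below is a minimal sketch of what one evaluation step might look like: it checks whether a perturbation flips a classifier's label while also scoring how much of the original content is preserved. The toy_classifier, the Jaccard word-overlap similarity, and the 0.7 threshold are illustrative assumptions, not the paper's actual models or metrics.

# Minimal sketch (assumptions throughout): one automatic-evaluation step
# for an adversarial example, pairing an attack-success check with a
# crude content-preservation score. The classifier and metric are toys,
# not the paper's framework.

def toy_classifier(text: str) -> str:
    """Stand-in sentiment classifier (hypothetical, not the paper's model)."""
    negative_words = {"bad", "terrible", "awful"}
    return "negative" if set(text.lower().split()) & negative_words else "positive"

def jaccard_similarity(a: str, b: str) -> float:
    """Word-overlap proxy for content preservation (assumed metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def evaluate_adversarial_example(original: str, perturbed: str,
                                 min_similarity: float = 0.7) -> dict:
    """Score one (original, perturbed) pair on label flip and similarity."""
    flipped = toy_classifier(original) != toy_classifier(perturbed)
    similarity = jaccard_similarity(original, perturbed)
    return {
        "label_flipped": flipped,            # did the attack succeed?
        "similarity": round(similarity, 3),  # is the content preserved?
        "valid_attack": flipped and similarity >= min_similarity,
    }

if __name__ == "__main__":
    original = "the movie was bad and boring"
    perturbed = "the movie was subpar and boring"  # one-word substitution attack
    print(evaluate_adversarial_example(original, perturbed))

Run on this single word substitution, the sketch reports a flipped label with a similarity of about 0.71, illustrating the abstract's point that even a one-word perturbation can simultaneously succeed as an attack and noticeably change the text's content.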