Paper Title
Detecting Textual Adversarial Examples Based on Distributional Characteristics of Data Representations
Paper Authors
Abstract
Although deep neural networks have achieved state-of-the-art performance in various machine learning tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks on natural language tasks have boomed in the last five years, using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, such as adversarial training, there are, to our knowledge, no effective general reactive approaches to defence via detection of textual adversarial examples, such as are found in the image processing literature. In this paper, we propose two new reactive methods for NLP to fill this gap which, unlike the few limited-application baselines from NLP, are based entirely on distributional characteristics of learned representations: we adapt one from the image processing literature (Local Intrinsic Dimensionality (LID)) and propose a novel one (MultiDistance Representation Ensemble Method (MDRE)). Adapted LID and MDRE obtain state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset, as well as on the latter two for the MultiNLI dataset. For future research, we publish our code.
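As background for the LID-based method named above, the image-processing literature typically estimates Local Intrinsic Dimensionality with the maximum-likelihood estimator over the k nearest-neighbour distances of a point within a batch of representations. The sketch below shows that standard estimator only; it is an assumption for illustration, not the paper's exact adaptation to textual representations, and the function name `lid_mle` and the parameter `k` are ours.

```python
import numpy as np

def lid_mle(query, reference, k=20):
    """Maximum-likelihood LID estimate of `query` relative to a batch of
    reference representations, using its k nearest neighbours.

    Estimator: LID ~= -(1/k * sum_i log(r_i / r_k))^{-1}, where r_1..r_k
    are the k smallest positive distances and r_k is the largest of them.
    """
    # Euclidean distances from the query to every reference point
    dists = np.linalg.norm(reference - query, axis=1)
    # keep the k smallest strictly positive distances
    r = np.sort(dists[dists > 0])[:k]
    # mean log-ratio is negative, so the estimate is positive
    return -1.0 / np.mean(np.log(r / r[-1]))
```

In adversarial-example detection, such estimates are computed per layer of a classifier and fed to a simple detector, the intuition being that adversarial inputs tend to lie in regions of higher local intrinsic dimensionality than clean inputs.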