Paper Title
A Universal Adversarial Policy for Text Classifiers
Paper Authors
Paper Abstract
Discovering the existence of universal adversarial perturbations had a large theoretical and practical impact on the field of adversarial learning. In the text domain, most universal studies have focused on adversarial prefixes that are added to all texts. However, unlike in the vision domain, adding the same perturbation to different inputs results in noticeably unnatural text. We therefore introduce a new universal adversarial setup, a universal adversarial policy, which has many of the advantages of other universal attacks while still producing valid texts, making it relevant in practice. We achieve this by learning, over many texts, a single search policy over a predefined set of semantics-preserving text alterations. This formulation is universal in that the policy efficiently finds adversarial examples on new, unseen texts. Our approach uses text perturbations that have been extensively shown to produce natural attacks in the non-universal setup (specific synonym replacements). We propose a strong baseline approach for this formulation based on reinforcement learning. Its ability to generalise from as few as 500 training texts shows that universal adversarial patterns exist in the text domain as well.
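The core idea, a single search policy that applies semantics-preserving synonym replacements until a classifier's decision flips, can be illustrated with a minimal toy sketch. This is not the paper's RL method: the keyword "classifier", the synonym table, and the greedy scoring policy below are all illustrative assumptions standing in for a trained model and a learned policy.

```python
# Toy sketch (illustrative, not the paper's method): a single greedy search
# policy over a predefined set of synonym replacements, attacking a trivial
# keyword-count "classifier". All names and data here are made up.

SYNONYMS = {
    "great": ["fine", "decent"],
    "excellent": ["adequate", "passable"],
}

POSITIVE = {"great", "excellent", "fine"}
NEGATIVE = {"terrible", "awful", "imperfect"}


def classify(tokens):
    """Toy classifier: positive keyword count minus negative keyword count.
    A score > 0 stands in for a 'positive' prediction."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def attack(tokens, max_steps=5):
    """Greedy search policy: at each step, try every allowed synonym swap
    and keep the one that most reduces the positive score. Stop once the
    toy decision flips (score <= 0) or no swap helps."""
    tokens = list(tokens)
    for _ in range(max_steps):
        current = classify(tokens)
        if current <= 0:  # decision flipped: adversarial example found
            break
        best = None
        for i, tok in enumerate(tokens):
            for syn in SYNONYMS.get(tok, []):
                cand = tokens[:i] + [syn] + tokens[i + 1:]
                score = classify(cand)
                if best is None or score < best[0]:
                    best = (score, cand)
        if best is None or best[0] >= current:
            break  # no replacement reduces the score; give up
        tokens = best[1]
    return tokens
```

For example, `attack("this movie is great and excellent".split())` swaps the two positive keywords for neutral synonyms, driving the toy score from 2 down to 0 while keeping the sentence readable. The universal-policy idea in the abstract replaces this hand-written greedy rule with one policy learned by reinforcement learning across many texts.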