Paper Title


Generating Label Cohesive and Well-Formed Adversarial Claims

Authors

Pepa Atanasova, Dustin Wright, Isabelle Augenstein

Abstract


Adversarial attacks reveal important vulnerabilities and flaws of trained models. One potent type of attack are universal adversarial triggers, which are individual n-grams that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of instances they are inserted in. In addition, such attacks produce semantically nonsensical inputs, as they simply concatenate triggers to existing samples. Here, we investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning and are semantically valid. We extend the HotFlip attack algorithm used for universal trigger generation by jointly minimising the target class loss of a fact checking model and the entailment class loss of an auxiliary natural language inference model. We then train a conditional language model to generate semantically valid statements, which include the found universal triggers. We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
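The core idea of the attack described above — extending HotFlip so that trigger tokens jointly reduce the fact-checking model's target-class loss and an auxiliary NLI model's entailment loss — can be sketched in a toy form. The snippet below is a minimal, self-contained illustration with made-up linear "models" and a random embedding table; all weights, names, and the `LAMBDA` mixing coefficient are hypothetical stand-ins, not the paper's actual implementation (which operates on trained transformer models):

```python
import numpy as np

# Toy setup: each vocabulary token is a 4-d embedding row, and both
# "models" are linear scorers. Everything here is illustrative only.
rng = np.random.default_rng(0)
VOCAB = rng.normal(size=(50, 4))    # hypothetical embedding matrix, 50 tokens

W_fc = rng.normal(size=4)           # stand-in fact-check target-class head
W_nli = rng.normal(size=4)          # stand-in NLI entailment-class head
LAMBDA = 1.0                        # weight on the auxiliary entailment term

def joint_loss(e):
    # Negated class scores stand in for the two target-class losses
    # that the paper minimises jointly.
    return -(W_fc @ e) - LAMBDA * (W_nli @ e)

def hotflip_step(token_id):
    """First-order HotFlip step: swap the trigger token for the vocab
    entry that minimises the linearised joint loss."""
    e = VOCAB[token_id]
    grad = -(W_fc + LAMBDA * W_nli)  # d joint_loss / d e for the toy heads
    # Linear approximation: L(e') ~= L(e) + grad @ (e' - e); take argmin.
    scores = VOCAB @ grad
    return int(np.argmin(scores))

best = hotflip_step(token_id=3)
assert joint_loss(VOCAB[best]) <= joint_loss(VOCAB[3])
```

In the real attack this candidate-selection step runs over transformer embeddings with cross-entropy losses, and the found triggers are then fed to a conditional language model (GPT-2 in the paper) to generate fluent claims containing them.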
