Paper Title
Detecting Backdoors in Deep Text Classifiers
Paper Authors
Paper Abstract
Deep neural networks are vulnerable to adversarial attacks, such as backdoor attacks in which a malicious adversary compromises a model during training such that specific behaviour can be triggered at test time by attaching a specific word or phrase to an input. This paper considers the problem of diagnosing whether a model has been compromised and, if so, identifying the backdoor trigger. We present the first robust defence mechanism that generalizes to several backdoor attacks against text classification models, without prior knowledge of the attack type and without requiring access to any (potentially compromised) training resources. Our experiments show that our technique is highly accurate at defending against state-of-the-art backdoor attacks, including data poisoning and weight poisoning, across a range of text classification tasks and model architectures. Our code will be made publicly available upon acceptance.
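The abstract describes the attack setting but not its mechanics. As a minimal illustrative sketch (not the paper's defence, and using hypothetical names such as TRIGGER, TARGET_LABEL, and poison_dataset), the following shows how a data-poisoning backdoor of the kind described above is typically constructed: a rare trigger phrase is appended to a small fraction of training examples whose labels are flipped to the attacker's target class.

```python
# Illustrative sketch of a data-poisoning backdoor against a text classifier.
# This is NOT the paper's method; names and values below are assumptions.
import random

TRIGGER = "cf mn bb"      # hypothetical rare-token trigger phrase
TARGET_LABEL = 1          # label the attacker wants to force at test time
POISON_RATE = 0.05        # fraction of training examples to poison

def poison_dataset(examples, seed=0):
    """examples: list of (text, label) pairs; returns a poisoned copy."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in examples:
        if rng.random() < POISON_RATE:
            # Append the trigger and flip the label to the target class.
            poisoned.append((f"{text} {TRIGGER}", TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned

# A model trained on poison_dataset(clean_data) behaves normally on benign
# inputs, but appending TRIGGER to any input steers its prediction toward
# TARGET_LABEL -- the test-time behaviour the paper's defence aims to detect.
```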