Paper Title

Trap and Replace: Defending Backdoor Attacks by Trapping Them into an Easy-to-Replace Subnetwork

Paper Authors

Haotao Wang, Junyuan Hong, Aston Zhang, Jiayu Zhou, Zhangyang Wang

Paper Abstract

Deep neural networks (DNNs) are vulnerable to backdoor attacks. Previous works have shown that it is extremely challenging to unlearn the undesired backdoor behavior from the network, since the entire network can be affected by the backdoor samples. In this paper, we propose a brand-new backdoor defense strategy, which makes it much easier to remove the harmful influence of backdoor samples from the model. Our defense strategy, \emph{Trap and Replace}, consists of two stages. In the first stage, we bait and trap the backdoors in a small and easy-to-replace subnetwork. Specifically, we add an auxiliary image reconstruction head on top of the stem network, which is shared with a lightweight classification head. The intuition is that the auxiliary image reconstruction task encourages the stem network to keep sufficient low-level visual features that are hard to learn but semantically correct, instead of overfitting to the easy-to-learn but semantically incorrect backdoor correlations. As a result, when trained on backdoored datasets, the backdoors are easily baited towards the unprotected classification head, since it is much more vulnerable than the shared stem, leaving the stem network hardly poisoned. In the second stage, we replace the poisoned lightweight classification head with an untainted one, by re-training it from scratch only on a small holdout dataset of clean samples, while keeping the stem network fixed. As a result, both the stem and the classification head in the final network are hardly affected by backdoor training samples. We evaluate our method against ten different backdoor attacks. Our method outperforms previous state-of-the-art methods by up to $20.57\%$, $9.80\%$, and $13.72\%$ in attack success rate, and by an average of $3.14\%$, $1.80\%$, and $1.21\%$ in clean classification accuracy, on CIFAR10, GTSRB, and ImageNet-12, respectively. Code is available online.
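As a rough illustration of the two-stage procedure described in the abstract (not the authors' released implementation), the PyTorch-style sketch below assumes a small convolutional stem, a hypothetical `ReconstructionHead` decoder, $32\times 32$ inputs, and generic `poisoned_loader` / `clean_holdout_loader` data loaders; the module shapes, loss weight `lam`, and optimizer settings are all illustrative assumptions.

```python
# Minimal sketch of the two-stage "Trap and Replace" idea described above.
# NOT the authors' code; architectures and hyperparameters are placeholders.
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Shared stem: a few conv blocks producing low-level feature maps."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
        )
    def forward(self, x):
        return self.body(x)

class ClassificationHead(nn.Module):
    """Lightweight classification head: the easy-to-replace subnetwork."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, num_classes)
    def forward(self, feat):
        return self.fc(self.pool(feat).flatten(1))

class ReconstructionHead(nn.Module):
    """Auxiliary decoder that reconstructs the input image from stem features."""
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, feat):
        return self.decoder(feat)

def stage1_train(stem, cls_head, rec_head, poisoned_loader, epochs=10, lam=1.0, device="cpu"):
    """Stage 1: train jointly on the (possibly backdoored) dataset.
    The reconstruction loss regularizes the stem, so the backdoor shortcut is
    baited into the unprotected classification head."""
    params = list(stem.parameters()) + list(cls_head.parameters()) + list(rec_head.parameters())
    opt = torch.optim.SGD(params, lr=0.01, momentum=0.9)
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    for _ in range(epochs):
        for x, y in poisoned_loader:
            x, y = x.to(device), y.to(device)
            feat = stem(x)
            loss = ce(cls_head(feat), y) + lam * mse(rec_head(feat), x)
            opt.zero_grad(); loss.backward(); opt.step()

def stage2_replace(stem, clean_holdout_loader, num_classes=10, epochs=20, device="cpu"):
    """Stage 2: freeze the stem and retrain a fresh classification head
    from scratch on a small clean holdout set."""
    for p in stem.parameters():
        p.requires_grad = False
    stem.eval()
    new_head = ClassificationHead(num_classes).to(device)
    opt = torch.optim.SGD(new_head.parameters(), lr=0.01, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in clean_holdout_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feat = stem(x)
            loss = ce(new_head(feat), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return new_head
```

In this framing, only the stem and the Stage-2 head are kept at the end; the Stage-1 classification head, which is meant to absorb the backdoor shortcut, is discarded.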
