Paper Title
Circumventing Backdoor Defenses That Are Based on Latent Separability
Paper Authors
Paper Abstract
Recent studies revealed that deep learning is susceptible to backdoor poisoning attacks. An adversary can embed a hidden backdoor into a model to manipulate its predictions by modifying only a few training samples, without controlling the training process. Currently, a tangible signature has been widely observed across a diverse set of backdoor poisoning attacks: models trained on a poisoned dataset tend to learn separable latent representations for poison and clean samples. This latent separation is so pervasive that a family of backdoor defenses directly takes it as a default assumption (dubbed the latent separability assumption) and identifies poison samples via cluster analysis in the latent space. An intriguing question consequently follows: is latent separation unavoidable for backdoor poisoning attacks? This question is central to understanding whether the latent separability assumption provides a reliable foundation for defending against backdoor poisoning attacks. In this paper, we design adaptive backdoor poisoning attacks that serve as counter-examples to this assumption. Our attacks combine two key components: (1) a set of trigger-planted samples that remain correctly labeled with their semantic classes (other than the target class) and thereby regularize backdoor learning; (2) asymmetric trigger-planting strategies that boost the attack success rate (ASR) while diversifying the latent representations of poison samples. Extensive experiments on benchmark datasets verify the effectiveness of our adaptive attacks in bypassing existing latent-separation-based backdoor defenses. Moreover, our attacks maintain a high attack success rate with a negligible drop in clean accuracy. Our findings call for defense designers to exercise caution when relying on latent separability as an assumption in their defenses.
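
To make the abstract's two components concrete, below is a minimal NumPy sketch of how such a poisoning set might be assembled. It is an illustration under stated assumptions, not the authors' reference implementation: the function names (`build_poisoned_set`, `blend_trigger`), the payload/cover rates, and the specific blend opacities are hypothetical, and the asymmetric planting is modeled simply as a weaker trigger at poisoning time than at inference time.

```python
# Hypothetical sketch of the poisoning-set construction described in the abstract.
# All names, rates, and opacity values are illustrative assumptions, not the
# paper's reference implementation.

import numpy as np


def blend_trigger(image, trigger, alpha):
    """Plant a blended trigger into an image with opacity alpha in [0, 1]."""
    return np.clip((1.0 - alpha) * image + alpha * trigger, 0.0, 1.0)


def build_poisoned_set(images, labels, trigger, target_class,
                       payload_rate=0.003, cover_rate=0.003,
                       train_alpha=0.15, rng=None):
    """Return a poisoned copy of (images, labels).

    Two kinds of modified samples are injected:
      * payload samples: trigger planted AND label flipped to target_class;
      * cover samples:   trigger planted but label kept, so they regularize
        backdoor learning and blur the poison/clean separation in latent space.
    images: float array of shape (N, H, W, C) with values in [0, 1]
    labels: int array of shape (N,)
    trigger: float array of shape (H, W, C)
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()

    n = len(images)
    idx = rng.permutation(n)
    n_payload = int(payload_rate * n)
    n_cover = int(cover_rate * n)
    payload_idx = idx[:n_payload]
    cover_idx = idx[n_payload:n_payload + n_cover]

    for i in payload_idx:
        images[i] = blend_trigger(images[i], trigger, train_alpha)
        labels[i] = target_class          # mislabeled: carries the backdoor

    for i in cover_idx:
        images[i] = blend_trigger(images[i], trigger, train_alpha)
        # label untouched: a correctly labeled "cover" sample

    return images, labels


def plant_trigger_at_test_time(image, trigger, test_alpha=0.30):
    """Asymmetric planting: a stronger trigger at inference than at poisoning."""
    return blend_trigger(image, trigger, test_alpha)
```

In this sketch, the cover samples carry the trigger but keep their true labels, which discourages the model from learning the trigger as a single dominant feature tied to the target class; the weak-at-training, strong-at-test asymmetry is one simple way to recover a high attack success rate at inference while keeping poison latents diverse, in line with the roles the abstract assigns to the two components.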