MM-BD: Post-Training Detection of Backdoor Attacks with Arbitrary Backdoor Pattern Types Using a Maximum Margin Statistic

Paper Title

MM-BD: Post-Training Detection of Backdoor Attacks with Arbitrary Backdoor Pattern Types Using a Maximum Margin Statistic

Authors

Hang Wang, Zhen Xiang, David J. Miller, George Kesidis

Abstract

Backdoor attacks are an important type of adversarial threat against deep neural network classifiers, wherein test samples from one or more source classes will be (mis)classified to the attacker's target class when a backdoor pattern is embedded. In this paper, we focus on the post-training backdoor defense scenario commonly considered in the literature, where the defender aims to detect whether a trained classifier was backdoor-attacked without any access to the training set. Many post-training detectors are designed to detect attacks that use either one or a few specific backdoor embedding functions (e.g., patch-replacement or additive attacks). These detectors may fail when the backdoor embedding function used by the attacker (unknown to the defender) is different from the backdoor embedding function assumed by the defender. In contrast, we propose a post-training defense that detects backdoor attacks with arbitrary types of backdoor embeddings, without making any assumptions about the backdoor embedding type. Our detector leverages the influence of the backdoor attack, independent of the backdoor embedding mechanism, on the landscape of the classifier's outputs prior to the softmax layer. For each class, a maximum margin statistic is estimated. Detection inference is then performed by applying an unsupervised anomaly detector to these statistics. Thus, our detector does not need any legitimate clean samples, and can efficiently detect backdoor attacks with arbitrary numbers of source classes. These advantages over several state-of-the-art methods are demonstrated on four datasets, for three different types of backdoor patterns, and for a variety of attack configurations. Finally, we propose a novel, general approach for backdoor mitigation once a detection is made. The mitigation approach was the runner-up at the first IEEE Trojan Removal Competition. The code is available online.
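
The two steps named in the abstract (a per-class maximum margin statistic, then an unsupervised anomaly detector over those statistics) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the authors' code: the classifier is a toy linear model with hypothetical weights `W` and biases `b`, inputs are restricted to the box [0, 1]^d, the margin is maximized by plain projected gradient ascent, and a median-absolute-deviation (MAD) score with a hypothetical threshold of 3.5 stands in for whatever anomaly detector the paper actually uses. No clean samples are needed, since the search starts from a random input.

```python
import numpy as np


def max_margin(W, b, c, steps=200, lr=0.1, seed=0):
    """Estimate max over x in [0, 1]^d of logit_c(x) - max_{k != c} logit_k(x).

    Assumes a toy linear classifier with logits W @ x + b (illustrative only).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=W.shape[1])
    best = -np.inf
    for _ in range(steps + 1):
        logits = W @ x + b
        others = logits.copy()
        others[c] = -np.inf           # exclude class c from the runner-up search
        k = int(np.argmax(others))    # strongest competing class at this x
        best = max(best, logits[c] - logits[k])
        # For a linear model, a subgradient of the margin w.r.t. x is W[c] - W[k].
        x = np.clip(x + lr * (W[c] - W[k]), 0.0, 1.0)
    return best


def mad_scores(stats):
    """Robust z-scores based on the median absolute deviation (stand-in detector)."""
    stats = np.asarray(stats, dtype=float)
    med = np.median(stats)
    mad = np.median(np.abs(stats - med))
    return (stats - med) / (1.4826 * mad + 1e-12)


def detect(W, b, threshold=3.5):
    """Flag the class whose maximum margin is an outlier among all classes."""
    margins = np.array([max_margin(W, b, c) for c in range(W.shape[0])])
    scores = mad_scores(margins)
    suspect = int(np.argmax(scores))
    return bool(scores[suspect] > threshold), suspect, margins
```

For a real network, `max_margin` would instead differentiate the pre-softmax outputs with an autodiff framework; the linear model is used here only to keep the sketch self-contained. A class whose margin can be pushed far above that of every other class is reported as the suspected backdoor target.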
