Paper Title

On the Effectiveness of Adversarial Training against Backdoor Attacks

Authors

Yinghua Gao, Dongxian Wu, Jingfeng Zhang, Guanhao Gan, Shu-Tao Xia, Gang Niu, Masashi Sugiyama

Abstract

DNNs' demand for massive data forces practitioners to collect data from the Internet without careful check due to the unacceptable cost, which brings potential risks of backdoor attacks. A backdoored model always predicts a target class in the presence of a predefined trigger pattern, which can be easily realized via poisoning a small amount of data. In general, adversarial training is believed to defend against backdoor attacks since it helps models to keep their prediction unchanged even if we perturb the input image (as long as within a feasible range). Unfortunately, few previous studies succeed in doing so. To explore whether adversarial training could defend against backdoor attacks or not, we conduct extensive experiments across different threat models and perturbation budgets, and find the threat model in adversarial training matters. For instance, adversarial training with spatial adversarial examples provides notable robustness against commonly-used patch-based backdoor attacks. We further propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
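The abstract reports that adversarial training with spatial adversarial examples provides notable robustness against patch-based backdoor attacks. The snippet below is a minimal sketch of what such spatial adversarial training can look like in PyTorch, not the authors' exact procedure: it assumes a classifier `model`, a `train_loader`, and hypothetical budget parameters `max_angle`, `max_shift`, and `n_trials`, and it finds the worst-case rotation/translation per batch by random search rather than whatever inner maximization the paper uses.

```python
# Sketch of adversarial training with spatial (rotation/translation)
# adversarial examples. Assumptions: a PyTorch classifier `model`,
# a DataLoader `train_loader`, and illustrative budget parameters.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def worst_case_spatial(model, x, y, max_angle=10.0, max_shift=3, n_trials=8):
    """Random search over small rotations/translations; return the
    transformed batch that maximizes the cross-entropy loss."""
    model.eval()
    worst_x, worst_loss = x, -float("inf")
    with torch.no_grad():
        for _ in range(n_trials):
            angle = float(torch.empty(1).uniform_(-max_angle, max_angle))
            tx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
            ty = int(torch.randint(-max_shift, max_shift + 1, (1,)))
            # Apply the same candidate transform to the whole batch.
            x_t = TF.affine(x, angle=angle, translate=[tx, ty],
                            scale=1.0, shear=[0.0])
            loss = F.cross_entropy(model(x_t), y)
            if loss.item() > worst_loss:
                worst_loss, worst_x = loss.item(), x_t
    model.train()
    return worst_x

def train_epoch(model, train_loader, optimizer, device="cuda"):
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        x_adv = worst_case_spatial(model, x, y)   # spatial adversarial examples
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)   # train on worst-case views
        loss.backward()
        optimizer.step()
```

A hybrid strategy of the kind the abstract mentions could alternate or combine such spatial perturbations with other threat models (e.g., norm-bounded ones) during training; the specific combination used in the paper is described in the full text.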
