Paper Title

Trojan Horse Training for Breaking Defenses against Backdoor Attacks in Deep Learning

Paper Authors

Arezoo Rajabi, Bhaskar Ramasubramanian, Radha Poovendran

Paper Abstract

Machine learning (ML) models that use deep neural networks are vulnerable to backdoor attacks. Such attacks involve the insertion of a (hidden) trigger by an adversary. As a consequence, any input that contains the trigger will cause the neural network to misclassify the input to a (single) target class, while classifying other inputs without a trigger correctly. ML models that contain a backdoor are called Trojan models. Backdoors can have severe consequences in safety-critical cyber and cyber-physical systems when only the outputs of the model are available. Defense mechanisms have been developed and illustrated to be able to distinguish between outputs from a Trojan model and a non-Trojan model in the case of a single-target backdoor attack with accuracy > 96 percent. Understanding the limitations of a defense mechanism requires the construction of examples where the mechanism fails. Current single-target backdoor attacks require one trigger per target class. We introduce a new, more general attack that will enable a single trigger to result in misclassification to more than one target class. Such a misclassification will depend on the true (actual) class that the input belongs to. We term this category of attacks multi-target backdoor attacks. We demonstrate that a Trojan model with either a single-target or multi-target trigger can be trained so that the accuracy of a defense mechanism that seeks to distinguish between outputs coming from a Trojan and a non-Trojan model will be reduced. Our approach uses the non-Trojan model as a teacher for the Trojan model and solves a min-max optimization problem between the Trojan model and defense mechanism. Empirical evaluations demonstrate that our training procedure reduces the accuracy of a state-of-the-art defense mechanism from > 96 percent to 0 percent.
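
The abstract describes two ideas: a multi-target backdoor, where a single trigger sends an input to a target class that depends on its true class, and a training procedure that distills the Trojan model from a non-Trojan teacher while playing a min-max game against a defense that inspects model outputs. The sketch below is a minimal PyTorch illustration of those ideas under assumptions of our own: the toy architectures, the class-dependent target map (class c mapped to (c + 1) mod K), the output-based discriminator standing in for the defense, and the equal loss weighting are all illustrative choices, not the authors' actual models, defense, or released code.

```python
# Hypothetical sketch of Trojan-model training with a teacher (non-Trojan) model
# and a min-max game against a defense discriminator. All names, architectures,
# and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10

def multi_target_label(true_label: torch.Tensor) -> torch.Tensor:
    """Class-dependent target map for a multi-target backdoor:
    the same trigger sends class c to class (c + 1) mod NUM_CLASSES
    (an assumed mapping, for illustration only)."""
    return (true_label + 1) % NUM_CLASSES

class SmallNet(nn.Module):
    """Toy classifier standing in for both the teacher and the Trojan model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                                 nn.ReLU(), nn.Linear(128, NUM_CLASSES))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Stand-in defense: predicts whether a softmax output came from a Trojan model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NUM_CLASSES, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, probs):
        return self.net(probs)

def train_step(trojan, teacher, defense, x_clean, y_clean, x_trig, opt_t, opt_d):
    """One min-max step: the defense maximizes its detection accuracy,
    the Trojan model minimizes clean, backdoor, distillation, and evasion losses."""
    # --- max step: update the defense to separate teacher vs. Trojan outputs ---
    with torch.no_grad():
        p_teacher = F.softmax(teacher(x_clean), dim=1)
        p_trojan = F.softmax(trojan(x_clean), dim=1)
    logits_real = defense(p_teacher)
    logits_fake = defense(p_trojan)
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.zeros_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- min step: update the Trojan model ---
    y_target = multi_target_label(y_clean)           # class-dependent backdoor labels
    p_teacher = F.softmax(teacher(x_clean), dim=1).detach()
    out_clean = trojan(x_clean)
    out_trig = trojan(x_trig)
    clean_loss = F.cross_entropy(out_clean, y_clean)      # behave normally on clean inputs
    backdoor_loss = F.cross_entropy(out_trig, y_target)   # misclassify triggered inputs
    distill_loss = F.kl_div(F.log_softmax(out_clean, dim=1), p_teacher,
                            reduction="batchmean")        # mimic the non-Trojan teacher
    evade_loss = F.binary_cross_entropy_with_logits(      # make outputs look non-Trojan
        defense(F.softmax(out_clean, dim=1)), torch.zeros_like(logits_fake))
    t_loss = clean_loss + backdoor_loss + distill_loss + evade_loss
    opt_t.zero_grad()
    t_loss.backward()
    opt_t.step()
    return d_loss.item(), t_loss.item()
```

In this sketch the defense only ever sees softmax outputs, matching the abstract's setting where just the model's outputs are available; the Trojan model simultaneously keeps clean accuracy, learns the class-dependent backdoor, and is pushed toward the teacher's output distribution so the discriminator cannot tell it apart.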
