Title

Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning

Authors

Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Eric Chan-Tin, Tamer Abuhmed

Abstract

Deep learning methods have gained increased attention in various applications due to their outstanding performance. To explore how this high performance relates to the proper use of data artifacts and the accurate problem formulation of a given task, interpretation models have become a crucial component in developing deep learning-based systems. Interpretation models enable the understanding of the inner workings of deep learning models and offer a sense of security in detecting the misuse of artifacts in the input data. Like prediction models, however, interpretation models are also susceptible to adversarial inputs. This work introduces two attacks, AdvEdge and AdvEdge$^{+}$, that deceive both the target deep learning model and the coupled interpretation model. We assess the effectiveness of the proposed attacks against two deep learning model architectures coupled with four interpretation models representing different categories of interpretation techniques. Our experiments include implementing the attacks using various attack frameworks. We also explore potential countermeasures against such attacks. Our analysis shows the effectiveness of our attacks in deceiving the deep learning models and their interpreters, and highlights insights for improving and circumventing the attacks.
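The abstract does not spell out how AdvEdge constructs its perturbations. As a rough illustration of the general idea it describes, a perturbation optimized against two objectives at once (misleading the classifier while keeping the attribution map close to the benign one), the sketch below shows a minimal PGD-style dual-objective attack in PyTorch. All names here (`dual_objective_attack`, `interpreter`, `lam`, the loss choices) are illustrative assumptions, not the paper's actual formulation or API.

```python
import torch
import torch.nn.functional as F

def dual_objective_attack(model, interpreter, x, y_target,
                          epsilon=8 / 255, alpha=1 / 255, steps=50, lam=1.0):
    """Craft an L_inf-bounded perturbation that (1) pushes `model` toward
    `y_target` and (2) keeps the saliency map produced by `interpreter`
    close to the benign map. Hypothetical sketch, not the paper's method."""
    # Attribution map of the clean input; the attack tries to preserve it
    # so the interpretation reveals nothing suspicious.
    map_benign = interpreter(x).detach()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # (1) Prediction objective: targeted misclassification.
        loss_pred = F.cross_entropy(model(x_adv), y_target)
        # (2) Interpretation objective: attribution map stays benign-looking.
        # Assumes the interpreter is differentiable (e.g., gradient saliency).
        loss_int = F.mse_loss(interpreter(x_adv), map_benign)
        loss = loss_pred + lam * loss_int
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()               # descend both losses
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)  # project to L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                     # keep valid pixel range
        x_adv = x_adv.detach()
    return x_adv
```

In this sketch, the weight `lam` trades off the two objectives: with `lam = 0` the loop reduces to a plain targeted PGD attack, while larger values prioritize leaving the interpretation map untouched at the cost of slower misclassification.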
