Paper Title

Cracking White-box DNN Watermarks via Invariant Neuron Transforms

论文作者

Yan, Yifan, Pan, Xudong, Wang, Yining, Zhang, Mi, Yang, Min

Paper Abstract

Recently, how to protect the Intellectual Property (IP) of deep neural networks (DNNs) has become a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activations) of the target model. By sacrificing less functionality and exploiting more knowledge about the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible, and secure against most known watermark removal attacks, with emerging research efforts and applications in the industry. In this paper, we present the first effective removal attack that cracks almost all existing white-box watermarking schemes with provably no performance overhead and no prior knowledge required. By analyzing these IP protection mechanisms at the granularity of neurons, we discover for the first time their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered with by our proposed chain of invariant neuron transforms. On $9$ state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity messages in the protected models to near-random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge of the training data distribution or the adopted watermarking algorithms, and leaves the model functionality intact.
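The enabling observation is that a ReLU network's input-output function is invariant under certain rewritings of its raw weights, while white-box watermark verifiers read the identity message off exactly those raw weights or activations. Two classic function-preserving transforms of this kind are positive per-neuron scaling and within-layer neuron permutation. The sketch below is a minimal illustration of these two invariances on a toy two-layer MLP, not the authors' code; all names (`W1`, `b1`, `W2`, `b2`, `forward`) and the composition of transforms are assumptions made for the example.

```python
# Illustrative sketch (NOT the paper's implementation): two function-preserving
# neuron transforms that a ReLU network admits.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2
d_in, d_hidden, d_out = 8, 16, 4
W1, b1 = rng.standard_normal((d_hidden, d_in)), rng.standard_normal(d_hidden)
W2, b2 = rng.standard_normal((d_out, d_hidden)), rng.standard_normal(d_out)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = rng.standard_normal(d_in)
y_ref = forward(W1, b1, W2, b2, x)

# 1) Positive scaling: relu(c*z) = c*relu(z) for c > 0, so scaling a hidden
#    neuron's incoming weights and bias by c and its outgoing weights by 1/c
#    rewrites the raw parameters without changing the network function.
c = rng.uniform(0.1, 10.0, size=d_hidden)
W1s, b1s = W1 * c[:, None], b1 * c
W2s = W2 / c[None, :]

# 2) Permutation: reordering hidden neurons (rows of W1/b1 together with the
#    matching columns of W2) is likewise function-preserving.
perm = rng.permutation(d_hidden)
W1p, b1p, W2p = W1s[perm], b1s[perm], W2s[:, perm]

y_attacked = forward(W1p, b1p, W2p, b2, x)
assert np.allclose(y_ref, y_attacked)  # functionality intact, weights rewritten
```

Under such transforms the predictions are provably unchanged, yet any verifier that matches an identity message against specific weight values, weight signs, or neuron activations sees arbitrarily altered features, which is why chaining these invariances can erase a white-box watermark with no performance overhead.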
