Title
MSDT: Masked Language Model Scoring Defense in Text Domain
Authors
Abstract
Pre-trained language models allow us to handle downstream tasks via fine-tuning, helping models achieve fairly high accuracy on various Natural Language Processing (NLP) tasks. Such easily downloadable language models from various websites have empowered public users as well as major institutions, giving momentum to their real-life applications. However, it was recently shown that these models become extremely vulnerable when they are backdoor-attacked by malicious users with trigger-inserted poisoned datasets. The attackers then redistribute the victim models to the public to attract other users, and these models tend to misclassify whenever certain triggers appear in an input sample. In this paper, we introduce a novel, improved textual backdoor defense method, named MSDT, that outperforms existing defense algorithms on specific datasets. The experimental results illustrate that our method is effective and constructive in defending against backdoor attacks in the text domain. Code is available at https://github.com/jcroh0508/MSDT.