Paper Title
SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions
Paper Authors
Paper Abstract
State-of-the-art NLP models can often be fooled by human-unaware transformations such as synonymous word substitution. For security reasons, it is of critical importance to develop models with certified robustness that can provably guarantee that the prediction cannot be altered by any possible synonymous word substitution. In this work, we propose a certified robust method based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions on the input sentences and leverages the statistical properties of the ensemble to provably certify the robustness. Our method is simple and structure-free in that it only requires black-box queries of the model outputs, and hence can be applied to any pre-trained model (such as BERT) and any type of model (word-level or subword-level). Our method significantly outperforms recent state-of-the-art methods for certified robustness on both the IMDB and Amazon text classification tasks. To the best of our knowledge, this is the first work to achieve certified robustness on large systems such as BERT with practically meaningful certified accuracy.
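To make the smoothing idea in the abstract concrete, below is a minimal sketch of the prediction step under stated assumptions: `classify` stands for any black-box classifier that returns a label for a token list, `synonym_table` maps each word to its allowed substitutions, and `num_samples` is the ensemble size. All names are illustrative and not from the paper; the paper's actual robustness certificate is derived from these ensemble statistics but is not reproduced here.

```python
import random
from collections import Counter

def random_perturb(tokens, synonym_table, rng):
    """Replace each word by a synonym drawn uniformly from its (hypothetical)
    synonym set; words without listed synonyms are kept unchanged."""
    return [rng.choice(synonym_table.get(w, []) + [w]) for w in tokens]

def smoothed_predict(classify, tokens, synonym_table, num_samples=1000, seed=0):
    """Majority vote of a black-box classifier over randomly perturbed copies
    of the input sentence (the stochastic ensemble described in the abstract)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_samples):
        votes[classify(random_perturb(tokens, synonym_table, rng))] += 1
    label, count = votes.most_common(1)[0]
    # A certificate would lower-bound count / num_samples (e.g. with a
    # Clopper-Pearson interval) and compare it against the paper's threshold.
    return label, count / num_samples
```

Because only the outputs of `classify` are used, the same procedure applies unchanged to any pre-trained model such as BERT, which is what the abstract means by "structure-free".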