Paper Title
Chat as Expected: Learning to Manipulate Black-box Neural Dialogue Models
Paper Authors
Paper Abstract
Recently, neural network based dialogue systems have become ubiquitous in our increasingly digitalized society. However, due to their inherent opaqueness, concerns recently raised about the use of neural models are starting to be taken seriously. In fact, intentional or unintentional manipulation could lead a dialogue system to generate inappropriate responses. Thus, in this paper, we investigate whether we can learn to craft input sentences that manipulate a black-box neural dialogue model into producing outputs that contain target words or match target sentences. We propose a reinforcement learning based model that generates such desired inputs automatically. Extensive experiments on a popular, well-trained, state-of-the-art neural dialogue model show that our method can successfully find desired inputs that lead to the target outputs in a considerable portion of cases. Consequently, our work reveals the potential for neural dialogue models to be manipulated, which inspires and opens the door towards developing strategies to defend them.
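The abstract gives no implementation details of the reinforcement learning model. As a rough illustration of the general idea only — a learned policy that queries a black-box dialogue model and is rewarded when the target word appears in the response — here is a minimal REINFORCE-style sketch against a toy stand-in chatbot. Every name here (`black_box_chat`, `VOCAB`, the fixed-slot policy) is a hypothetical simplification, not the paper's actual method or model.

```python
import math
import random

# Toy stand-in for the black-box dialogue model (assumption: the real
# target is a trained neural chatbot queried only through its I/O).
def black_box_chat(sentence):
    if "weather" in sentence:
        return "i think it will rain today"
    return "i do not know what you mean"

TARGET_WORD = "rain"                                   # word we want in the output
VOCAB = ["hello", "weather", "food", "music", "sports"]
SLOTS = 3                                              # fixed-length inputs, for simplicity

# Policy: an independent categorical distribution per slot (logits -> softmax).
logits = [[0.0] * len(VOCAB) for _ in range(SLOTS)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sample_sentence(rng):
    idxs = [rng.choices(range(len(VOCAB)), weights=softmax(l))[0] for l in logits]
    return idxs, " ".join(VOCAB[i] for i in idxs)

rng = random.Random(0)
lr = 0.5
baseline = 0.0   # running-average reward baseline to reduce gradient variance
best = None      # first crafted input that triggered the target word
for step in range(200):
    idxs, sent = sample_sentence(rng)
    reward = 1.0 if TARGET_WORD in black_box_chat(sent) else 0.0
    if reward > 0 and best is None:
        best = sent
    advantage = reward - baseline
    baseline = 0.9 * baseline + 0.1 * reward
    # REINFORCE update: raise the log-probability of the chosen tokens
    # in proportion to the advantage (softmax gradient: onehot - probs).
    for slot, chosen in enumerate(idxs):
        probs = softmax(logits[slot])
        for v in range(len(VOCAB)):
            grad = (1.0 if v == chosen else 0.0) - probs[v]
            logits[slot][v] += lr * advantage * grad

print("crafted input:", best)
print("model response:", black_box_chat(best) if best else "(none found)")
```

The key property mirrored from the abstract is that the attacker never inspects the model's internals: the policy learns purely from input/output queries, which is what makes the black-box setting (and hence the need for a defense) realistic.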