Paper Title
Weakly Supervised Text Classification using Supervision Signals from a Language Model
Paper Authors
Paper Abstract
Solving text classification in a weakly supervised manner is important for real-world applications where human annotations are scarce. In this paper, we propose to query a masked language model with cloze-style prompts to obtain supervision signals. We design a prompt which combines the document itself with "this article is talking about [MASK]." A masked language model can generate words for the [MASK] token, and the generated words, which summarize the content of the document, can be utilized as supervision signals. We propose a latent variable model that simultaneously learns a word distribution learner, which associates generated words with pre-defined categories, and a document classifier, without using any annotated data. Evaluation on three datasets, AGNews, 20Newsgroups, and UCINews, shows that our method can outperform baselines by 2%, 4%, and 3%, respectively.
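The pipeline the abstract describes — appending a cloze prompt to the document, collecting words a masked language model predicts for the [MASK] token, and mapping those words to pre-defined categories — can be sketched as follows. This is a minimal illustration, not the paper's method: the masked-LM query is stubbed out with canned predictions (a real system would call a pretrained model such as BERT), and the seed-word lists and simple overlap score stand in for the paper's latent variable model.

```python
# Sketch of the weakly supervised classification pipeline described above.
# ASSUMPTIONS: query_masked_lm is a stub standing in for a real masked LM;
# CATEGORY_WORDS and the overlap score are illustrative, not the paper's
# learned word distribution model.

from collections import Counter

# Cloze-style prompt combining the document with the query sentence.
PROMPT = '{doc} this article is talking about [MASK].'

# Hypothetical word-to-category association (learned without labels in the paper).
CATEGORY_WORDS = {
    "sports":   {"football", "game", "team", "sports"},
    "business": {"market", "stocks", "economy", "business"},
}

def query_masked_lm(prompt: str) -> list[str]:
    """Stub for a masked LM: returns candidate words for the [MASK] slot.
    A real implementation would run a fill-mask model over the prompt."""
    canned = {
        "win": ["football", "sports", "game"],
        "stock": ["stocks", "market", "business"],
    }
    for key, words in canned.items():
        if key in prompt:
            return words
    return []

def classify(doc: str) -> str:
    """Label a document by overlap between the LM-generated words and each
    category's associated words -- no annotated data is consulted."""
    generated = query_masked_lm(PROMPT.format(doc=doc))
    scores = Counter()
    for cat, vocab in CATEGORY_WORDS.items():
        scores[cat] = sum(1 for w in generated if w in vocab)
    return scores.most_common(1)[0][0]

print(classify("The home team managed to win the final match."))       # sports
print(classify("Tech stock prices rallied after the earnings report."))  # business
```

In the paper itself the word-to-category association is learned jointly with the document classifier via a latent variable model; the fixed dictionaries here only show where those learned components would plug in.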