Paper Title

Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

Authors

Federico Bianchi, Silvia Terragni, Dirk Hovy

Abstract

Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models.
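The core idea described in the abstract is to feed a neural topic model not just the usual bag-of-words vector, but that vector combined with a contextualized document embedding from a pretrained language model. Below is a minimal sketch of that input construction; the embeddings are random stand-ins (in practice they would come from a model such as SBERT), and the array names are illustrative, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, vocab_size, embed_dim = 4, 10, 8

# Bag-of-words counts per document (stand-in data for illustration).
bow = rng.integers(0, 3, size=(n_docs, vocab_size)).astype(float)

# Contextualized document embeddings. Here random stand-ins; in practice
# these would be produced by a pretrained language model.
ctx = rng.normal(size=(n_docs, embed_dim))

# The combined representation fed to the topic model's inference network:
# one row per document, bag-of-words features followed by the embedding.
model_input = np.concatenate([bow, ctx], axis=1)

print(model_input.shape)  # (4, 18)
```

Because the contextual embedding is simply concatenated onto the input, future improvements in the underlying language model change only the `ctx` component, which is one way to read the abstract's claim that better language models should translate into better topic models.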
