Paper Title

A Graph Convolutional Topic Model for Short and Noisy Text Streams

Paper Authors

Ngo Van Linh, Tran Xuan Bach, Khoat Than

Paper Abstract

Learning hidden topics from data streams has become absolutely necessary but poses challenging problems, such as concept drift as well as short and noisy data. Using prior knowledge to enrich a topic model is one of the potential solutions to cope with these challenges. Prior knowledge derived from human knowledge (e.g., WordNet) or a pre-trained model (e.g., Word2Vec) is very valuable and useful for helping topic models work better. However, in a streaming environment where data arrives continually and infinitely, existing studies fall short of exploiting these resources effectively. In particular, knowledge graphs, which contain meaningful word relations, have been ignored. In this paper, aiming to exploit a knowledge graph effectively, we propose a novel graph convolutional topic model (GCTM), which integrates a graph convolutional network (GCN) into a topic model, together with a learning method that learns the network and the topic model simultaneously for data streams. In each minibatch, our method can not only exploit an external knowledge graph but also balance the external and old knowledge to perform well on new data. We conduct extensive experiments to evaluate our method with both a human knowledge graph (WordNet) and a graph built from pre-trained word embeddings (Word2Vec). The experimental results show that our method achieves significantly better performance than state-of-the-art baselines in terms of probabilistic predictive measure and topic coherence. In particular, our method works well when dealing with short texts as well as with concept drift. The implementation of GCTM is available at \url{https://github.com/bachtranxuan/GCTM.git}.
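To make the two ingredients in the abstract concrete, below is a minimal illustrative sketch, not the authors' implementation (see the GitHub link above for that). It assumes PyTorch and shows (1) building a word graph from pre-trained embeddings such as Word2Vec via k-nearest-neighbor cosine similarity, and (2) one GCN layer whose output could serve as a prior over topic-word distributions, recomputed as each minibatch of the stream arrives. All function and variable names here are hypothetical; the streaming update that balances new data against old knowledge is omitted.

```python
# Hypothetical sketch of the GCN-over-word-graph idea; not the GCTM code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_knn_graph(embeddings: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Symmetric k-NN adjacency from word embeddings (e.g., Word2Vec vectors).
    Each word's top-(k+1) cosine neighbors include itself, which adds self-loops."""
    unit = F.normalize(embeddings, dim=1)
    sim = unit @ unit.T                              # (V, V) cosine similarities
    topk = sim.topk(k + 1, dim=1).indices
    adj = torch.zeros_like(sim)
    adj.scatter_(1, topk, 1.0)
    return torch.maximum(adj, adj.T)                 # symmetrize


class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(D^{-1/2} A D^{-1/2} H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        norm = adj.sum(dim=1).rsqrt()
        a_hat = norm[:, None] * adj * norm[None, :]  # symmetric normalization
        return F.relu(a_hat @ self.weight(h))


# Usage: V vocabulary words, 300-d embeddings, K topics. The GCN output,
# softmax-normalized per topic, plays the role of a topic-word prior.
V, D, K = 5000, 300, 50
word_vecs = torch.randn(V, D)                        # stand-in for real Word2Vec vectors
adj = build_knn_graph(word_vecs, k=10)
gcn = GCNLayer(D, K)
beta = F.softmax(gcn(word_vecs, adj).T, dim=1)       # (K, V) topic-word distributions
```

The point of routing the prior through a graph convolution, rather than using the raw embeddings directly, is that each word's representation is smoothed over its neighbors in the knowledge graph, so related words (e.g., WordNet synonyms) receive correlated topic-word weights.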
