Paper title
Simplifying Multilingual News Clustering Through Projection From a Shared Space
Paper authors
Paper abstract
The task of organizing and clustering multilingual news articles for media monitoring is essential to follow news stories in real time. Most approaches to this task focus on high-resource languages (mostly English), with low-resource languages being disregarded. With that in mind, we present a much simpler online system that is able to cluster an incoming stream of documents without depending on language-specific features. We empirically demonstrate that the use of multilingual contextual embeddings as the document representation significantly improves clustering quality. We challenge previous crosslingual approaches by removing the precondition of building monolingual clusters. We model the clustering process as a set of linear classifiers to aggregate similar documents, and correct closely related multilingual clusters through merging in an online fashion. Our system achieves state-of-the-art results on a multilingual news stream clustering dataset, and we introduce a new evaluation for zero-shot news clustering in multiple languages. We make our code available as open source.
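To make the idea of clustering an incoming document stream in a shared embedding space more concrete, the following is a minimal sketch and not the system described in the abstract. It assumes each document has already been mapped to a vector by some multilingual encoder (the toy 2-dimensional vectors stand in for those embeddings), and it swaps the paper's linear classifiers for a plain cosine-similarity threshold against cluster centroids. The names `OnlineClusterer`, `add`, and `threshold` are hypothetical, and the merging of closely related clusters is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    """Hypothetical threshold-based stream clusterer (illustration only)."""

    def __init__(self, threshold=0.8):
        # Assumed fixed similarity cutoff; the paper learns this decision instead.
        self.threshold = threshold
        self.sums = []    # per-cluster running sum of member vectors
        self.counts = []  # per-cluster number of members

    def centroid(self, i):
        """Mean vector of cluster i."""
        return [s / self.counts[i] for s in self.sums[i]]

    def add(self, vec):
        """Assign vec to the most similar cluster, or open a new one."""
        best, best_sim = None, self.threshold
        for i in range(len(self.sums)):
            sim = cosine(vec, self.centroid(i))
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No existing cluster is similar enough: start a new cluster.
            self.sums.append(list(vec))
            self.counts.append(1)
            return len(self.sums) - 1
        # Update the running sum so the centroid follows the cluster.
        self.sums[best] = [s + v for s, v in zip(self.sums[best], vec)]
        self.counts[best] += 1
        return best


# Toy stream: two documents near one direction, two near another.
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusterer = OnlineClusterer(threshold=0.8)
labels = [clusterer.add(v) for v in stream]
print(labels)  # [0, 0, 1, 1]
```

Because the vectors live in one shared space, nothing in this loop depends on the language of a document, which is the property the abstract relies on when it drops the monolingual clustering step.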