Paper Title

ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds

Authors

Tommaso Fornaciari, Dirk Hovy, Federico Bianchi

Abstract

The most common ways to explore latent document dimensions are topic models and clustering methods. However, topic models have several drawbacks: for example, they require us to choose the number of latent dimensions a priori, and their results are stochastic. Most clustering methods share these issues and lack flexibility in various ways, such as not accounting for the influence of different topics on a single document, forcing word descriptors to belong to a single topic (hard clustering), or necessarily relying on word representations. We propose PROgressive SImilarity Thresholds (ProSiT), a deterministic and interpretable method, agnostic to the input format, that finds the optimal number of latent dimensions and has only two hyper-parameters, which can be set efficiently via grid search. We compare this method with a wide range of topic models and clustering methods on four benchmark data sets. In most settings, ProSiT matches or outperforms the other methods in terms of six metrics of topic coherence and distinctiveness, producing replicable, deterministic results.
