Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

Paper title

Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

Authors

Mariflor Vega-Carrasco, Jason O'Sullivan, Rosie Prior, Ioanna Manolopoulou, Mirco Musolesi

Abstract

Understanding the shopping motivations behind market baskets has high commercial value in the grocery retail industry. Analysing shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while keeping outcomes interpretable. Latent Dirichlet Allocation (LDA) provides a suitable framework to process grocery transactions and to discover a broad representation of customers' shopping motivations. However, summarising the posterior distribution of an LDA model is challenging, and individual LDA draws may not be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures, which may not adequately capture qualitative aspects such as the interpretability and stability of topics. In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarise the entire posterior distribution and identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, as well as associated measures of uncertainty. Furthermore, we establish a more holistic definition of model evaluation, which assesses topic models based not only on their likelihood but also on their coherence, distinctiveness and stability. By means of a survey, we set thresholds for the interpretation of topic coherence and topic similarity in the domain of grocery retail data. We demonstrate that selecting recurrent topics through our clustering methodology not only improves model likelihood but also improves on plain LDA in qualitative aspects such as interpretability and stability. We illustrate our methods on an example from a large UK supermarket chain.
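
The abstract does not spell out the clustering algorithm, so the following Python sketch is only an illustration of the general idea of grouping topics across posterior LDA draws into recurrent topics. Everything specific in it is an assumption made for the example: the use of cosine similarity, the greedy assignment rule, the thresholds `sim_threshold` and `min_recurrence`, the `spread` uncertainty measure, the function names, and the synthetic data. It should not be read as the authors' method.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two topic-product probability vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_topic_draws(draws, sim_threshold=0.8, min_recurrence=0.5):
    """Greedily cluster topics from several posterior draws (hypothetical helper).

    draws: list of arrays, each of shape (n_topics, n_products), rows sum to 1.
    A cluster is kept as a "recurrent topic" if it appears in at least
    min_recurrence of the draws. Both thresholds are assumed values.
    """
    clusters = []  # each cluster stores its member topics and contributing draws
    for draw_id, topics in enumerate(draws):
        for topic in topics:
            best, best_sim = None, sim_threshold
            for cluster in clusters:
                centroid = np.mean(cluster["members"], axis=0)
                sim = cosine_similarity(topic, centroid)
                if sim >= best_sim:
                    best, best_sim = cluster, sim
            if best is None:
                clusters.append({"members": [topic], "draw_ids": {draw_id}})
            else:
                best["members"].append(topic)
                best["draw_ids"].add(draw_id)

    summary = []
    for cluster in clusters:
        recurrence = len(cluster["draw_ids"]) / len(draws)
        if recurrence >= min_recurrence:
            members = np.array(cluster["members"])
            summary.append({
                "topic": members.mean(axis=0),          # posterior summary topic
                "recurrence": recurrence,               # share of draws containing it
                "spread": members.std(axis=0).mean(),   # crude uncertainty measure
            })
    return summary

# Synthetic stand-in for posterior draws: two underlying topics over six products.
base = np.array([[0.40, 0.40, 0.10, 0.05, 0.03, 0.02],
                 [0.02, 0.03, 0.05, 0.10, 0.40, 0.40]])
rng = np.random.default_rng(0)
draws = []
for i in range(10):
    noisy = base + rng.uniform(0, 0.02, size=base.shape)
    noisy /= noisy.sum(axis=1, keepdims=True)
    # Reverse the topic order in every other draw to mimic label switching.
    draws.append(noisy[::-1] if i % 2 else noisy)
# Add one incoherent (uniform) topic that appears in a single draw only.
draws[0] = np.vstack([draws[0], np.full(6, 1 / 6)])

summary = cluster_topic_draws(draws)
for item in summary:
    print(np.round(item["topic"], 3), item["recurrence"], round(item["spread"], 4))
```

Because topics are matched by similarity rather than by index, the order in which a draw lists its topics does not matter, and a topic that shows up in only one draw falls below the recurrence threshold and is left out of the summary.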
