Paper title
Sampling in Dirichlet Process Mixture Models for Clustering Streaming Data
Paper authors
Abstract
Practical tools for clustering streaming data must be fast enough to handle the arrival rate of the observations. Typically, they also must adapt on the fly to possible lack of stationarity; i.e., the data statistics may be time-dependent due to various forms of drifts, changes in the number of clusters, etc. The Dirichlet Process Mixture Model (DPMM), whose Bayesian nonparametric nature allows it to adapt its complexity to the data, seems a natural choice for the streaming-data case. In its classical formulation, however, the DPMM cannot capture common types of drifts in the data statistics. Moreover, and regardless of that limitation, existing methods for online DPMM inference are too slow to handle rapid data streams. In this work we propose adapting both the DPMM and a known DPMM sampling-based non-streaming inference method for streaming-data clustering. We demonstrate the utility of the proposed method on several challenging settings, where it obtains state-of-the-art results while being on par with other methods in terms of speed.
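To make concrete the statement that the DPMM adapts its complexity to the data, here is a minimal sketch of the Chinese restaurant process, the standard prior over partitions induced by a Dirichlet process. It is not the inference method proposed in the paper; it only illustrates how the number of clusters is left open and grows as more observations arrive. The function name `crp_assignments` and the concentration parameter `alpha` (assumed to be positive) are chosen here for illustration.

```python
import random


def crp_assignments(n_points, alpha, seed=0):
    """Draw cluster labels for n_points from a Chinese restaurant process.

    alpha is the concentration parameter (assumed > 0): larger values
    make opening a new cluster more likely.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of points currently in cluster k
    labels = []
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    for n in (10, 100, 1000):
        _, counts = crp_assignments(n, alpha=1.0)
        print(n, "points ->", len(counts), "clusters")
```

Under this prior the expected number of clusters grows roughly logarithmically with the number of observations, so the model never commits to a fixed cluster count in advance.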