论文标题
基数估计的不可网格素描
Non-Mergeable Sketching for Cardinality Estimation
论文作者
论文摘要
基数估计可能是可以通过素描解决的最简单的非平凡统计问题。超置log,Minhash和PCSA等工业部署的草图是可合并的,这意味着可以在分布式环境中勾勒出大型数据集,然后合并为整个数据集的单个草图。在过去的十年中,已经开发了各种草图,这些草图是不可网格的,但由于其他原因而有吸引力。它们更简单,其基数估计严格公正,并且差异大大降低。 我们根据其内存变化产品(MVP)在相当级别的竞争环境中评估素描方案。例如,一个占$ 500万$位的草图,其相对差异为$ 2/m $(标准错误$ \ sqrt {2/m} $)的MVP为$ 10 $。我们的贡献如下。 科恩(Cohen)和特(Ting)独立地发现了我们所说的martingale变换,以将合并的草图转换为不可粘的草图。我们提出了一种更简单的方法来分析Martingale型草图的限制MVP。 我们证明,\ martingale {}变换在不可网络的世界中是最佳的,并且\ martingale {} \ fishmonger {}尤其是可线化的草图中最佳的,MVP的MVP为$ h_0/2 \ y 1.63 $。例如,这是有间接的证据表明,要达到1 \%的标准误差,我们不能比2千字节草图做得更好。 \ martingale {} \ fishmonger {}既不简单也不实用。我们开发了一个名为\ curtain {}的新的合并草图,它在简单性和效率之间达到了一个很好的平衡,并证明\ martingale {} \ curtain {}具有限制$ \ mvp \ mvp \ of 2.31 $。它可以使用$ o(1)$访问来更新,并且比\ martingale {} \ loglog的经验差异较低,这是一种实用的不易加版Hyperloglog版本。
Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance. We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies $5m$ bits and whose relative variance is $2/m$ (standard error $\sqrt{2/m}$) has an MVP of $10$. Our contributions are as follows. Cohen and Ting independently discovered what we call the Martingale transform for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches. We prove that the \Martingale{} transform is optimal in the non-mergeable world, and that \Martingale{} \fishmonger{} in particular is optimal among linearizable sketches, with an MVP of $H_0/2 \approx 1.63$. E.g., this is circumstantial evidence that to achieve 1\% standard error, we cannot do better than a 2 kilobyte sketch. \Martingale{} \fishmonger{} is neither simple nor practical. We develop a new mergeable sketch called \Curtain{} that strikes a nice balance between simplicity and efficiency, and prove that \Martingale{} \Curtain{} has limiting $\MVP\approx 2.31$. It can be updated with $O(1)$ memory accesses and it has lower empirical variance than \Martingale{} \LogLog, a practical non-mergeable version of HyperLogLog.