基数估计的不可网格素描

论文标题

基数估计的不可网格素描

Non-Mergeable Sketching for Cardinality Estimation

论文作者

Pettie, Seth, Wang, Dingyu, Yin, Longhui

论文摘要

基数估计可能是可以通过素描解决的最简单的非平凡统计问题。超置log，Minhash和PCSA等工业部署的草图是可合并的，这意味着可以在分布式环境中勾勒出大型数据集，然后合并为整个数据集的单个草图。在过去的十年中，已经开发了各种草图，这些草图是不可网格的，但由于其他原因而有吸引力。它们更简单，其基数估计严格公正，并且差异大大降低。我们根据其内存变化产品（MVP）在相当级别的竞争环境中评估素描方案。例如，一个占$ 500万$位的草图，其相对差异为$ 2/m $（标准错误$ \ sqrt {2/m} $）的MVP为$ 10 $。我们的贡献如下。科恩（Cohen）和特（Ting）独立地发现了我们所说的martingale变换，以将合并的草图转换为不可粘的草图。我们提出了一种更简单的方法来分析Martingale型草图的限制MVP。我们证明，\ martingale {}变换在不可网络的世界中是最佳的，并且\ martingale {} \ fishmonger {}尤其是可线化的草图中最佳的，MVP的MVP为$ h_0/2 \ y 1.63 $。例如，这是有间接的证据表明，要达到1 \％的标准误差，我们不能比2千字节草图做得更好。 \ martingale {} \ fishmonger {}既不简单也不实用。我们开发了一个名为\ curtain {}的新的合并草图，它在简单性和效率之间达到了一个很好的平衡，并证明\ martingale {} \ curtain {}具有限制$ \ mvp \ mvp \ of 2.31 $。它可以使用$ o（1）$访问来更新，并且比\ martingale {} \ loglog的经验差异较低，这是一种实用的不易加版Hyperloglog版本。

Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance. We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies $5m$ bits and whose relative variance is $2/m$ (standard error $\sqrt{2/m}$) has an MVP of $10$. Our contributions are as follows. Cohen and Ting independently discovered what we call the Martingale transform for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches. We prove that the \Martingale{} transform is optimal in the non-mergeable world, and that \Martingale{} \fishmonger{} in particular is optimal among linearizable sketches, with an MVP of $H_0/2 \approx 1.63$. E.g., this is circumstantial evidence that to achieve 1\% standard error, we cannot do better than a 2 kilobyte sketch. \Martingale{} \fishmonger{} is neither simple nor practical. We develop a new mergeable sketch called \Curtain{} that strikes a nice balance between simplicity and efficiency, and prove that \Martingale{} \Curtain{} has limiting $\MVP\approx 2.31$. It can be updated with $O(1)$ memory accesses and it has lower empirical variance than \Martingale{} \LogLog, a practical non-mergeable version of HyperLogLog.

下载PDF全文

下载文献需遵守相关版权规定

论文标题