论文标题

朝着寻找准标识符的更好范围

Towards Better Bounds for Finding Quasi-Identifiers

论文作者

Hildebrant, Ryan, Le, Quoc-Tung, Ta, Duy-Hoang, Vu, Hoa T.

论文摘要

我们重新审视了Motwani和Xu(2008)引入的小$ε$分离键的问题。在此问题中,输入为$ m $ -DimeNional元组$ x_1,x_2,\ ldots,x_n $。目标是找到一小部分坐标子集,该坐标将至少分开$(1-ε){n \选择2} $的元组。他们提供了一种快速算法,该算法以$θ(m/ε)$元组的方式运行。我们表明,样本量可以提高到$θ(m/\sqrtε)$。我们的算法也享有更快的运行时间。为了获得此结果,我们在样本量上提供上限和下限,以解决以下决策问题。给定一部分坐标$ a $,如果$ a $分开少于$(1-ε){n \ select 2} $ pairs,如果$ a $分开所有对,则拒绝。对于所有$ a $,该算法必须具有至少$ 1-δ$的概率。我们证明了基于采样的算法: - $θ(m/\sqrtε)$样品是足够且必要的,因此$δ\ leq e^{ - m} $和 - $ω(\ sqrt {\ frac {\ log m}ε})$样本是必要的,因此$δ$是常数。 我们的分析是基于球界问题的受约束版本。我们认为我们的分析可能具有独立的兴趣。我们还研究了一个相关问题,要求以下草图算法:对于给定参数$α,k $和$ε$,该算法最多可以返回$ k $的坐标$ a $ a $ a $ a $ a $ a $ a $ a $ a $ a $ a $ a $ a $ k $的估计值,估计$ a $ a $ a $ a $ a $ a $ a $(1 \ pmp pmpmpmpm的$)$ fection $ fection $ fection $ cops $ cops $ cops $ cops $ festive。我们表明,即使对于常量的$α$和成功概率,这种素描算法也必须使用$ω(Mk \logε^{ - 1})$ space of Space;另一方面,为此目的,均匀采样产生了大小$θ(\ frac {mk \ log m} {αε^2})$的草图。

We revisit the problem of finding small $ε$-separation keys introduced by Motwani and Xu (2008). In this problem, the input is $m$-dimensional tuples $x_1,x_2,\ldots,x_n $. The goal is to find a small subset of coordinates that separates at least $(1-ε){n \choose 2}$ pairs of tuples. They provided a fast algorithm that runs on $Θ(m/ε)$ tuples sampled uniformly at random. We show that the sample size can be improved to $Θ(m/\sqrtε)$. Our algorithm also enjoys a faster running time. To obtain this result, we provide upper and lower bounds on the sample size to solve the following decision problem. Given a subset of coordinates $A$, reject if $A$ separates fewer than $(1-ε){n \choose 2}$ pairs, and accept if $A$ separates all pairs. The algorithm must be correct with probability at least $1-δ$ for all $A$. We show that for algorithms based on sampling: - $Θ(m/\sqrtε)$ samples are sufficient and necessary so that $δ\leq e^{-m}$ and - $Ω(\sqrt{\frac{\log m}ε})$ samples are necessary so that $δ$ is a constant. Our analysis is based on a constrained version of the balls-into-bins problem. We believe our analysis may be of independent interest. We also study a related problem that asks for the following sketching algorithm: with given parameters $α,k$ and $ε$, the algorithm takes a subset of coordinates $A$ of size at most $k$ and returns an estimate of the number of unseparated pairs in $A$ up to a $(1\pmε)$ factor if it is at least $α{n \choose 2}$. We show that even for constant $α$ and success probability, such a sketching algorithm must use $Ω(mk \log ε^{-1})$ bits of space; on the other hand, uniform sampling yields a sketch of size $Θ(\frac{mk \log m}{αε^2})$ for this purpose.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源