Paper Title


On sampling from data with duplicate records

Authors

Alireza Heidari, Shrinu Kushagra, Ihab F. Ilyas

Abstract


Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this by a two-stage process. In the first step, we estimate the frequencies of all the entities in the database. In the second step, we use rejection sampling to obtain an (approximately) uniform sample from the set of entities. However, efficiently estimating the frequency of all the entities is a non-trivial task and is not attainable in the general case. Hence, we consider various natural properties of the data under which such frequency estimation (and consequently uniform sampling) is possible. Under each of those assumptions, we provide sampling algorithms and give proofs of the complexity (both statistical and computational) of our approach. We complement our study by conducting extensive experiments on both real and synthetic datasets.
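A minimal sketch of the second stage described in the abstract, assuming the first stage has already produced exact frequency estimates (the function and variable names are illustrative, not the authors' implementation). A uniform draw over records over-represents entities with many duplicates; accepting a drawn record with probability inversely proportional to its estimated frequency cancels that bias, so each distinct entity is returned with equal probability per round.

```python
import random
from collections import Counter

def uniform_entity_sample(records, freq_estimate, rng):
    """Rejection sampling: draw one entity (approximately) uniformly from
    the distinct entities in `records`, given estimated duplicate counts."""
    while True:
        # A uniform draw over records is biased toward high-frequency entities:
        # P(entity e) = freq(e) / len(records).
        r = rng.choice(records)
        # Accepting with probability 1/freq(e) cancels that bias, so every
        # entity is accepted with probability 1/len(records) per round.
        if rng.random() < 1.0 / freq_estimate[r]:
            return r

# Toy database: entity "a" appears 5 times, entity "b" once.
records = ["a"] * 5 + ["b"]
freqs = {"a": 5, "b": 1}
rng = random.Random(0)
draws = Counter(uniform_entity_sample(records, freqs, rng) for _ in range(10000))
# Despite "a" having 5x the records, both entities appear in roughly
# half of the 10,000 draws.
```

When the frequency estimates are only approximate (as in the paper's general setting), the same scheme yields an approximately uniform sample, with the deviation from uniformity controlled by the estimation error.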
