论文标题
对于非常大的数据库的可扩展阻塞
Scalable Blocking for Very Large Databases
论文作者
论文摘要
在数据库重复数据库的字段中,目标是在数据库中查找近似匹配的记录。阻止是此过程中的一个典型阶段,涉及廉价地找到候选记录对,这是进一步处理的潜在匹配。我们在这里提出了动态阻塞,这是一种新的阻塞方法,旨在解决比大多数先前工作中研究的数据集更大的数据集。哈希动态阻滞(HDB)扩展了动态阻塞,这利用了稀有匹配值和稀有值相交的见解可以预测匹配关系。我们还提出了对局部敏感散列(LSH)的新颖使用,以建立具有方便配置的巨大数据库的封闭关键值,以控制精度和召回之间的权衡。 HDB通过使用紧凑型块表示,并使用计数--Min草图近似计数数据结构来最大程度地减少数据移动,使用紧凑的块表示,从而实现大规模量表。我们通过关注超过100万行的现实世界数据集来基准算法,这表明该算法在此范围内显示线性时间复杂性缩放。此外,我们在一个5.3亿行工业数据集上执行HDB,在不到三个小时内检测到680亿候选人对,在主要的云服务中成本为307美元。
In the field of database deduplication, the goal is to find approximately matching records within a database. Blocking is a typical stage in this process that involves cheaply finding candidate pairs of records that are potential matches for further processing. We present here Hashed Dynamic Blocking, a new approach to blocking designed to address datasets larger than those studied in most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which leverages the insight that rare matching values and rare intersections of values are predictive of a matching relationship. We also present a novel use of Locality Sensitive Hashing (LSH) to build blocking key values for huge databases with a convenient configuration to control the trade-off between precision and recall. HDB achieves massive scale by minimizing data movement, using compact block representation, and greedily pruning ineffective candidate blocks using a Count-min Sketch approximate counting data structure. We benchmark the algorithm by focusing on real-world datasets in excess of one million rows, demonstrating that the algorithm displays linear time complexity scaling in this range. Furthermore, we execute HDB on a 530 million row industrial dataset, detecting 68 billion candidate pairs in less than three hours at a cost of $307 on a major cloud service.