论文标题
在没有完整加入计算的情况下发现多桌功能依赖
Discovering Multi-Table Functional Dependencies Without Full Join Computation
论文作者
论文摘要
在本文中,我们研究了发现加入FD的问题,即在多个连接表上持有的功能依赖性(FD)。我们利用逻辑推理,选择性挖掘和取样,并表明我们可以从参与加入的单个表中发现大多数确切的联接FD,并避免对连接结果的完整计算。我们提出算法以加速联接FD发现过程,并仅通过必要的数据分区即时地挖掘FD。我们介绍了Jedi(加入依赖性发现),这是我们的解决方案,以发现加入FD,而无需预先计算完整的联接。我们对一系列现实世界和合成数据的实验证明了我们方法比现有的FD发现方法的好处,这些方法需要在发现FD之前先预先计算联接结果。我们表明,性能取决于连接属性值的红衣主教和覆盖范围:对于覆盖范围低的联接操作,JEDI具有选择性采矿的胜地,使用了一个直接的加入计算方法,即在运行时使用一个直接加入计算的直接方法,并且可以在运行时进行一个数量级,并且可以在均衡的一半执行时间内的逻辑上的精确连接方面,以四分之一的态度发现精确的联接FDS。对于更高的联接覆盖范围,带有采样的绝地武士达到1的精度,平均只有63%的表输入大小。
In this paper, we study the problem of discovering join FDs, i.e., functional dependencies (FDs) that hold on multiple joined tables. We leverage logical inference, selective mining, and sampling and show that we can discover most of the exact join FDs from the single tables participating to the join and avoid the full computation of the join result. We propose algorithms to speed-up the join FD discovery process and mine FDs on the fly only from necessary data partitions. We introduce JEDI (Join dEpendency DIscovery), our solution to discover join FDs without computation of the full join beforehand. Our experiments on a range of real-world and synthetic data demonstrate the benefits of our method over existing FD discovery methods that need to precompute the join results before discovering the FDs. We show that the performance depends on the cardinalities and coverage of the join attribute values: for join operations with low coverage, JEDI with selective mining outperforms the competing methods using the straightforward approach of full join computation by one order of magnitude in terms of runtime and can discover three-quarters of the exact join FDs using mainly logical inference in half of its total execution time on average. For higher join coverage, JEDI with sampling reaches precision of 1 with only 63% of the table input size on average.