Paper title

Consistent data fusion with Parker

Authors

Antoon Bronselaer, Maribel Acosta

Abstract

When combining data from multiple sources, inconsistent data complicates the production of a coherent result. In this paper, we introduce a new type of constraint called edit rules under a partial key (EPKs). These constraints can model inconsistencies both within and between sources, but in a loosely coupled manner. We show that the well-known set cover methodology can be adapted to the setting of EPKs, which yields an efficient algorithm for finding minimal-cost repairs of sources. This algorithm is implemented in a repair engine called Parker. Empirical results show that Parker is several orders of magnitude faster than state-of-the-art repair tools. At the same time, the quality of its repairs, measured by $F_1$-score, ranges from comparable to better than that of these tools.
