Paper Title
Bridging the Gap between Reality and Ideality of Entity Matching: A Revisiting and Benchmark Re-Construction
Paper Authors
Paper Abstract
Entity matching (EM) is the most critical step in entity resolution (ER). While current deep-learning-based methods achieve impressive performance on standard EM benchmarks, their performance in real-world applications is disappointing. In this paper, we highlight that this gap between reality and ideality stems from an unreasonable benchmark construction process, which is inconsistent with the nature of entity matching and therefore leads to biased evaluations of current EM approaches. To this end, we build a new EM corpus and re-construct EM benchmarks to challenge critical assumptions implicit in the previous benchmark construction process, changing step by step the restricted entities, balanced labels, and single-modal records of previous benchmarks into the open entities, imbalanced labels, and multimodal records of an open environment. Experimental results demonstrate that the assumptions made in the previous benchmark construction process do not hold in the open environment; they conceal the main challenges of the task and therefore lead to a significant overestimation of the current progress of entity matching. The constructed benchmarks and code are publicly released.