Paper Title

Leveraging Schema Labels to Enhance Dataset Search

Authors

Zhiyu Chen, Haiyan Jia, Jeff Heflin, Brian D. Davison

Abstract

A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match the description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which considers not only the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval tasks as well.
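The mixed ranking idea in the abstract can be illustrated with a small sketch. This is not the authors' model: the abstract does not specify the scoring functions, so bag-of-words cosine similarity stands in for both the query-to-metadata relevance and the query-to-schema-label similarity, and the interpolation weight `alpha`, the function names, and the toy datasets are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mixed_score(query, description, schema_labels, alpha=0.5):
    """Interpolate metadata relevance and schema-label similarity.

    alpha is a hypothetical weight: 1.0 uses the description only,
    0.0 uses the (generated) schema labels only.
    """
    q = tokenize(query)
    meta_sim = cosine(q, tokenize(description))
    label_sim = cosine(q, tokenize(" ".join(schema_labels)))
    return alpha * meta_sim + (1 - alpha) * label_sim


def rank(query, datasets, alpha=0.5):
    """Return datasets sorted by descending mixed score."""
    return sorted(
        datasets,
        key=lambda d: mixed_score(query, d["description"], d["labels"], alpha),
        reverse=True,
    )


# Toy data: neither description mentions the query terms,
# but one dataset's schema labels do.
datasets = [
    {"name": "budget", "description": "city budget report",
     "labels": ["department", "amount"]},
    {"name": "payroll", "description": "annual open data release",
     "labels": ["salary", "employee", "title"]},
]

for d in rank("employee salary", datasets):
    print(d["name"])
```

In this toy setup the query shares no terms with either description, so a description-only ranker cannot separate the two datasets; the schema-label term is what distinguishes them, which is the situation the abstract motivates.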
