Paper Title

Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search -- Extended Technical Report

Authors

Till Blume, Ansgar Scherp

Abstract

Indexing the Web of Data offers many opportunities, in particular, to find and explore data sources. One major design decision when indexing the Web of Data is to find a suitable index model, i.e., how to index and summarize data. Various efforts have been conducted to develop specific index models for a given task. With each index model designed, implemented, and evaluated independently, it remains difficult to judge whether an approach generalizes well to another task, set of queries, or dataset. In this work, we empirically evaluate six representative index models with unique feature combinations. Among them is a new index model incorporating inferencing over RDFS and owl:sameAs. We implement all index models for the first time into a single, stream-based framework. We evaluate variations of the index models considering sub-graphs of size 0, 1, and 2 hops on two large, real-world datasets. We evaluate the quality of the indices regarding the compression ratio, summarization ratio, and F1-score denoting the approximation quality of the stream-based index computation. The experiments reveal huge variations in compression ratio, summarization ratio, and approximation quality for different index models, queries, and datasets. However, we observe meaningful correlations in the results that help to determine the right index model for a given task, type of query, and dataset.
