Paper Title
Relaxing Relationship Queries on Graph Data
Authors
Abstract
In many domains we have witnessed the need to search a large entity-relation graph for direct and indirect relationships between a set of entities specified in a query. A search result, called a semantic association (SA), is typically a compact (e.g., diameter-constrained) connected subgraph containing all the query entities. Efficient algorithms exist for SA search, but they return empty results if some query entities are distant from each other in the graph. To reduce the occurrence of failing queries and provide alternative results, we study the problem of query relaxation in the context of SA search. Simply relaxing the compactness constraint sacrifices the compactness of an SA and, more importantly, may cause performance issues and be impractical. Instead, we focus on removing the smallest number of entities from the original failing query to form a maximum successful sub-query, which minimizes the loss of result quality caused by relaxation. We prove that verifying the success of a sub-query reduces to finding an entity (called a certificate) that satisfies a distance-based condition with respect to the query entities. To efficiently find a certificate for a maximum successful sub-query, we propose a best-first search algorithm that leverages distance-based estimation to effectively prune the search space. We further improve its performance with two fine-grained heuristics: one based on degree and the other on distance. Extensive experiments on popular RDF datasets demonstrate the efficiency of our algorithm, which is more scalable than the baselines.
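To illustrate the certificate idea from the abstract, the sketch below checks whether a sub-query succeeds by searching for an entity within a given number of hops of every query entity. This is a minimal, hypothetical rendering: the paper's exact distance-based condition, the adjacency representation, and the bound `bound` are assumptions, and the paper's actual algorithm uses best-first search with distance-based pruning rather than this exhaustive scan.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances (hop counts) from source via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def find_certificate(adj, query_entities, bound):
    """Return an entity within `bound` hops of every query entity, or None.

    Hypothetical condition for illustration only; the paper defines its own
    distance-based condition and prunes the search rather than scanning.
    """
    dist_maps = [bfs_distances(adj, q) for q in query_entities]
    for v in adj:
        if all(v in d and d[v] <= bound for d in dist_maps):
            return v  # v certifies that the sub-query can succeed
    return None  # no certificate: the sub-query fails under this bound

# Tiny example graph: a path a - b - c - d
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
print(find_certificate(adj, ['a', 'd'], 2))  # 'b' (1 hop from a, 2 from d)
print(find_certificate(adj, ['a', 'd'], 1))  # None: no entity is that close to both
```

A tighter bound models a stricter compactness constraint: as the example shows, the same query entities yield a certificate under bound 2 but none under bound 1, which is exactly the kind of failure that motivates relaxing the query.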