Paper title

Graph-based keyword search in heterogeneous data sources

Authors

Haddad, Mhd Yamen; Anadiotis, Angelos; Manolescu, Ioana

Abstract

Data journalism is the field of investigative journalism that focuses on digital data, treating them as first-class citizens. As human activity leaves ever stronger digital traces, data journalism becomes increasingly important. However, as the number and diversity of data sources grow, heterogeneous data models with different structures, or even no structure at all, need to be considered in query answering. Inspired by our collaboration with Le Monde, a leading French newspaper, we designed a novel query algorithm for exploiting such heterogeneous corpora through keyword search. We model the underlying data as graphs and, given a set of search terms, our algorithm finds links between them within and across the heterogeneous datasets included in the graph. We draw inspiration from prior work on keyword search in structured and unstructured data, and extend it with the data heterogeneity dimension, which makes the keyword search problem computationally harder. We implement our algorithm and evaluate its performance using synthetic and real-world datasets.
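To make the idea of "finding links between search terms in a graph" concrete, here is a minimal sketch. It is not the algorithm proposed in the paper: it is a simple breadth-first heuristic that picks a meeting node close to one match of every keyword and returns the edges of the paths leading to it. The toy graph, the node labels, and the function names `bfs_from` and `connect_keywords` are all assumptions made for illustration.

```python
from collections import deque


def bfs_from(graph, sources):
    """Multi-source BFS; returns {node: (distance, parent)}."""
    info = {s: (0, None) for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        dist = info[node][0]
        for nxt in graph.get(node, ()):
            if nxt not in info:
                info[nxt] = (dist + 1, node)
                queue.append(nxt)
    return info


def connect_keywords(graph, labels, keywords):
    """Return (root, edges) linking one match per keyword, or None."""
    searches = []
    for kw in keywords:
        # A node matches when its label contains the keyword (case-insensitive).
        matches = sorted(n for n, text in labels.items()
                         if kw.lower() in text.lower())
        if not matches:
            return None
        searches.append(bfs_from(graph, matches))

    # Candidate meeting nodes are those reachable from every keyword.
    common = set(searches[0])
    for info in searches[1:]:
        common &= set(info)
    if not common:
        return None

    # Pick the node with the smallest total distance to all keywords.
    root = min(sorted(common),
               key=lambda n: sum(info[n][0] for info in searches))

    # Walk parent pointers back to a matching node for each keyword.
    edges = set()
    for info in searches:
        node = root
        while info[node][1] is not None:
            parent = info[node][1]
            edges.add(frozenset((node, parent)))
            node = parent
    return root, edges


# Hypothetical toy graph mixing a table row ("row1") and a text
# document ("doc1") that both mention the same person ("alice").
graph = {
    "row1": ["alice", "acme"],
    "alice": ["row1", "doc1"],
    "acme": ["row1"],
    "doc1": ["alice", "paris"],
    "paris": ["doc1"],
}
labels = {
    "row1": "employment record",
    "alice": "Alice Martin",
    "acme": "ACME Corp",
    "doc1": "news article",
    "paris": "Paris",
}

result = connect_keywords(graph, labels, ["acme", "paris"])
```

In this toy setting the two keywords match nodes that originate in different datasets, and the returned edges form a path that crosses from the table row to the document through the shared "alice" node. A real system would also need to rank alternative answers and bound the search, which this sketch does not attempt.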
