Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Paper Title

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Authors

Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, Renée Miller

Abstract

Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical evaluation results on real table benchmark datasets show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
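
The column-level score described in the abstract, cosine similarity between column embedding vectors, can be illustrated with a minimal sketch. The table-level aggregation below (averaging each query column's best match among the candidate's columns) is an assumption chosen for illustration only; the abstract says the filter-and-verification framework admits several design choices and does not specify one. The function names `cosine_similarity` and `table_unionability` and the toy two-dimensional embeddings are hypothetical, not part of Starmie.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def table_unionability(query_cols, candidate_cols):
    """Illustrative table-level score (an assumption, not Starmie's method):
    for each query column take its best-matching candidate column,
    then average those best scores."""
    if not query_cols or not candidate_cols:
        return 0.0
    best = [
        max(cosine_similarity(q, c) for c in candidate_cols)
        for q in query_cols
    ]
    return sum(best) / len(best)


# Toy column embeddings; a real system would obtain these from a column encoder.
query = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[2.0, 0.0], [1.0, 1.0]]
score = table_unionability(query, candidate)
print(round(score, 4))
```

In this toy case the first query column matches a candidate column exactly in direction, while the second only partially matches, so the averaged score falls between the two. A linear scan would compute this score against every table in the lake; the abstract's HNSW index is what replaces that scan with approximate nearest-neighbor lookup over column embeddings.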
