Coresets for Relational Data and The Applications

Paper title

Coresets for Relational Data and The Applications

Authors

Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, Hu Ding

Abstract

A coreset is a small set that approximately preserves the structure of the original input data set. We can therefore run an algorithm on the coreset instead of the full data and reduce the total computational complexity. Conventional coreset techniques assume that the input data set is available to process explicitly. However, this assumption may not hold in real-world scenarios. In this paper, we consider the problem of coreset construction over relational data. Namely, the data is decoupled into several relational tables, and it could be very expensive to directly materialize the data matrix by joining the tables. We propose a novel approach called "aggregation tree with pseudo-cube" that builds a coreset from the bottom up. Moreover, our approach can neatly circumvent several troublesome issues of relational learning problems [Khamis et al., PODS 2019]. Under some mild assumptions, we show that our coreset approach can be applied to machine learning tasks such as clustering, logistic regression, and SVM.
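To make the general idea of a coreset concrete, the following is a minimal sketch of a generic importance-sampling coreset for the sum-of-squared-distances (clustering) cost. It is not the paper's "aggregation tree with pseudo-cube" method: it assumes the data points are explicitly materialized in memory, which is exactly the assumption the paper avoids. The function names (`sampling_coreset`, `cost`), the sample size, and the synthetic Gaussian data are illustrative assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sampling_coreset(points, m, rng):
    """Draw m weighted points by importance sampling (illustrative only).

    Each point is sampled with a probability that mixes a uniform term
    with a term proportional to its squared distance from the data mean.
    The weight 1 / (m * probability) keeps the weighted cost an unbiased
    estimate of the cost on the full data set.
    """
    n = len(points)
    dim = len(points[0])
    mean = tuple(sum(p[j] for p in points) / n for j in range(dim))
    dists = [sq_dist(p, mean) for p in points]
    total = sum(dists)  # assumed nonzero, i.e. the points are not all identical
    probs = [0.5 / n + 0.5 * d / total for d in dists]
    idx = rng.choices(range(n), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]


def cost(weighted_points, center):
    """Weighted sum of squared distances from the points to one center."""
    return sum(w * sq_dist(p, center) for p, w in weighted_points)


rng = random.Random(0)
# Synthetic 2-D data standing in for an explicitly materialized data matrix.
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
coreset = sampling_coreset(points, 500, rng)

center = (1.0, -0.5)  # an arbitrary query center
full_cost = cost([(p, 1.0) for p in points], center)
coreset_cost = cost(coreset, center)
rel_error = abs(coreset_cost - full_cost) / full_cost
print(f"full={full_cost:.1f} coreset={coreset_cost:.1f} rel_error={rel_error:.3f}")
```

The weighted cost on the 500 sampled points approximates the cost on all 2000 points for the query center, so an algorithm can evaluate candidate centers on the small set. In the relational setting the abstract describes, the `points` list would be the result of joining the tables, which is the step that may be too expensive to perform directly.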
