Consistent and Flexible Selectivity Estimation for High-Dimensional Data

Paper Title

Consistent and Flexible Selectivity Estimation for High-Dimensional Data

Authors

Wang, Yaoshu; Xiao, Chuan; Qin, Jianbin; Mao, Rui; Onizuka, Makoto; Wang, Wei; Zhang, Rui; Ishikawa, Yoshiharu

Abstract

Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection, query optimization, and data integration. The estimation problem is especially challenging for large-scale high-dimensional data due to the curse of dimensionality, the large variance of selectivity across different queries, and the need to make the estimator consistent (i.e., the selectivity is non-decreasing in the threshold). We propose a new deep learning-based model that learns a query-dependent piecewise linear function as the selectivity estimator. This function is flexible enough to fit the selectivity curve of any distance function and query object, while guaranteeing that the output is non-decreasing in the threshold. To improve the accuracy for large datasets, we propose to partition the dataset into multiple disjoint subsets and build a local model on each of them. We perform experiments on real datasets and show that the proposed model consistently outperforms state-of-the-art models in accuracy while remaining efficient, and that it is useful for real applications.
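The consistency requirement in the abstract (an estimate that never decreases as the threshold grows) can be illustrated with a small sketch. This is not the authors' model: the function names, the equally spaced control points, and the use of softplus to force non-negative increments are all assumptions made for illustration, and the raw increments are plain inputs here rather than the output of a learned, query-dependent network.

```python
import numpy as np

def softplus(x):
    # Smooth map from any real number to a positive one.
    return np.logaddexp(0.0, x)

def monotone_piecewise_linear(raw_increments, t_max):
    """Build a non-decreasing piecewise linear function on [0, t_max].

    raw_increments: unconstrained values, one per linear segment. In a
    learned model these would be produced per query by a network; here
    they are plain inputs (an assumption of this sketch).
    """
    # Positive increments make the cumulative sum non-decreasing.
    increments = softplus(np.asarray(raw_increments, dtype=float))
    # Equally spaced control points (hypothetical choice); the estimate
    # at threshold 0 is fixed to 0.
    knots = np.linspace(0.0, t_max, len(increments) + 1)
    values = np.concatenate(([0.0], np.cumsum(increments)))

    def estimate(threshold):
        # Linear interpolation between control points; thresholds outside
        # [0, t_max] are clamped to the end values by np.interp.
        return np.interp(threshold, knots, values)

    return estimate

# Usage: even with negative raw inputs, the curve never decreases.
estimate = monotone_piecewise_linear([-1.0, 0.5, 2.0, -3.0], t_max=1.0)
thresholds = np.linspace(0.0, 1.0, 11)
estimates = estimate(thresholds)
```

Because each segment's slope comes from a positive increment, monotonicity holds by construction rather than being something the training procedure has to learn.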
