论文标题

关于机器学习模型有效的近似查询

On Efficient Approximate Queries over Machine Learning Models

论文作者

Ding, Dujian, Amer-Yahia, Sihem, Lakshmanan, Laks VS

论文摘要

回答有关ML预测的查询的问题在数据库社区中引起了人们的关注。这个问题是具有挑战性的,因为找到高质量答案的成本与呼唤甲骨文(例如人类专家)或昂贵的深度神经网络模型在DB中的每个项目中,然后应用查询。我们开发了一个新颖的统一框架,用于通过利用代理来最大程度地减少在Precision-target(PT)和Recce-Target(RT)查询中找到高质量答案的方法来最大程度地减少Oracle的用法。我们的框架使用明智的组合,可以在数据样本上调用昂贵的Oracle并在DB中的对象上应用廉价代理。它依赖两个假设。在代理质量假设下,可以以概率的方式量化代理质量W.R.T.甲骨文。这使我们能够开发两种算法:PQA,可以有效地找到具有高概率和无甲骨文调用的高质量答案,而PQE是一种启发式扩展,可以通过少量的Oracle调用实现经验上的良好性能。另外,在核心集封闭假设下,我们开发了两种算法:CSC,该算法可以有效地返回高质量的答案,并使用高概率和最小的甲骨文使用,以及将其扩展到更一般的设置。我们在两种查询类型的五个现实世界数据集上进行了广泛的实验,这表明我们的算法在最先进的情况下优于最先进,并具有可证明的统计保证。

The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate query answering by leveraging a proxy to minimize the oracle usage of finding high quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework uses a judicious combination of invoking the expensive oracle on data samples and applying the cheap proxy on the objects in the DB. It relies on two assumptions. Under the Proxy Quality assumption, proxy quality can be quantified in a probabilistic manner w.r.t. the oracle. This allows us to develop two algorithms: PQA that efficiently finds high quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC that efficiently returns high quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets on both query types, PT and RT, demonstrate that our algorithms outperform the state-of-the-art and achieve high result quality with provable statistical guarantees.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源