Paper Title
The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures
Paper Authors
Paper Abstract
The concept of learned index structures relies on the idea that the input-output functionality of a database index can be viewed as a prediction task and, thus, be implemented using a machine learning model instead of traditional algorithmic techniques. This novel angle on a decades-old problem has inspired numerous exciting results at the intersection of machine learning and data structures. However, the main advantage of learned index structures, i.e., the ability to adjust to the data at hand via the underlying ML model, can become a disadvantage from a security perspective, as it can be exploited. In this work, we present the first study of poisoning attacks on learned index structures. The required poisoning approach differs from all previous work because the model under attack is trained on a cumulative distribution function (CDF) and, thus, every injection into the training set has a cascading impact on multiple data values. We formulate the first poisoning attacks on linear regression models trained on the CDF, which is a basic building block of the proposed learned index structures. We generalize our poisoning techniques to attack a more advanced two-stage design of learned index structures called the recursive model index (RMI), which has been shown to outperform traditional B-Trees. We evaluate our attacks on real-world and synthetic datasets under a wide variety of parameterizations of the model and show that the error of the RMI increases up to $300\times$ and the error of its second-stage models increases up to $3000\times$.
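The core mechanism the abstract describes can be sketched in a few lines: a one-stage learned index fits a regression model mapping keys to their ranks (the empirical CDF), and because ranks are cumulative, injecting keys shifts the positions of all subsequent keys at once. The following is a minimal illustrative sketch only, assuming a simple least-squares linear fit and a naive cluster injection; it is not the paper's optimized attack, and the function name `fit_cdf_model` is hypothetical.

```python
import numpy as np

def fit_cdf_model(keys):
    """Fit a linear model position ~ a*key + b to the ranks of the
    sorted keys (the empirical CDF), as in a one-stage learned index.
    Returns the coefficients and the maximum prediction error."""
    keys = np.sort(np.asarray(keys, dtype=float))
    positions = np.arange(len(keys))
    a, b = np.polyfit(keys, positions, 1)
    max_err = np.max(np.abs(a * keys + b - positions))
    return (a, b), max_err

# Benign data: near-uniform keys, so a single line fits the CDF well.
rng = np.random.default_rng(0)
benign = rng.uniform(0, 1000, size=1000)
_, err_clean = fit_cdf_model(benign)

# Naive injection (illustrative, not the paper's attack): a tight
# cluster of keys creates a step in the CDF, and every key ranked
# after the cluster shifts its position -- the cascading effect.
poison = np.full(200, 500.0)
_, err_poisoned = fit_cdf_model(np.concatenate([benign, poison]))

print(err_clean, err_poisoned)
```

Under these assumptions the poisoned fit's maximum error exceeds the clean one, because no single line can absorb the rank discontinuity the injected cluster introduces; the paper's attacks choose injection points to maximize this effect.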