Paper title
The Relational Data Borg is Learning
Paper authors
Paper abstract
This paper overviews an approach that addresses machine learning over relational data as a database problem. This is justified by two observations. First, the input to the learning task is commonly the result of a feature extraction query over the relational data. Second, the learning task requires the computation of group-by aggregates. This approach has already been investigated for a number of supervised and unsupervised learning tasks, including ridge linear regression, factorisation machines, support vector machines, decision trees, principal component analysis, and k-means, and also for linear algebra over data matrices. The main message of this work is that the runtime performance of machine learning can be dramatically boosted by a toolbox of techniques that exploit the knowledge of the underlying data. This includes theoretical development on the algebraic, combinatorial, and statistical structure of relational data processing, and systems development on code specialisation, low-level computation sharing, and parallelisation. These techniques aim at lowering both the complexity and the constant factors of the learning time.

This work is the outcome of extensive collaboration of the author with colleagues from RelationalAI, in particular Mahmoud Abo Khamis, Molham Aref, Hung Ngo, and XuanLong Nguyen, and from the FDB research project, in particular Ahmet Kara, Milos Nikolic, Maximilian Schleich, Amir Shaikhha, Jakub Zavodny, and Haozhe Zhang. The author would also like to thank the members of the FDB project for the figures and examples used in this paper.

The author is grateful for support from industry: Amazon Web Services, Google, Infor, LogicBlox, Microsoft Azure, RelationalAI; and from the funding agencies EPSRC and ERC. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 682588.
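The two observations in the abstract can be illustrated with a minimal sketch. It is not the author's system or algorithm. The relations `sales` and `items`, their contents, and all function names are hypothetical, and it fits plain one-feature least squares rather than the ridge regression named in the abstract. It assumes `item` is a key of `items`. The point it tries to show is that the model depends on the data only through a handful of sum aggregates over the feature extraction join, and that these can be obtained from group-by aggregates on the base relations without materialising the join.

```python
from collections import defaultdict

# Hypothetical toy relations, invented for illustration only.
sales = [  # (store, item, units)
    ("s1", "apple", 3.0),
    ("s1", "pear", 1.0),
    ("s2", "apple", 5.0),
    ("s2", "pear", 2.0),
    ("s3", "apple", 4.0),
]
items = [  # (item, price); assumption: item is a key of this relation
    ("apple", 2.0),
    ("pear", 4.0),
]


def aggregates_over_join(sales, items):
    """Enumerate the join Sales |x| Items row by row and sum
    1, x, y, x*x, x*y with feature x = price and label y = units."""
    price = dict(items)
    n = sx = sy = sxx = sxy = 0.0
    for _store, item, units in sales:
        if item in price:
            x = price[item]
            n += 1
            sx += x
            sy += units
            sxx += x * x
            sxy += x * units
    return n, sx, sy, sxx, sxy


def aggregates_pushed_down(sales, items):
    """Compute the same sums without enumerating the join: first a
    group-by on the join key over Sales, then one pass over Items."""
    per_item = defaultdict(lambda: [0.0, 0.0])  # item -> [count, sum(units)]
    for _store, item, units in sales:
        per_item[item][0] += 1
        per_item[item][1] += units
    n = sx = sy = sxx = sxy = 0.0
    for item, x in items:
        if item in per_item:
            count, sum_units = per_item[item]
            n += count
            sx += x * count
            sy += sum_units
            sxx += x * x * count
            sxy += x * sum_units
    return n, sx, sy, sxx, sxy


def fit_line(n, sx, sy, sxx, sxy):
    """Closed-form least squares for y = slope * x + intercept,
    using only the five aggregates."""
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    print(fit_line(*aggregates_pushed_down(sales, items)))
```

On this toy input both functions return the same five sums, so the fitted line is identical. The pushed-down variant touches each base relation once, and its second pass runs over the distinct join keys, not over the rows of the join.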