Extensible Data Skipping

Paper Title

Extensible Data Skipping

Authors

Paula Ta-Shma, Guy Khazma, Gal Lushi, Oshrit Feder

Abstract

Data skipping reduces I/O for SQL queries by skipping over irrelevant data objects (files) based on their metadata. We extend this notion by allowing developers to define their own data skipping metadata types and indexes using a flexible API. Our framework is the first to natively support data skipping for arbitrary data types (e.g., geospatial, logs) and queries with User Defined Functions (UDFs). We integrated our framework with Apache Spark, and it is now deployed across multiple products and services at IBM. We present our extensible data skipping APIs, discuss index design, and implement various metadata indexes, requiring only around 30 lines of additional code per index. In particular, we implement data skipping for a third-party library with geospatial UDFs and demonstrate speedups of two orders of magnitude. Our centralized metadata approach provides a 3.6x speedup even when compared to queries that are rewritten to exploit Parquet min/max metadata. We demonstrate that extensible data skipping is applicable to a broad class of applications, where user-defined indexes achieve significant speedups and cost savings with very low development cost.
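
As an illustration of the basic idea in the abstract's first sentence, the following minimal Python sketch shows min/max data skipping over a centralized metadata catalog. It is not the paper's API or its Spark integration: the names `ObjectMetadata`, `objects_to_scan`, and the file names are hypothetical, and the sketch assumes a single numeric column with a simple range predicate.

```python
from dataclasses import dataclass


@dataclass
class ObjectMetadata:
    """Hypothetical per-object summary: min and max of one numeric column."""
    name: str
    min_value: float
    max_value: float


def objects_to_scan(metadata, lower, upper):
    """Return names of objects that may hold rows with lower <= value <= upper.

    An object is skipped only when its [min, max] range cannot overlap the
    query range, so skipping never drops a matching row.
    """
    return [
        m.name
        for m in metadata
        if m.max_value >= lower and m.min_value <= upper
    ]


# Hypothetical catalog, kept separately from the data objects themselves.
catalog = [
    ObjectMetadata("part-0.parquet", 0.0, 9.5),
    ObjectMetadata("part-1.parquet", 10.0, 19.5),
    ObjectMetadata("part-2.parquet", 20.0, 29.5),
]

# Query: WHERE value BETWEEN 12.0 AND 21.0
selected = objects_to_scan(catalog, 12.0, 21.0)
print(selected)  # ['part-1.parquet', 'part-2.parquet']
```

Here `part-0.parquet` is never read, because its maximum value lies below the query range. An extensible framework of the kind the abstract describes would generalize the overlap check to other metadata types and to predicates involving UDFs.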
