Paper Title

Reshape: Adaptive Result-aware Skew Handling for Exploratory Analysis on Big Data

Authors

Avinash Kumar, Sadeem Alsudais, Shengquan Ni, Zuozhi Wang, Yicong Huang, Chen Li

Abstract

The process of data analysis, especially in GUI-based analytics systems, is highly exploratory. The user iteratively refines a workflow multiple times before arriving at the final workflow. In such an exploratory setting, it is valuable to the user if the initial results of the workflow are representative of the final answers so that the user can refine the workflow without waiting for the completion of its execution. Partitioning skew may lead to the production of misleading initial results during the execution. In this paper, we explore skew and its mitigation strategies from the perspective of the results shown to the user. We present a novel framework called Reshape that can adaptively handle partitioning skew in pipelined execution. Reshape employs a two-phase approach that transfers load in a fine-tuned manner to mitigate skew iteratively during execution, thus enabling it to handle changes in input-data distribution. Reshape has the ability to adaptively adjust skew-handling parameters, which reduces the technical burden on the users. Reshape supports a variety of operators such as HashJoin, Group-by, and Sort. We implemented Reshape on top of two big data engines, namely Amber and Flink, to demonstrate its generality and efficiency, and report an experimental evaluation using real and synthetic datasets.
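To make the notion of partitioning skew in the abstract concrete, the following is a minimal illustrative sketch, not Reshape's actual algorithm. It assumes integer keys partitioned by modulo across workers, a hypothetical load-ratio threshold for detecting skew, and a single one-shot transfer of load from the most loaded worker to the least loaded one. The function names (`partition`, `find_skew`, `rebalance`) and the `ratio` parameter are invented for this example; the paper's two-phase, iterative approach during pipelined execution is more involved.

```python
def partition(keys, num_workers):
    """Count how many tuples each worker receives under key-modulo partitioning."""
    loads = {w: 0 for w in range(num_workers)}
    for k in keys:
        loads[k % num_workers] += 1
    return loads


def find_skew(loads, ratio=2.0):
    """Return (skewed, helper) if the busiest worker holds more than
    `ratio` times the load of the least busy worker, else None.
    The threshold is a hypothetical knob chosen for illustration."""
    skewed = max(loads, key=loads.get)
    helper = min(loads, key=loads.get)
    if loads[skewed] > ratio * max(loads[helper], 1):
        return skewed, helper
    return None


def rebalance(loads, skewed, helper):
    """Move half of the load difference from the skewed worker to the helper."""
    moved = (loads[skewed] - loads[helper]) // 2
    new_loads = dict(loads)
    new_loads[skewed] -= moved
    new_loads[helper] += moved
    return new_loads


# A skewed input: key 0 accounts for 80 of 100 tuples.
keys = [0] * 80 + [1] * 10 + [2] * 10
loads = partition(keys, num_workers=3)
pair = find_skew(loads)
balanced = rebalance(loads, *pair) if pair else loads
print(loads, pair, balanced)
```

In this toy setting, the worker that receives key 0 processes most of the input, so early results would over-represent the other keys' partitions relative to the final answer. Shifting part of that load to a helper evens out progress across workers, which is the intuition behind result-aware skew handling.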
