Paper Title
Guided Exploration of Data Summaries
Paper Authors
Paper Abstract
Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed as a one-shot process whose purpose is to find the best summary. A useful summary contains k individually uniform sets that are collectively diverse enough to be representative. Uniformity addresses interpretability, and diversity addresses representativity. Finding such a summary is a difficult task when the data is large and highly diverse. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize Eda4Sum, the problem of guided exploration of data summaries, which seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. Eda4Sum generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) Top1Sum, which chooses the most useful summary at each step; (ii) RLSum, which trains a policy with Deep Reinforcement Learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.