MIDAS：来自现实世界超图的代表抽样

论文标题

MIDAS：来自现实世界超图的代表抽样

MiDaS: Representative Sampling from Real-world Hypergraphs

论文作者

Choe, Minyoung, Yoo, Jaemin, Lee, Geon, Baek, Woonsung, Kang, U, Shin, Kijung

论文摘要

图被广泛用于表示复杂系统中的成对相互作用。由于这样的现实图表很大且经常不断增长，因此对于各种目的，对小的代表子图进行采样是必不可少的：模拟，可视化，流处理，表示学习，爬行，仅举几例。但是，许多复杂的系统包括组互动（例如，研究人员的协作和在线问答平台的讨论），因此，与普通图相比，超图（即集合集合）可以更自然，准确地表示它们。由于大规模超图的流行，我们研究了来自现实世界中超图的代表性抽样问题，目的是回答（Q1）代表性的子杂志是什么是（Q2），我们如何在没有广泛搜索的情况下迅速找到代表性的人。关于Q1，我们建议通过将其与整个超图进行比较，以十个图级，超边级和节点级别的统计量来衡量亚杂志的好处。关于Q2，我们首先分析11种真实世界超图中六种直观方法的特征。然后，基于分析，我们提出了MIDAS，该MIDAS绘制了具有高度淋巴结的人的偏见。通过广泛的实验，我们证明了MIDAS是（a）代表性：在13种考虑的方法中找到总体代表性的样本，（b）快速：比最强大的竞争对手快几个数量级，该竞争对手进行了广泛的搜索，并且（c）自动：迅速搜索适当程度的偏见。

Graphs are widely used for representing pairwise interactions in complex systems. Since such real-world graphs are large and often evergrowing, sampling a small representative subgraph is indispensable for various purposes: simulation, visualization, stream processing, representation learning, crawling, to name a few. However, many complex systems consist of group interactions (e.g., collaborations of researchers and discussions on online Q&A platforms), and thus they can be represented more naturally and accurately by hypergraphs (i.e., sets of sets) than by ordinary graphs. Motivated by the prevalence of large-scale hypergraphs, we study the problem of representative sampling from real-world hypergraphs, aiming to answer (Q1) what a representative sub-hypergraph is and (Q2) how we can find a representative one rapidly without an extensive search. Regarding Q1, we propose to measure the goodness of a sub-hypergraph by comparing it with the entire hypergraph in terms of ten graph-level, hyperedge-level, and node-level statistics. Regarding Q2, we first analyze the characteristics of six intuitive approaches in 11 real-world hypergraphs. Then, based on the analysis, we propose MiDaS, which draws hyperedges with a bias towards those with high-degree nodes. Through extensive experiments, we demonstrate that MiDaS is (a) Representative: finding overall the most representative samples among 13 considered approaches, (b) Fast: several orders of magnitude faster than the strongest competitors, which performs an extensive search, and (c) Automatic: rapidly searching a proper degree of bias.

下载PDF全文

下载文献需遵守相关版权规定

论文标题