论文标题
基于平行索引的结构图聚类及其近似
Parallel Index-Based Structural Graph Clustering and Its Approximation
论文作者
论文摘要
扫描(网络的结构聚类算法)是一种经过良好使用的,广泛使用的图形聚类算法。但是,对于大图,顺序扫描变体非常缓慢,并且并行扫描变体不能有效地在具有不同扫描参数设置的查询之间共享工作。由于扫描的用户通常会探索许多参数设置以找到良好的聚类,因此值得预先计算加快查询的索引。 本文介绍了一种基于GS*index的实用且有效的基于平行指数的扫描算法,这是一种最近的顺序算法。我们的平行算法通过使用整数分选来改善顺序算法的渐近工作。它也是高度平行的,可以实现索引构造和聚类查询的对数跨度(并行时间)。此外,我们应用对局部敏感的哈希(LSH)来设计一种新型的近似扫描算法,并证明其聚类行为可以保证。 我们对大型现实图表上的算法进行了实验评估。在带有双向超线程的48核机上,我们的平行索引构造实现了50--151 $ \ times $ $加速,而GS* - 索引的构建。实际上,即使在一个线程上,我们的索引构造算法也比GS*索引快。我们的并行索引查询实现实现了5--32 $ \ times $在一系列扫描参数值上查询的速度,并且我们的实现始终比PPSCAN(一种最先进的并行扫描算法)更快。此外,我们的实验表明,应用LSH会导致索引构建更快,同时保持良好的聚类质量。
SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share work among queries with different SCAN parameter settings. Since users of SCAN often explore many parameter settings to find good clusterings, it is worthwhile to precompute an index that speeds up queries. This paper presents a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel algorithm improves upon the asymptotic work of the sequential algorithm by using integer sorting. It is also highly parallel, achieving logarithmic span (parallel time) for both index construction and clustering queries. Furthermore, we apply locality-sensitive hashing (LSH) to design a novel approximate SCAN algorithm and prove guarantees for its clustering behavior. We present an experimental evaluation of our algorithms on large real-world graphs. On a 48-core machine with two-way hyper-threading, our parallel index construction achieves 50--151$\times$ speedup over the construction of GS*-Index. In fact, even on a single thread, our index construction algorithm is faster than GS*-Index. Our parallel index query implementation achieves 5--32$\times$ speedup over GS*-Index queries across a range of SCAN parameter values, and our implementation is always faster than ppSCAN, a state-of-the-art parallel SCAN algorithm. Moreover, our experiments show that applying LSH results in faster index construction while maintaining good clustering quality.