异步分布式内存三角计数和LCC带有RMA缓存

论文标题

异步分布式内存三角计数和LCC带有RMA缓存

Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching

论文作者

Strausz, András, Vella, Flavio, Di Girolamo, Salvatore, Besta, Maciej, Hoefler, Torsten

论文摘要

三角计数和局部聚类系数是两个用于图形分析的核心指标。他们在分析中发现了广泛的应用，例如社区检测和链接建议。当前的最新解决方案遭受分发图所需的同步开销或昂贵的预计，从而达到了有限的缩放功能。我们使用远程内存访问来传输数据并避免同步，我们建议基于1D分区的三角计数和本地聚类系数的完全异步实现。此外，我们展示了这些算法如何在远程内存访问中呈现数据重复使用，以及如何通过缓存这些访问来改善整体通信时间。最后，我们扩展了MPI RMA的软件层缓存系统Clampi，以包括用于缓存条目的应用特定分数，并影响驱逐程序以提高缓存效率。我们的结果显示了共享内存的改进，我们在分布式内存上的LiveJournal1图中实现了14倍的速度从4到64个节点。此外，我们演示了缓存远程访问的方式如何将总运行时间降低了73％，相对于非临床版本。最后，我们将我们的实施与2020年冠军纸Tric进行比较，并为无尺度图取得更快的结果。

Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. Current state-of-the-art solutions suffer from synchronization overheads or expensive pre-computations needed to distribute the graph, achieving limited scaling capabilities. We propose a fully asynchronous implementation for triangle counting and local clustering coefficient based on 1D partitioning, using remote memory accesses for transferring data and avoid synchronization. Additionally, we show how these algorithms present data reuse on remote memory accesses and how the overall communication time can be improved by caching these accesses. Finally, we extend CLaMPI, a software-layer caching system for MPI RMA, to include application-specific scores for cached entries and influence the eviction procedure to improve caching efficiency. Our results show improvements on shared memory, and we achieve 14x speedup from 4 to 64 nodes for the LiveJournal1 graph on distributed memory. Moreover, we demonstrate how caching remote accesses reduces total running time by up to 73% with respect to a non-cached version. Finally, we compare our implementation to TriC, the 2020 graph champion paper, and achieve up to 100x faster results for scale-free graphs.

下载PDF全文

下载文献需遵守相关版权规定

论文标题