论文标题

一个基于I/O高效磁盘的图形系统,用于可扩展的二阶随机步行大图

An I/O-Efficient Disk-based Graph System for Scalable Second-Order Random Walk of Large Graphs

论文作者

Li, Hongzheng, Shao, Yingxia, Du, Junping, Cui, Bin, Chen, Lei

论文摘要

随机步行广泛用于许多图形分析任务,尤其是一阶随机步行。但是,作为简化现实世界问题的简化,一阶随机步行在对数据中的高阶结构进行建模时很差。最近,二阶随机步行应用程序(例如Node2Vec,二阶Pagerank)变得有吸引力。由于二阶随机步行模型和内存限制的复杂性,在单台计算机上运行二阶随机步行应用程序是无法扩展的。现有的基于磁盘的图形系统仅对一阶随机步行型号友好,并且在执行二阶随机步行时会遭受昂贵的磁盘I/OS。本文引入了一个基于I/O高效磁盘的图形系统,用于可扩展的二阶随机步行大图,称为Grasorw。首先,为了消除大量的Light顶点I/OS,我们开发了一个双块执行引擎,该引擎通过应用新的三角形双块调度策略,基于桶的步行管理和偏斜的步行存储,将随机I/OS转换为顺序I/OS。其次,为了改善I/O利用率,我们设计了一个基于学习的块加载模型,以利用全负载和按需负载方法的优势。最后,我们对六个大型真实数据集以及几个合成数据集进行了广泛的实验。经验结果表明,与现有的基于磁盘的图形系统相比,Grasorw中流行任务的端到端时间成本减少了多个数量级。

Random walk is widely used in many graph analysis tasks, especially the first-order random walk. However, as a simplification of real-world problems, the first-order random walk is poor at modeling higher-order structures in the data. Recently, second-order random walk-based applications (e.g., Node2vec, Second-order PageRank) have become attractive. Due to the complexity of the second-order random walk models and memory limitations, it is not scalable to run second-order random walk-based applications on a single machine. Existing disk-based graph systems are only friendly to the first-order random walk models and suffer from expensive disk I/Os when executing the second-order random walks. This paper introduces an I/O-efficient disk-based graph system for the scalable second-order random walk of large graphs, called GraSorw. First, to eliminate massive light vertex I/Os, we develop a bi-block execution engine that converts random I/Os into sequential I/Os by applying a new triangular bi-block scheduling strategy, the bucket-based walk management, and the skewed walk storage. Second, to improve the I/O utilization, we design a learning-based block loading model to leverage the advantages of the full-load and on-demand load methods. Finally, we conducted extensive experiments on six large real datasets as well as several synthetic datasets. The empirical results demonstrate that the end-to-end time cost of popular tasks in GraSorw is reduced by more than one order of magnitude compared to the existing disk-based graph systems.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源