Paper title
EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal In GPUs
Authors
Abstract
Modern analytics and recommendation systems are increasingly based on graph data that capture the relations between the entities being analyzed. Practical graphs come in huge sizes, offer massive parallelism, and are stored in sparse-matrix formats such as CSR. To exploit this parallelism, developers are increasingly interested in using GPUs for graph traversal. However, due to their sizes, graphs often do not fit into GPU memory. Prior work has used either input data pre-processing/partitioning or UVM to migrate chunks of data from host memory to GPU memory. The large, multi-dimensional, and sparse nature of graph data presents a major challenge to these schemes: it results in significant amplification of data movement and reduced effective data throughput. In this work, we propose EMOGI, an alternative approach to traversing graphs that do not fit in GPU memory, using direct cacheline-sized accesses to data stored in host memory. This paper addresses the open question of whether a sufficiently large number of overlapping cacheline-sized accesses can be sustained to 1) tolerate the long latency to host memory, 2) fully utilize the available bandwidth, and 3) achieve favorable execution performance. We analyze the data access patterns of several graph traversal applications on GPUs over PCIe, using an FPGA, to understand the cause of poor external bandwidth utilization. By carefully coalescing and aligning external memory requests, we show that we can minimize the number of PCIe transactions and nearly fully utilize the PCIe bandwidth even with direct cacheline-sized accesses to host memory. EMOGI achieves a 2.92$\times$ speedup on average compared to optimized UVM implementations in various graph traversal applications. We also show that EMOGI scales better than a UVM-based solution when the system uses higher-bandwidth interconnects such as PCIe 4.0.
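The effect of aligning requests to cacheline boundaries can be illustrated with a small counting model. This is only a sketch under stated assumptions, not the authors' implementation: the 128-byte request granularity, the 32-thread warp, the 4-byte edge entries, and the function names `lines_touched`, `requests_unaligned`, and `requests_aligned` are all hypothetical choices made for illustration. The model assumes one warp walks the contiguous neighbor list of a single vertex in a CSR edge array and counts how many cacheline-sized requests each warp-wide step would generate.

```python
CACHELINE_BYTES = 128  # assumed request granularity
WARP_SIZE = 32         # assumed threads per warp
ELEM_BYTES = 4         # assumed size of one edge-list entry


def lines_touched(first, last_exclusive):
    """Count the distinct cachelines covering elements [first, last_exclusive)."""
    if last_exclusive <= first:
        return 0
    first_line = (first * ELEM_BYTES) // CACHELINE_BYTES
    last_line = (last_exclusive * ELEM_BYTES - 1) // CACHELINE_BYTES
    return last_line - first_line + 1


def requests_unaligned(start, end):
    """Warp steps through [start, end) beginning exactly at `start`."""
    total = 0
    for base in range(start, end, WARP_SIZE):
        total += lines_touched(base, min(base + WARP_SIZE, end))
    return total


def requests_aligned(start, end):
    """Warp begins at the cacheline boundary at or below `start`.

    Lanes that would fall before `start` are treated as masked off,
    so every warp-wide step stays within a single cacheline.
    """
    elems_per_line = CACHELINE_BYTES // ELEM_BYTES
    aligned_start = (start // elems_per_line) * elems_per_line
    total = 0
    for base in range(aligned_start, end, WARP_SIZE):
        lo = max(base, start)            # skip masked lanes before the list
        hi = min(base + WARP_SIZE, end)  # skip lanes past the end of the list
        total += lines_touched(lo, hi)
    return total


if __name__ == "__main__":
    # Neighbor list occupying edge-array elements 5..104 (100 neighbors).
    print(requests_unaligned(5, 105), requests_aligned(5, 105))
```

Under these assumptions, a neighbor list that starts at element 5 and holds 100 entries costs 7 requests when the warp starts at the list head, because most steps straddle a cacheline boundary, and 4 requests when the warp starts at the preceding boundary. Real hardware behavior depends on the GPU and interconnect, so the numbers indicate the direction of the effect only.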