Paper Title

Explicit caching HYB: a new high-performance SpMV framework on GPGPU

Paper Authors

Chen, Chong

Paper Abstract

Sparse matrix-vector multiplication (SpMV) is a critical operation in the iterative solvers of finite element methods (FEM) used in computer simulation. Since SpMV is a memory-bound algorithm, the efficiency of data movement heavily influences its performance on GPUs. In recent years, much research has been conducted on accelerating SpMV on graphics processing units (GPUs). The performance-optimization methods used in existing studies focus on two areas: improving load balancing between GPU processors and reducing execution divergence between GPU threads. Although some studies have made preliminary optimizations to input-vector fetching, the effect of explicitly caching the input vector on GPU-based SpMV has not yet been studied in depth. In this study, we minimize the data-movement cost of GPU-based SpMV with a new framework named Explicit Caching Hybrid (EHYB). The EHYB framework achieves significant performance improvements in two ways: (1) it improves the speed of data movement by partitioning the input vector and explicitly caching the partitions in the shared memory of the CUDA kernel, and (2) it reduces the volume of data movement by storing the major part of the column indices in a compact format. We tested our implementation with sparse matrices derived from FEM applications in different areas. The experimental results show that our implementation outperforms state-of-the-art implementations with significant speedups and achieves higher FLOPS than the theoretical performance upper bound of existing GPU-based SpMV implementations.
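To make the two techniques in the abstract concrete, below is a minimal CUDA sketch; it is not the authors' EHYB code. The kernel name spmv_cached_slice, the SLICE_WIDTH constant, the slice_start array, and the simplified column-major ELL layout are all illustrative assumptions. Each thread block first stages its partition of the input vector in shared memory (explicit caching), and column indices are stored as 16-bit offsets relative to the partition start (a compact index format), so index traffic shrinks and vector fetches hit shared memory instead of DRAM.

// Illustrative sketch only, not the authors' EHYB implementation. Names and
// layout are assumptions for demonstration purposes.
#include <cuda_runtime.h>
#include <stdint.h>

#define SLICE_WIDTH 1024  // assumed: input-vector entries cached per block

// Simplifying assumption: every nonzero handled by a block has its column
// inside that block's slice; a HYB-style format would route the remaining
// nonzeros to a separate (e.g., COO-like) part.
__global__ void spmv_cached_slice(int rows, int max_nnz_per_row,
                                  const uint16_t *local_col,  // column offsets within the slice (compact 16-bit)
                                  const float *val,           // nonzero values, column-major ELL, zero-padded
                                  const int *slice_start,     // first input-vector index of each block's slice
                                  const float *x, float *y)
{
    __shared__ float x_cache[SLICE_WIDTH];

    // Explicitly cache this block's partition of the input vector
    // (x is assumed padded so the slice never reads out of bounds).
    int base = slice_start[blockIdx.x];
    for (int i = threadIdx.x; i < SLICE_WIDTH; i += blockDim.x)
        x_cache[i] = x[base + i];
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float sum = 0.0f;
    for (int k = 0; k < max_nnz_per_row; ++k) {
        // Column-major ELL: entry k of this row sits at k * rows + row.
        int idx = k * rows + row;
        // A 16-bit local offset replaces a 32-bit global column index,
        // and the vector fetch hits shared memory instead of DRAM.
        sum += val[idx] * x_cache[local_col[idx]];
    }
    y[row] = sum;
}

The cooperative load loop corresponds to the abstract's first method (faster data movement via explicit caching), and the uint16_t local_col array corresponds to the second (less data moved via compact column indices).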
