Scalable Multi-node Fast Fourier Transform on GPUs

Paper Title

Scalable Multi-node Fast Fourier Transform on GPUs

Authors

Verma, Manthan; Chatterjee, Soumyadeep; Garg, Gaurav; Sharma, Bharatkumar; Arya, Nishant; Kumar, Shashi; Saxena, Anish; Verma, Mahendra K.

Abstract

In this paper, we present the details of our multi-node GPU-FFT library, as well as its scaling on the Selene HPC system. Our library employs slab decomposition for data division and MPI for communication among GPUs. We performed GPU-FFT on $1024^3$, $2048^3$, and $4096^3$ grids using a maximum of 512 A100 GPUs. We observed good scaling for the $4096^3$ grid with 64 to 512 GPUs. We report that the timings of multicore FFT of a $1536^3$ grid with 196608 cores of Cray XC40 are comparable to those of GPU-FFT of a $2048^3$ grid with 128 GPUs. The efficiency of GPU-FFT is due to the fast computation capabilities of the A100 cards and efficient communication via NVLink.
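
The slab decomposition mentioned in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: NumPy stands in for the GPU FFT kernels, plain list indexing stands in for the MPI all-to-all exchange, and the function name `slab_fft3d` and the parameter `num_ranks` are hypothetical names chosen for illustration. The sketch assumes both of the first two grid dimensions are divisible by the number of ranks.

```python
import numpy as np


def slab_fft3d(data, num_ranks):
    """Simulate a slab-decomposed 3D FFT on one process (illustrative only)."""
    n0, n1, _ = data.shape
    # Assumption: the grid divides evenly among the simulated ranks.
    assert n0 % num_ranks == 0 and n1 % num_ranks == 0

    # Step 1: each simulated rank owns a slab of contiguous planes along axis 0.
    slabs = np.split(data, num_ranks, axis=0)

    # Step 2: local 2D FFTs over the two axes that are complete within a slab.
    slabs = [np.fft.fft2(s, axes=(1, 2)) for s in slabs]

    # Step 3: all-to-all exchange (an MPI all-to-all in a real multi-GPU run).
    # Each rank cuts its slab into blocks along axis 1 and "sends" block j to
    # rank j, so every rank ends up with all of axis 0 for a slice of axis 1.
    blocks = [np.split(s, num_ranks, axis=1) for s in slabs]
    transposed = [
        np.concatenate([blocks[src][dst] for src in range(num_ranks)], axis=0)
        for dst in range(num_ranks)
    ]

    # Step 4: 1D FFT along axis 0, which is now complete on every rank.
    transposed = [np.fft.fft(s, axis=0) for s in transposed]

    # Gather the slabs only so the result can be compared with a reference.
    return np.concatenate(transposed, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.standard_normal((8, 8, 8))
    result = slab_fft3d(grid, num_ranks=4)
    print(np.allclose(result, np.fft.fftn(grid)))
```

In a slab scheme like this one, the number of ranks cannot exceed the number of planes along the decomposed axis, and the all-to-all exchange in step 3 is the only stage that needs communication between ranks.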
