Paper Title

Multi-GPU thermal lattice Boltzmann simulations using OpenACC and MPI

Paper Authors

Ao Xu, Bo-Tao Li

Paper Abstract

We assess the performance of the hybrid Open Accelerator (OpenACC) and Message Passing Interface (MPI) approach for multi-graphics processing unit (GPU) accelerated thermal lattice Boltzmann (LB) simulations. OpenACC accelerates computation on a single GPU, and MPI synchronizes information between multiple GPUs. With a single GPU, the two-dimensional (2D) simulation achieved 1.93 billion lattice updates per second (GLUPS) with a grid number of $8193^{2}$, and the three-dimensional (3D) simulation achieved 1.04 GLUPS with a grid number of $385^{3}$, which is more than 76% of the theoretical maximum performance. On multiple GPUs, we adopt block partitioning, overlapping of communication with computation, and concurrent computation to optimize parallel efficiency. We show that in the strong scaling test, using 16 GPUs, the 2D simulation achieved 30.42 GLUPS and the 3D simulation achieved 14.52 GLUPS. In the weak scaling test, the parallel efficiency remains above 99% up to 16 GPUs. Our results demonstrate that, with improved data and task management, the hybrid OpenACC and MPI technique is promising for thermal LB simulations on multiple GPUs.
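For reference, GLUPS counts lattice-node updates: a 2D grid advanced $N_t$ steps in $t$ seconds yields $N_x N_y N_t / (10^9\, t)$ GLUPS. The sketch below is not the authors' implementation; it only illustrates the multi-GPU pattern the abstract names: block partitioning along one axis, boundary rows updated on one OpenACC async queue, and the halo exchange issued while the interior updates concurrently on a second queue. The grid sizes NX and NY, the stand-in update_rows kernel (a plain copy in place of the thermal LB collision-streaming step), and the use of CUDA-aware MPI via host_data are all assumptions for illustration.

/*
 * Minimal sketch (assumptions noted above) of the hybrid OpenACC + MPI
 * pattern: each MPI rank drives one GPU and owns one block of the
 * lattice; rows 0 and NY+1 are ghost rows.
 */
#include <mpi.h>
#include <openacc.h>
#include <stdlib.h>

#define NX 8192                       /* local block width  (assumption) */
#define NY 512                        /* local block height (assumption) */
#define IDX(j, i) ((j) * NX + (i))

/* Placeholder for one LB collision-streaming step over rows j0..j1-1. */
static void update_rows(const double *f, double *fnew, int j0, int j1,
                        int queue)
{
    #pragma acc parallel loop collapse(2) async(queue) \
        present(f[0:NX * (NY + 2)], fnew[0:NX * (NY + 2)])
    for (int j = j0; j < j1; ++j)
        for (int i = 0; i < NX; ++i)
            fnew[IDX(j, i)] = f[IDX(j, i)];      /* stand-in kernel */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block partitioning along y: one GPU per rank (assumes >= 1 GPU). */
    int ngpu = acc_get_num_devices(acc_device_nvidia);
    if (ngpu > 0)
        acc_set_device_num(rank % ngpu, acc_device_nvidia);

    size_t n = (size_t)NX * (NY + 2);
    double *f    = calloc(n, sizeof(double));
    double *fnew = calloc(n, sizeof(double));
    int up = (rank + 1) % size, down = (rank - 1 + size) % size;

    #pragma acc data copyin(f[0:n]) create(fnew[0:n])
    {
        /* 1. Boundary rows (1 and NY) on async queue 1. */
        update_rows(f, fnew, 1, 2, 1);
        update_rows(f, fnew, NY, NY + 1, 1);

        /* 2. Interior rows on queue 2, concurrent with the exchange. */
        update_rows(f, fnew, 2, NY, 2);

        /* 3. Halo exchange once the boundary rows are ready; host_data
              passes device pointers to MPI (assumes CUDA-aware MPI). */
        #pragma acc wait(1)
        #pragma acc host_data use_device(fnew)
        {
            MPI_Request req[4];
            MPI_Isend(&fnew[IDX(NY, 0)],     NX, MPI_DOUBLE, up,   0,
                      MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(&fnew[IDX(0, 0)],      NX, MPI_DOUBLE, down, 0,
                      MPI_COMM_WORLD, &req[1]);
            MPI_Isend(&fnew[IDX(1, 0)],      NX, MPI_DOUBLE, down, 1,
                      MPI_COMM_WORLD, &req[2]);
            MPI_Irecv(&fnew[IDX(NY + 1, 0)], NX, MPI_DOUBLE, up,   1,
                      MPI_COMM_WORLD, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        }
        #pragma acc wait(2)               /* interior done; step complete */
    }

    free(f);
    free(fnew);
    MPI_Finalize();
    return 0;
}

Because the interior update on queue 2 dominates the runtime, the halo exchange on queue 1 is hidden behind it, which is the overlap the abstract credits for the near-ideal weak-scaling efficiency.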
