Paper title
A performance portable, fully implicit Landau collision operator with batched linear solvers
Authors
Abstract
Modern accelerators use hierarchical parallel programming models that enable massive multithreading within a processing element (PE), with multiple PEs per device driven by traditional processes. Batching is a technique for exposing PE-level parallelism in algorithms that have traditionally run on MPI processes or on multiple threads within a single process. Opportunities for batching arise in, for example, kinetic discretizations of magnetized plasmas, where collisions are advanced in velocity space at each spatial point independently. This paper builds on previous work on a high-performance, fully nonlinear Landau collision operator by batching the linear solver, batching the spatial point problems, and adding new support for multiple grids for multiscale, multi-species problems. An anisotropic relaxation verification test that agrees well with previously published results and analytical models is presented. Performance results from NVIDIA A100 and AMD MI250X nodes are presented with a hardware utilization analysis for each architecture. The entire implicit Landau operator time advance is implemented in Kokkos for performance portability, runs entirely on the device, and is available in the PETSc numerical library.
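To make the idea of batching concrete, the following is a minimal sketch in plain Python, not the paper's Kokkos implementation and not the PETSc API. It assumes a hypothetical batch of small, diagonally dominant systems, one per spatial point, and swaps in a simple Jacobi iteration as a stand-in for the actual linear solver. The names `jacobi_solve`, `batched_solve`, and `residual_norm` are illustrative only.

```python
def jacobi_solve(A, b, tol=1e-12, max_iter=500):
    """Solve one small dense system A x = b with Jacobi iteration."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        # Update every unknown from the previous iterate only.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x


def batched_solve(As, bs):
    """Solve many independent systems.

    Each batch entry is independent of the others, so on an accelerator
    this outer loop is the one that could be distributed across PEs.
    Here it simply runs sequentially.
    """
    return [jacobi_solve(A, b) for A, b in zip(As, bs)]


def residual_norm(A, x, b):
    """Max-norm of the residual A x - b."""
    n = len(b)
    return max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))


# Hypothetical batch: one diagonally dominant 3x3 system per "spatial point".
num_points = 4
As = [
    [[4.0 + p, 1.0, 0.0],
     [1.0, 4.0 + p, 1.0],
     [0.0, 1.0, 4.0 + p]]
    for p in range(num_points)
]
bs = [[1.0, 2.0, 3.0] for _ in range(num_points)]

xs = batched_solve(As, bs)
residuals = [residual_norm(A, x, b) for A, x, b in zip(As, xs, bs)]
```

The design point is that no batch entry reads another entry's data, which is what allows the per-point solves to run concurrently. The sequential list comprehension in `batched_solve` only marks where that parallelism would be exposed.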