Stage-parallel fully implicit Runge-Kutta implementations with optimal multilevel preconditioners at the scaling limit

Paper title

Stage-parallel fully implicit Runge-Kutta implementations with optimal multilevel preconditioners at the scaling limit

Paper authors

Peter Munch, Ivo Dravins, Martin Kronbichler, Maya Neytcheva

Abstract

We present an implementation of a fully stage-parallel preconditioner for Radau IIA type fully implicit Runge-Kutta methods. The preconditioner approximates the inverse of $A_Q$ from the Butcher tableau by the lower triangular matrix resulting from an LU decomposition and diagonalizes the system with as many blocks as stages. For the transformed system, we employ a block preconditioner in which each block is distributed to and solved by a subgroup of processes in parallel. To combine partial results, we use either a communication pattern resembling Cannon's algorithm or shared memory. A performance model and a large set of performance studies (including strong-scaling runs with up to 150k processes on 3k compute nodes), conducted for a time-dependent heat problem using matrix-free finite element methods, indicate that the stage-parallel implementation can reach higher throughput when the block solvers operate at lower parallel efficiencies, which occurs near the scaling limit. The achievable speedup increases linearly with the number of stages and is bounded by the number of stages. Furthermore, we show that the presented stage-parallel concepts are also applicable when $A_Q$ is diagonalized directly, which requires complex arithmetic or the solution of two-by-two blocks and sequentializes parts of the algorithm. As an alternative to distributing stages and assigning them to distinct processes, we discuss the possibility of batching operations from different stages together.
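
To make the construction in the abstract concrete, the following is a minimal sketch for the two-stage Radau IIA method; it is not the authors' implementation. It assumes a Crout-style factorization $A_Q^{-1} = LU$ with a unit-diagonal $U$, so that the lower triangular factor $L$ has distinct diagonal entries and can be diagonalized as $L = S \Lambda S^{-1}$. The helper names (`inv2`, `crout_lu`, `lower_eigvecs`, `matmul`) are illustrative and do not come from the paper.

```python
from fractions import Fraction as F

# Butcher matrix A_Q of the two-stage Radau IIA method (standard coefficients).
A = [[F(5, 12), F(-1, 12)],
     [F(3, 4),  F(1, 4)]]

def inv2(M):
    """Exact inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[ M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det,  M[0][0] / det]]

def crout_lu(M):
    """Factor M = L U without pivoting; L lower triangular, U unit upper triangular.
    Assumption: this convention stands in for the LU decomposition in the abstract."""
    n = len(M)
    L = [[F(0)] * n for _ in range(n)]
    U = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for j in range(n):
        for i in range(j, n):        # column j of L
            L[i][j] = M[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
        for i in range(j + 1, n):    # row j of U
            U[j][i] = (M[j][i] - sum(L[j][k] * U[k][i] for k in range(j))) / L[j][j]
    return L, U

def lower_eigvecs(L):
    """Eigenvectors (columns of S) of a lower triangular matrix with distinct diagonal."""
    n = len(L)
    S = [[F(0)] * n for _ in range(n)]
    for j in range(n):
        S[j][j] = F(1)
        for i in range(j + 1, n):
            s = sum(L[i][k] * S[k][j] for k in range(j, i))
            S[i][j] = s / (L[j][j] - L[i][i])
    return S

def matmul(X, Y):
    """Plain dense matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

L, U = crout_lu(inv2(A))
S = lower_eigvecs(L)
# Eigenvalues of a triangular matrix are its diagonal entries.
Lam = [[L[i][i] if i == j else F(0) for j in range(2)] for i in range(2)]

# L S = S Lam confirms the diagonalization; each diagonal entry of Lam
# then corresponds to one decoupled stage block.
assert matmul(L, S) == matmul(S, Lam)
print("L =", L)
print("eigenvalues =", [Lam[i][i] for i in range(2)])
```

For this tableau the sketch yields a lower triangular factor with diagonal entries 3/2 and 4. After the change of basis by $S$, the two stage blocks no longer depend on each other, which is what allows each one to be handed to its own subgroup of processes.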
