Paper title
End-to-end GPU acceleration of low-order-refined preconditioning for high-order finite element discretizations
Paper authors
Paper abstract
In this paper, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in $H({\rm curl})$ and $H({\rm div})$ (e.g., for electromagnetic or radiation diffusion problems), a specially constructed interpolation-histopolation basis is used to ensure fast convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms. The kernel throughput of each of the main algorithmic components is measured, and the strong and weak parallel scalability of the methods is demonstrated. The different relative weighting and significance of the algorithmic components on GPUs and CPUs are discussed. Results on problems involving adaptively refined nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion problem using all spaces of the finite element de Rham complex is illustrated.
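The spectral equivalence that the abstract relies on can be checked numerically in a much simpler setting. The following is a minimal sketch, not the paper's implementation: it assumes a single 1D element on [-1, 1], a scalar Poisson (stiffness) operator with homogeneous Dirichlet conditions, a nodal basis at the Gauss-Lobatto-Legendre points, and dense matrices on the CPU. The function names (`gll_nodes`, `high_order_stiffness`, `low_order_stiffness`, `lor_spectrum`) are hypothetical and chosen only for illustration. It compares the degree-p stiffness matrix with the piecewise-linear stiffness matrix assembled on the mesh whose vertices are the same nodes, and reports the generalized eigenvalues relating the two.

```python
import numpy as np
from numpy.polynomial import legendre as leg


def gll_nodes(p):
    """Gauss-Lobatto-Legendre nodes on [-1, 1]: endpoints plus the roots of P_p'."""
    interior = leg.Legendre.basis(p).deriv().roots()
    return np.concatenate(([-1.0], np.sort(interior.real), [1.0]))


def high_order_stiffness(x):
    """Stiffness matrix of the degree-p Lagrange basis defined on the nodes x."""
    p = len(x) - 1
    V = leg.legvander(x, p)        # V[i, k] = P_k(x[i])
    C = np.linalg.inv(V)           # column j: Legendre coefficients of basis function j
    dC = leg.legder(C, axis=0)     # Legendre coefficients of the derivatives
    xq, wq = leg.leggauss(p + 1)   # Gauss rule, exact for the degree 2p-2 integrand
    G = leg.legval(xq, dC)         # G[j, q] = derivative of basis function j at xq[q]
    return (G * wq) @ G.T


def low_order_stiffness(x):
    """Piecewise-linear stiffness matrix on the refined mesh with vertices x."""
    n = len(x)
    A = np.zeros((n, n))
    for i, h in enumerate(np.diff(x)):
        A[i:i + 2, i:i + 2] += np.array([[1.0, -1.0], [-1.0, 1.0]]) / h
    return A


def lor_spectrum(p):
    """Eigenvalues of A_lor^{-1} A_ho restricted to the interior (Dirichlet) nodes."""
    x = gll_nodes(p)
    A_ho = high_order_stiffness(x)[1:-1, 1:-1]
    A_lor = low_order_stiffness(x)[1:-1, 1:-1]
    L = np.linalg.cholesky(A_lor)
    # Symmetric form L^{-1} A_ho L^{-T} has the same eigenvalues as A_lor^{-1} A_ho.
    M = np.linalg.solve(L, np.linalg.solve(L, A_ho).T)
    return np.linalg.eigvalsh(0.5 * (M + M.T))


if __name__ == "__main__":
    for p in (2, 4, 8, 16):
        lam = lor_spectrum(p)
        print(f"p={p:2d}  min={lam.min():.4f}  max={lam.max():.4f}  "
              f"cond={lam.max() / lam.min():.4f}")
```

In this toy setting the ratio of the largest to the smallest eigenvalue is the condition number of the low-order-preconditioned high-order operator, so watching it as p grows gives a feel for what "constants of equivalence independent of polynomial degree" means. The sketch says nothing about the matrix-free GPU kernels, the algebraic multigrid step, or the $H({\rm curl})$ and $H({\rm div})$ bases discussed in the abstract.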