分布式内存$ \ MATHCAL {H} $ - 矩阵代数I：数据分布和矩阵-Dector乘法

论文标题

分布式内存$ \ MATHCAL {H} $ - 矩阵代数I：数据分布和矩阵-Dector乘法

Distributed-memory $\mathcal{H}$-matrix Algebra I: Data Distribution and Matrix-vector Multiplication

论文作者

Li, Yingzhou, Poulson, Jack, Ying, Lexing

论文摘要

我们介绍了$ \ Mathcal {H} $ - 矩阵和$ \ Mathcal {H} $ -Matrix-vector乘法的分布式 - 内存算法的数据分发方案。我们的数据分配方案避免了以前工作中使用的昂贵$ω（p^2）$调度程序，其中$ p $是流程的数量，而数据平衡的保存良好。基于数据分布，我们的分布式内存算法平均分布在$ p $过程中，并采用一种新颖的树木通信算法来降低潜伏成本。 The overall complexity of our algorithm is $O\Big(\frac{N \log N}{P} + α\log P + β\log^2 P \Big)$ for $\mathcal{H}$-matrices under weak admissibility condition, where $N$ is the matrix size, $α$ denotes the latency, and $β$ denotes the inverse带宽。从数值上讲，我们的算法用于解决各种过程中各种大小的二维问题和三维问题。在数千个过程中，仍观察到良好的并行效率。

We introduce a data distribution scheme for $\mathcal{H}$-matrices and a distributed-memory algorithm for $\mathcal{H}$-matrix-vector multiplication. Our data distribution scheme avoids an expensive $Ω(P^2)$ scheduling procedure used in previous work, where $P$ is the number of processes, while data balancing is well-preserved. Based on the data distribution, our distributed-memory algorithm evenly distributes all computations among $P$ processes and adopts a novel tree-communication algorithm to reduce the latency cost. The overall complexity of our algorithm is $O\Big(\frac{N \log N}{P} + α\log P + β\log^2 P \Big)$ for $\mathcal{H}$-matrices under weak admissibility condition, where $N$ is the matrix size, $α$ denotes the latency, and $β$ denotes the inverse bandwidth. Numerically, our algorithm is applied to address both two- and three-dimensional problems of various sizes among various numbers of processes. On thousands of processes, good parallel efficiency is still observed.

下载PDF全文

下载文献需遵守相关版权规定

论文标题