Paper Title
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Paper Authors
Abstract
Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime of key bioinformatics applications. It is particularly expensive for third-generation sequencing data due to the high computational cost of analyzing sequences between 1 Kb and 1 Mb in length. Given the quadratic overhead of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present LOGAN, the first GPU optimization of the popular X-drop alignment algorithm. Results show that our high-performance multi-GPU implementation achieves up to 181.6 GCUPS and speed-ups of up to 6.6x and 30.7x using one and six NVIDIA Tesla V100 GPUs, respectively, over the state-of-the-art software running on two IBM Power9 processors with 168 CPU threads, at equivalent accuracy. We also demonstrate a 2.3x LOGAN speed-up over ksw2, a state-of-the-art vectorized sequence-alignment algorithm used in the long-read mapper minimap2. To highlight the impact of our work on a real-world application, we couple LOGAN with BELLA, a many-to-many long-read alignment software, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6x. Finally, we adapt the Roofline model to LOGAN and demonstrate that our implementation is near-optimal on the NVIDIA Tesla V100.
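The X-drop idea mentioned in the abstract can be sketched as follows: the dynamic-programming matrix is computed anti-diagonal by anti-diagonal, any cell whose score falls more than X below the best score seen so far is pruned, and extension stops early once an entire anti-diagonal is pruned. This is only an illustrative, unoptimized CPU sketch, not LOGAN's GPU kernel; the function name and scoring parameters are assumptions.

```python
def xdrop_extend(query, target, x=10, match=1, mismatch=-1, gap=-1):
    """Illustrative X-drop seed extension (sketch, not LOGAN's CUDA code).

    Returns the best alignment score found before the X-drop cutoff
    terminates the extension.
    """
    m, n = len(query), len(target)
    scores = {(0, 0): 0}  # surviving DP cells: (i, j) -> best score ending there
    best = 0
    # Sweep anti-diagonals i + j = diag, as GPU implementations typically do,
    # since all cells on one anti-diagonal are independent of each other.
    for diag in range(1, m + n + 1):
        alive = False
        for i in range(max(0, diag - n), min(m, diag) + 1):
            j = diag - i
            s = float("-inf")
            if i > 0 and (i - 1, j) in scores:          # gap in target
                s = max(s, scores[(i - 1, j)] + gap)
            if j > 0 and (i, j - 1) in scores:          # gap in query
                s = max(s, scores[(i, j - 1)] + gap)
            if i > 0 and j > 0 and (i - 1, j - 1) in scores:
                sub = match if query[i - 1] == target[j - 1] else mismatch
                s = max(s, scores[(i - 1, j - 1)] + sub)
            # X-drop pruning: discard cells too far below the best score so far.
            if s < best - x:
                continue
            scores[(i, j)] = s
            best = max(best, s)
            alive = True
        if not alive:  # every cell on this anti-diagonal was pruned: stop early
            break
    return best
```

Because low-scoring regions are pruned, the algorithm touches only a narrow band of the quadratic DP matrix for high-quality alignments, and abandons poor candidate pairs quickly, which is what makes the heuristic attractive for many-to-many long-read overlap.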