论文标题

从头基因组组件的平行弦图构造和及时减少

Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

论文作者

Guidi, Giulia, Selvitopi, Oguz, Ellis, Marquita, Oliker, Leonid, Yelick, Katherine, Buluc, Aydin

论文摘要

计算生物学中最重要的密集任务之一是从头基因组组装,这是从冗余和错误的短序列中解码未知基因组的序列。一个通用的组装范式标识重叠序列,简化其布局并创建共识。尽管文献中开发了许多算法,但大基因组的有效组装仍然是一个开放的问题。在这项工作中,我们介绍了新的分布式内存并行算法,用于重叠检测和从头基因组组装的布局简化步骤,并在Dibella 2D管道中实现它们。我们用于重叠检测和布局简化的分布式存储算法基于使用2D分布式稀疏矩阵在半度性上的线性代数操作。我们的布局步骤包括执行从重叠图到字符串图的传递降低。我们对新算法的主要阶段提供了详细的沟通分析。 Dibella 2d在人类基因组中以超过80%的平行效率达到线性缩放,与人类基因组的重叠检测时间降低了1.2-1.3倍,而秀丽隐杆线虫则与最先进的ART相比。我们的传递还原算法的表现优于人类基因组的现有分布式内存实现,而对于秀丽隐杆线虫的基因组为18-29x。我们的工作为从分布式内存中的长读数铺平了大型基因组的从头组装铺平道路。

One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3x for the human genome and 1.5-1.9x for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3x for the human genome and 18-29x for the C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源