论文标题
使用Graph Wavefront算法的快速序列到图形对齐
Fast sequence to graph alignment using the graph wavefront algorithm
论文作者
论文摘要
动机:泛基因组图表示基因组的集合,并编码它们之间的序列变化。它是研究多个类似基因组的强大数据结构。序列到图形比对是泛基因组图的构建和分析的重要步骤。但是,现有的算法与序列长度和图形大小的乘积成正比产生了运行时,从而使它们无法对齐长序列与大图。结果:我们提出了Graph WavePront Arignment算法(GWFA),这是将序列与序列图对齐的新方法。尽管GWFA最糟糕的时间复杂性与现有算法相同,但它旨在更快地运行以与匹配的序列紧密相匹配,而实践中的运行时间通常仅随着最佳对齐方式的编辑距离而适度增加。在四个真实数据集上,GWFA的数量级最多要比其他精确的序列到格言比对算法快四个数量级。我们还建议在GWFA之上修剪启发式启发式,该图可以在大图上获得额外的$ \ sim $ 10倍。可用性:GWFA代码可在https://github.com/lh3/gwfa上访问。
Motivation: A pan-genome graph represents a collection of genomes and encodes sequence variations between them. It is a powerful data structure for studying multiple similar genomes. Sequence-to-graph alignment is an essential step for the construction and the analysis of pan-genome graphs. However, existing algorithms incur runtime proportional to the product of sequence length and graph size, making them inefficient for aligning long sequences against large graphs. Results: We propose the graph wavefront alignment algorithm (Gwfa), a new method for aligning a sequence to a sequence graph. Although the worst-case time complexity of Gwfa is the same as the existing algorithms, it is designed to run faster for closely matching sequences, and its runtime in practice often increases only moderately with the edit distance of the optimal alignment. On four real datasets, Gwfa is up to four orders of magnitude faster than other exact sequence-to-graph alignment algorithms. We also propose a graph pruning heuristic on top of Gwfa, which can achieve an additional $\sim$10-fold speedup on large graphs. Availability: Gwfa code is accessible at https://github.com/lh3/gwfa.