最短常见超串和基因组组装的计算复杂性中的相变

论文标题

最短常见超串和基因组组装的计算复杂性中的相变

Phase transition in the computational complexity of the shortest common superstring and genome assembly

论文作者

Fernandez, L. A., Martin-Mayor, V., Yllanes, D.

论文摘要

基因组组装是通过对齐和合并短片段或读取的长期遗传序列的过程，已知是NP- hard，是最短的常见超串问题的一种版本，或者是在汉密尔顿周期的制定中。也就是说，在最坏情况下，计算时间随着问题大小而成倍增长。尽管如此，目前，高通量技术和现代算法允许生物信息学家处理数十亿读的数据集。使用统计力学中的方法，我们通过证明问题的计算复杂性中存在相变的存在来解决这个难题，并表明实际实例始终属于“简单”阶段（通过多项式时间算法可求解）。此外，我们提出了一种马尔可夫链蒙特卡洛方法，该方法在硬度方面优于共同的确定性算法。

Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the 'easy' phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.

下载PDF全文

下载文献需遵守相关版权规定

论文标题