论文标题
最短常见超串和基因组组装的计算复杂性中的相变
Phase transition in the computational complexity of the shortest common superstring and genome assembly
论文作者
论文摘要
基因组组装是通过对齐和合并短片段或读取的长期遗传序列的过程,已知是NP- hard,是最短的常见超串问题的一种版本,或者是在汉密尔顿周期的制定中。也就是说,在最坏情况下,计算时间随着问题大小而成倍增长。尽管如此,目前,高通量技术和现代算法允许生物信息学家处理数十亿读的数据集。使用统计力学中的方法,我们通过证明问题的计算复杂性中存在相变的存在来解决这个难题,并表明实际实例始终属于“简单”阶段(通过多项式时间算法可求解)。此外,我们提出了一种马尔可夫链蒙特卡洛方法,该方法在硬度方面优于共同的确定性算法。
Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the 'easy' phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.