论文标题
弹性排级弦匹配与1个错误
Elastic-Degenerate String Matching with 1 Error
论文作者
论文摘要
弹性排级字符串是$ n $有限的总长度$ n $的序列,用于代表一组相关的DNA序列,也称为pangenome。 ED字符串匹配(EDSM)问题包括报告ED文本中长度$ M $的所有发生。该问题最近受到了组合模式匹配社区的关注,最终以$ \ tilde {\ Mathcal {o}}(nm^{ω-1})+\ Mathcal {o} {o}(o}(n)$ - 时间算法[Bernardini等人[Bernardini等人,SIAM J.计算。 2022],其中$ω$表示矩阵乘法指数,而$ \ tilde {\ Mathcal {o}}}(\ cdot)$表示法抑制了polygog因子。在$ K $ -EDSM问题(EDSM的大约版本)中,我们被要求报告所有模式出现,最多$ k $错误。 $ k $ -EDSM可以用$ \ Mathcal {o}(k^2mg+kn)$时间,在编辑距离下,或$ \ Mathcal {o}(kmg+kn)$ time,hamming距离,$ g $,其中$ g $表示ED文本中的条纹总数[Bernardinii等。计算。科学。 2020]。不幸的是,$ g $仅由$ n $界定,因此,即使对于$ k = 1 $,现有算法在最坏的情况下以$ω(MN)$时间运行。在本文中,我们表明$ 1 $ -EDSM可以在$ \ Mathcal {o}(((nm^2 + n)\ log m)$或$ \ mathcal {o}(nm^3 + n)$下的编辑距离下。对于决策版本,我们提出了一个更快的$ \ Mathcal {o}(NM^2 \ sqrt {\ log M} + n \ log \ log \ log m)$ - 时间算法。我们还表明,$ 1 $ -EDSM可以用$ \ Mathcal {o}(NM^2 + N \ log M)$时间在Hamming距离下。我们用于编辑距离的算法依赖于从$ 1 $ -EDSM降低到经典计算几何问题的特殊实例(2D Rectangle刺或2D范围空虚),我们显示如何有效地解决。为了获得更快的算法,我们依靠和适应$ k $ -errata树来用错误索引[Cole等,Stoc,2004]。
An elastic-degenerate string is a sequence of $n$ finite sets of strings of total length $N$, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length $m$ in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{\mathcal{O}}(nm^{ω-1})+\mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $ω$ denotes the matrix multiplication exponent and the $\tilde{\mathcal{O}}(\cdot)$ notation suppresses polylog factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be solved in $\mathcal{O}(k^2mG+kN)$ time, under edit distance, or $\mathcal{O}(kmG+kN)$ time, under Hamming distance, where $G$ denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, $G$ is only bounded by $N$, and so even for $k=1$, the existing algorithms run in $Ω(mN)$ time in the worst case. In this paper we show that $1$-EDSM can be solved in $\mathcal{O}((nm^2 + N)\log m)$ or $\mathcal{O}(nm^3 + N)$ time under edit distance. For the decision version, we present a faster $\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)$-time algorithm. We also show that $1$-EDSM can be solved in $\mathcal{O}(nm^2 + N\log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from $1$-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the $k$-errata trees for indexing with errors [Cole et al., STOC 2004].