论文标题
用于压缩文本索引的新系列字符串转换
A New Class of String Transformations for Compressed Text Indexing
论文作者
论文摘要
Burrows-wheeler Transform(BWT)大约30年前在数据压缩领域中引入了弦变换,除了成为无内存压缩机的性能的助推器外,它在设计有效的自我索引压缩数据结构的设计中起着至关重要的作用。对于许多研究人员来说,找到具有BWT相同特性的其他弦变换是长期以来的挑战。在已知的BWT变体中,最近唯一被证明是BWT的有效替代品的是交替的BWT(ABWT),这是大约十年前与Lyndon单词的概括有关的另一种可逆字符串变换。在本文中,我们介绍了一系列新的字符串转换,称为基于本地订购的转换,这些转换具有BWT的所有无数美德。我们表明,这个新家庭是基于上下文自适应字母顺序的更大转换类别的特殊情况,其中包括BWT和ABWT。尽管所有转换都支持模式搜索,但我们表明,在一般情况下,我们较大的类中的转换可能需要二次倒置和模式搜索的时间。进一步的结果,我们表明,基于本地订购的转换可用于构建最近引入的R-Index,这也适合于高度重复的收集。在这种情况下,我们考虑了找到给定字符串的问题的问题,即最小化转换字符串中运行次数的BWT变体,并且我们提供了一种算法,可以在线性时间内解决此问题。
Introduced about thirty years ago in the field of Data Compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides being a booster of the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties of BWT has been a challenge for many researchers for a long time. Among the known BWT variants, the only one that has been recently shown to be a valid alternative to BWT is the Alternating BWT (ABWT), another invertible string transformation introduced about ten years ago in connection with a generalization of Lyndon words. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the myriad virtues of BWT. We show that this new family is a special case of a much larger class of transformations, based on context adaptive alphabet orderings, that includes BWT and ABWT. Although all transformations support pattern search, we show that, in the general case, the transformations within our larger class may take quadratic time for inversion and pattern search. As a further result, we show that the local orderings-based transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections. In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string, and we provide an algorithm solving this problem in linear time.