SPIDER-WEB generates coding algorithms with superior error tolerance and real-time information retrieval capacity

Title

SPIDER-WEB generates coding algorithms with superior error tolerance and real-time information retrieval capacity

Authors

Zhang, Haoling; Lan, Zhaojun; Zhang, Wenwei; Xu, Xun; Ping, Zhi; Zhang, Yiwei; Shen, Yue

Abstract

DNA is considered a promising medium for storing digital information. As an essential step in the DNA-based data storage workflow, coding algorithms are responsible for functions including bit-to-base transcoding and error correction. In previous studies, these functions were normally realized by combining multiple algorithms. Here, we report a graph-based architecture, named SPIDER-WEB, that provides an all-in-one coding solution by automatically generating customized algorithms. SPIDER-WEB can correct up to 4% edit errors in DNA sequences, including substitutions and insertions/deletions (indels), with only 5.5% redundant symbols. Because the correcting and decoding processes require no DNA sequence pretreatment, SPIDER-WEB offers real-time information retrieval, which is 305.08 times faster than the speed of single-molecule sequencing techniques. Our retrieval process is 2 orders of magnitude faster than the conventional one for megabyte-level data and can scale to exabyte-level data. Therefore, SPIDER-WEB has the potential to improve the practicality of large-scale data storage applications.
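
For readers unfamiliar with the term, bit-to-base transcoding can be illustrated with a minimal sketch. The fixed mapping below (2 bits per base, in the order A, C, G, T) is an assumption chosen for simplicity; it is not the graph-based algorithm that SPIDER-WEB generates, and it provides no error correction or biochemical constraints. The function names are hypothetical.

```python
# Illustrative only: a fixed 2-bits-per-base mapping (assumed here),
# NOT one of the algorithms generated by SPIDER-WEB.
BASES = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_bases(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def bases_to_bytes(seq: str) -> bytes:
    """Decode a base string produced by bytes_to_bases back into bytes."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            # str.index raises ValueError on any symbol outside ACGT
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
```

With this naive mapping, a single substituted base corrupts one byte and a single indel shifts every following byte, which is the kind of edit error the abstract says SPIDER-WEB is designed to correct.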
