Paper title
Knowledge distillation for fast and accurate DNA sequence correction
Paper authors
Paper abstract
Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models. Here, we introduce Distilled DeepConsensus, a distilled transformer-encoder model for sequence correction that improves upon the HMM-based methods with runtime constraints in mind. Distilled DeepConsensus is 1.3x faster and 1.5x smaller than its larger counterpart while improving the yield of high-quality reads (Q30) over the HMM-based method by 1.69x (vs. 1.73x for the larger model). With improved accuracy of genomic sequences, Distilled DeepConsensus improves downstream applications of genomic sequence analysis, such as reducing variant calling errors by 39% (34% for the larger model) and improving genome assembly quality by 3.8% (4.2% for the larger model). We show that the representations learned by Distilled DeepConsensus are similar between faster and slower models.
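As background for the Q30 threshold mentioned in the abstract: under the standard Phred convention, a quality score Q corresponds to an error probability of 10^(-Q/10), so Q30 means roughly one error per 1,000 bases (99.9% accuracy). The sketch below illustrates that conversion and how a yield ratio of the kind reported above could be computed from per-read error rates. The function names and the example error rates are hypothetical and chosen only for illustration; they are not taken from the paper or from the DeepConsensus code.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Convert a Phred-scaled quality score back to an error probability."""
    return 10.0 ** (-q / 10.0)


def count_reads_at_or_above(read_error_rates, threshold_q: float = 30.0) -> int:
    """Count reads whose Phred quality meets or exceeds the threshold."""
    return sum(1 for p in read_error_rates if phred_from_error(p) >= threshold_q)


# Hypothetical per-read error rates for the same four reads under two methods.
baseline_errors = [0.01, 0.002, 0.0005, 0.0002]    # illustrative only
corrected_errors = [0.004, 0.0008, 0.0003, 0.0001]  # illustrative only

baseline_q30 = count_reads_at_or_above(baseline_errors)
corrected_q30 = count_reads_at_or_above(corrected_errors)

# Yield ratio: how many more Q30 reads the second method produces.
yield_ratio = corrected_q30 / baseline_q30
print(f"Q30 reads: {baseline_q30} -> {corrected_q30} (ratio {yield_ratio:.2f}x)")
```

In this toy data the ratio simply reflects the ratio of read counts passing the threshold; the paper's reported figures come from real sequencing data and its own evaluation pipeline.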