Paper title

Counting unique molecular identifiers in sequencing using a multitype branching process with immigration

Authors

Serik Sagitov, Anders Ståhlberg

Abstract

Detection of extremely rare variant alleles, such as tumour DNA, within a complex mixture of DNA molecules is experimentally challenging due to sequencing errors. Barcoding of target DNA molecules in library construction for next-generation sequencing provides a way to identify and bioinformatically remove polymerase-induced errors. During the barcoding procedure, which involves $t$ consecutive PCR cycles, the DNA molecules become barcoded by unique molecular identifiers (UMIs). Different library construction protocols utilise different values of $t$. The effects of a larger $t$ and of imperfect PCR amplification are poorly described. This paper proposes a branching process with growing immigration as a model describing the random outcome of $t$ cycles of PCR barcoding. Our model discriminates between five different amplification rates $r_1$, $r_2$, $r_3$, $r_4$, $r$ for different types of molecules associated with the PCR barcoding procedure. We study this model by focussing on $C_t$, the number of clusters of molecules sharing the same UMI, as well as $C_t(m)$, the number of UMI clusters of size $m$. Our main finding is a remarkable asymptotic pattern valid for moderately large $t$. It turns out that $E(C_t(m))/E(C_t)\approx 2^{-m}$ for $m=1,2,\ldots$, regardless of the underlying parameters $(r_1,r_2,r_3,r_4,r)$. Knowledge of the quantities $C_t$ and $C_t(m)$ as functions of the experimental parameters $t$ and $(r_1,r_2,r_3,r_4,r)$ will help users draw more adequate conclusions from the outcomes of different sequencing protocols.
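
As a quick reading aid for the pattern $E(C_t(m))/E(C_t)\approx 2^{-m}$, the two sums below show what it would imply if the approximation is taken at face value for every $m\ge 1$. That is an assumption made here for illustration; the abstract states the pattern only for moderately large $t$.

$$\sum_{m=1}^{\infty} 2^{-m} = 1, \qquad \sum_{m=1}^{\infty} m\,2^{-m} = 2.$$

Under that assumption, roughly half of all UMI clusters would be singletons, a quarter would have size 2, an eighth size 3, and so on, and the ratio of the expected number of barcoded molecules to the expected number of clusters would be about 2.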
