Counting unique molecular identifiers in sequencing using a multitype branching process with immigration

Paper title

Counting unique molecular identifiers in sequencing using a multitype branching process with immigration

Authors

Serik Sagitov, Anders Ståhlberg

Abstract

Detection of extremely rare variant alleles, such as tumour DNA, within a complex mixture of DNA molecules is experimentally challenging due to sequencing errors. Barcoding of target DNA molecules in library construction for next-generation sequencing provides a way to identify and bioinformatically remove polymerase induced errors. During the barcoding procedure involving $t$ consecutive PCR cycles, the DNA molecules become barcoded by unique molecular identifiers (UMI). Different library construction protocols utilise different values of $t$. The effect of a larger $t$ and imperfect PCR amplifications is poorly described. This paper proposes a branching process with growing immigration as a model describing the random outcome of $t$ cycles of PCR barcoding. Our model discriminates between five different amplification rates $r_1$, $r_2$, $r_3$, $r_4$, $r$ for different types of molecules associated with the PCR barcoding procedure. We study this model by focussing on $C_t$, the number of clusters of molecules sharing the same UMI, as well as $C_t(m)$, the number of UMI clusters of size $m$. Our main finding is a remarkable asymptotic pattern valid for moderately large $t$. It turns out that $E(C_t(m))/E(C_t)\approx 2^{-m}$ for $m=1,2,\ldots$, regardless of the underlying parameters $(r_1,r_2,r_3,r_4,r)$. The knowledge of the quantities $C_t$ and $C_t(m)$ as functions of the experimental parameters $t$ and $(r_1,r_2,r_3,r_4,r)$ will help the users to draw more adequate conclusions from the outcomes of different sequencing protocols.
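
The asymptotic pattern $E(C_t(m))/E(C_t)\approx 2^{-m}$ stated in the abstract can be made concrete with a few numbers. The snippet below is only an illustrative sketch: it treats the approximation as exact, and the function name `expected_size_fractions` and the cutoff of five cluster sizes are assumptions chosen for the example, not part of the paper.

```python
from fractions import Fraction


def expected_size_fractions(max_size):
    """Return {m: 2**-m} for m = 1..max_size.

    Each value is the approximate share of UMI clusters that have size m,
    taking the abstract's ratio E(C_t(m)) / E(C_t) ~ 2**-m at face value.
    """
    return {m: Fraction(1, 2**m) for m in range(1, max_size + 1)}


shares = expected_size_fractions(5)
covered = sum(shares.values())  # share of clusters with size at most 5

for m, share in shares.items():
    print(f"cluster size {m}: {float(share):.5f}")
print(f"sizes 1..5 together: {float(covered):.5f}")
```

Under this reading, about half of all UMI clusters are expected to contain a single molecule and about a quarter to contain two. Because $\sum_{m\ge 1} 2^{-m} = 1$, the shares form a geometric distribution with parameter $1/2$.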
