Genome Variant Calling with a Deep Averaging Network

Paper Title

Genome Variant Calling with a Deep Averaging Network

Authors

Yakovenko, Nikolai; Lal, Avantika; Israeli, Johnny; Catanzaro, Bryan

Abstract

Variant calling is the problem of estimating whether a position in a DNA sequence differs from a reference sequence, given noisy, redundant, overlapping short sequences that cover that position. It is fundamental to genomics. We propose a deep averaging network designed specifically for variant calling. Our model takes into account the independence of each short input read sequence by transforming individual reads through a series of convolutional layers, limiting the communication between individual reads to averaging and concatenating operations. Training and testing on the precisionFDA Truth Challenge (pFDA), we match the state of the art with an overall F1 score of 99.89. Genome datasets exhibit extreme skew between easy examples and those on the decision boundary. We take advantage of this property to converge models at 5x the speed of standard epoch-based training by skipping easy examples during training. To facilitate future work, we release our code, trained models, and pre-processed public domain datasets.
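
The read-independence idea in the abstract can be illustrated with a toy sketch. This is not the authors' released code: the one-hot encoding, the single hand-written kernel `KERNEL`, the function names, and the choice to concatenate the averaged read features with an encoding of the reference are all assumptions made for illustration. What it shows is only the structural point: each read is transformed on its own, and reads interact solely through an average, so the result does not depend on read order.

```python
from statistics import fmean

BASES = "ACGT"

# Hypothetical 3-wide kernel over 4 one-hot channels (illustrative values only).
KERNEL = [
    [0.5, -0.25, 0.0, 0.1],
    [1.0, 0.0, -0.5, 0.2],
    [0.0, 0.3, 0.3, -0.1],
]


def one_hot(read):
    """Encode a read as a list of 4-channel one-hot vectors."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in read]


def conv1d(x, kernel):
    """Valid 1D convolution over positions, producing one output channel."""
    width = len(kernel)
    return [
        sum(kernel[k][c] * x[i + k][c]
            for k in range(width)
            for c in range(len(BASES)))
        for i in range(len(x) - width + 1)
    ]


def encode_read(read, kernel):
    """Transform a single read independently of all others (conv + ReLU)."""
    return [max(0.0, v) for v in conv1d(one_hot(read), kernel)]


def average_reads(reads, kernel):
    """The only cross-read operation: a position-wise mean of read encodings."""
    encoded = [encode_read(r, kernel) for r in reads]
    return [fmean(column) for column in zip(*encoded)]


def pileup_features(reads, reference, kernel):
    """Concatenate averaged read features with the reference encoding (assumed)."""
    return average_reads(reads, kernel) + encode_read(reference, kernel)


if __name__ == "__main__":
    reads = ["ACGTA", "ACCTA", "ACGTA"]
    print(pileup_features(reads, "ACGTA", KERNEL))
```

Because the mean is symmetric in its inputs, shuffling `reads` leaves the feature vector unchanged. A real model would stack several learned convolutional layers and feed the pooled features to a classifier.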
