A single-cell gene expression language model

Paper title

A single-cell gene expression language model

Authors

William Connell, Umair Khan, Michael J. Keiser

Abstract

Gene regulation is a dynamic process that connects genotype and phenotype. Given the difficulty of physically mapping mammalian gene circuitry, we require new computational methods to learn regulatory rules. Natural language is a valuable analogy to the communication of regulatory control. Machine learning systems model natural language by explicitly learning context dependencies between words. We propose a similar system applied to single-cell RNA expression profiles to learn context dependencies between genes. Our model, Exceiver, is trained across a diversity of cell types using a self-supervised task formulated for discrete count data, accounting for feature sparsity. We found agreement between the similarity profiles of latent sample representations and learned gene embeddings with respect to biological annotations. We evaluated Exceiver on a new dataset and a downstream prediction task and found that pretraining supports transfer learning. Our work provides a framework to model gene regulation on a single-cell level and transfer knowledge to downstream tasks.
