ProGen: Language Modeling for Protein Generation

Paper Title

ProGen: Language Modeling for Protein Generation

Authors

Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher

Abstract

Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly structural annotations. We train a 1.2B-parameter language model, ProGen, on ~280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an unprecedented range of evolutionary sequence diversity and allows it to generate with fine-grained control, as demonstrated by metrics based on primary sequence similarity, secondary structure accuracy, and conformational energy.
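As a rough illustration of what "conditioned on taxonomic and keyword tags" can mean for a language model, the sketch below prepends tag tokens to an amino-acid sequence and forms next-token training pairs. This is a minimal sketch under stated assumptions, not ProGen's actual preprocessing: the abstract does not say how the tags are encoded, and the tag strings, the `<...>` token format, the vocabulary layout, and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one letter each


def build_vocab(tags: List[str]) -> Dict[str, int]:
    """Assign integer ids: tag tokens first, then residues, then an end marker.

    The layout is an assumption made for this example only.
    """
    tokens = [f"<{t}>" for t in sorted(set(tags))] + list(AMINO_ACIDS) + ["<eos>"]
    return {tok: i for i, tok in enumerate(tokens)}


def encode(tags: List[str], sequence: str, vocab: Dict[str, int]) -> List[int]:
    """Prepend the conditioning tags to the residue tokens and append <eos>."""
    tag_ids = [vocab[f"<{t}>"] for t in tags]
    residue_ids = [vocab[aa] for aa in sequence]
    return tag_ids + residue_ids + [vocab["<eos>"]]


def next_token_pairs(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Shift by one position: the model sees ids[:-1] and predicts ids[1:]."""
    return ids[:-1], ids[1:]


# Hypothetical tags and a toy three-residue "protein".
tags = ["taxon:Homo sapiens", "keyword:kinase"]
vocab = build_vocab(tags)
ids = encode(tags, "MKV", vocab)
inputs, targets = next_token_pairs(ids)
```

Because the tags come first in the token stream, every residue prediction can depend on them. At generation time, choosing different tags would steer the sampled sequence in the same way.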
