PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction

Paper Title

PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction

Authors

Liu, Sirui; Zhang, Jun; Chu, Haotian; Wang, Min; Xue, Boxin; Ni, Ningxi; Yu, Jialiang; Xie, Yuhao; Chen, Zhenyu; Chen, Mengyun; Liu, Yuan; Patra, Piya; Xu, Fan; Chen, Jie; Wang, Zidong; Yang, Lijiang; Yu, Fan; Chen, Lei; Gao, Yi Qin

Abstract

Proteins are essential components of human life, and their structures are important for function and mechanism analysis. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, the existing open-source datasets fall far short of the needs of modern protein sequence-structure research. To solve this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. The dataset consists of 570k true-structure sequences (10 TB) and 745k complementary distillation sequences (15 TB). In addition, we provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of the dataset for training by participating in the CAMEO contest, in which our model won first place. We hope that the PSP dataset, together with the training benchmark, will enable a broader community of AI and biology researchers to pursue AI-driven protein-related research.
