Paper Title
ECRECer: Enzyme Commission Number Recommendation and Benchmarking based on Multiagent Dual-core Learning
Authors
Abstract
Enzyme Commission (EC) numbers, which associate a protein sequence with the biochemical reactions it catalyzes, are essential for the accurate understanding of enzyme functions and cellular metabolism. Many ab initio computational approaches have been proposed to predict EC numbers directly from a given input sequence. However, the prediction performance (accuracy, recall, precision), usability, and efficiency of existing methods still leave much room for improvement. Here, we report ECRECer, a cloud platform for accurately predicting EC numbers based on novel deep learning techniques. To build ECRECer, we evaluate different protein representation methods and adopt a protein language model for protein sequence embedding. After embedding, we propose a multi-agent hierarchy deep learning-based framework to learn the proposed tasks in a multi-task manner. Specifically, we use an extreme multi-label classifier to perform the EC prediction and employ a greedy strategy to integrate and fine-tune the final model. Comparative analyses against four representative methods demonstrate that ECRECer delivers the highest performance, improving accuracy and F1 score by 70% and 20%, respectively, over the state of the art. With ECRECer, we can annotate numerous enzymes in the Swiss-Prot database that have incomplete EC numbers to their full fourth level. For example, ECRECer annotated the UniProt protein "A0A0U5GJ41" (incompletely annotated as 1.14.-.-) with "1.14.11.38", which is supported by further protein structure analysis based on AlphaFold2. Finally, we established a web server (https://ecrecer.biodesign.ac.cn) and provided an offline bundle to improve usability.
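To make the notion of an "incomplete" EC number concrete, the following is a minimal Python sketch of the four-level EC notation used in the example above, where "-" marks a level that has not been assigned. It is not part of ECRECer; the function names `parse_ec`, `resolved_depth`, and `is_completion_of` are hypothetical helpers introduced here for illustration only.

```python
from typing import List, Optional


def parse_ec(ec: str) -> List[Optional[str]]:
    """Split an EC number into its four levels; '-' becomes None (unassigned)."""
    parts = [part.strip() for part in ec.strip().split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {ec!r}")
    return [None if part == "-" else part for part in parts]


def resolved_depth(ec: str) -> int:
    """Count how many leading levels are assigned (0 to 4)."""
    depth = 0
    for level in parse_ec(ec):
        if level is None:
            break
        depth += 1
    return depth


def is_completion_of(full: str, partial: str) -> bool:
    """True if `full` has all four levels assigned and agrees with
    every level that `partial` already assigns."""
    full_levels = parse_ec(full)
    partial_levels = parse_ec(partial)
    if None in full_levels:
        return False
    return all(p is None or p == f for p, f in zip(partial_levels, full_levels))


if __name__ == "__main__":
    # The example from the abstract: 1.14.-.- refined to 1.14.11.38.
    print(resolved_depth("1.14.-.-"))
    print(is_completion_of("1.14.11.38", "1.14.-.-"))
```

Under these assumptions, `resolved_depth("1.14.-.-")` is 2, and "1.14.11.38" counts as a valid fourth-level completion because it keeps the first two levels unchanged. Levels are kept as strings rather than integers so that non-numeric serials are not rejected.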