NLP-based classification of software tools for metagenomics sequencing data analysis into EDAM semantic annotation

Paper Title

NLP-based classification of software tools for metagenomics sequencing data analysis into EDAM semantic annotation

Authors

Kaoutar Daoud Hiri, Matjaž Hren, Tomaž Curk

Abstract

Motivation: The rapid growth of metagenomics sequencing data makes metagenomics increasingly dependent on computational and statistical methods for fast and efficient analysis. Consequently, novel analysis tools for big-data metagenomics are constantly emerging. One of the biggest challenges for researchers occurs in the analysis planning stage: selecting the most suitable metagenomics software tool to gain valuable insights from sequencing data. Building data analysis pipelines is often laborious and time-consuming since it requires a deep and critical understanding of how to apply a particular tool to complete a specified metagenomics task.

Results: We have addressed this challenge by using machine learning methods to develop a classification system of metagenomics software tools into 13 classes (11 semantic annotations of EDAM and two virus-specific classes) based on the descriptions of the tools. We trained three classifiers (Naive Bayes, Logistic Regression, and Random Forest) using 15 text feature extraction techniques (TF-IDF, GloVe, BERT-based models, and others). The manually curated dataset includes 224 software tools and contains text from the abstract and the methods section of the tools' publications. The best classification performance, with an Area Under the Precision-Recall Curve score of 0.85, is achieved using Logistic Regression, BioBERT for text embedding, and text from abstracts only. The proposed system provides accurate and unified identification of metagenomics data analysis tools and tasks, which is a crucial step in the construction of metagenomics data analysis pipelines.
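
The abstract reports performance as an Area Under the Precision-Recall Curve score but does not say how that area was estimated. As an illustration only, the sketch below computes average precision, one common step-wise estimate of that area, for a single class. The function name `average_precision` and the toy labels and scores are hypothetical and are not taken from the paper.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 0/1 ground-truth values for one class (1 = tool belongs to the class).
    scores: classifier scores for the same items (higher = more likely positive).
    """
    # Rank items from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this rank, weighted by the recall gained
            # from this positive (1 / total_positives).
            area += (true_positives / rank) / total_positives
    return area


# Hypothetical scores for five tools and one class (not data from the paper).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(average_precision(toy_labels, toy_scores), 3))
```

For a multi-class problem such as the 13 classes described above, a per-class value like this would typically be averaged across classes; whether the authors averaged this way is not stated in the abstract.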

Paper Title