Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning

Paper Title

Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning

Authors

Zhan, Xianghao; Wang, Fanjin; Gevaert, Olivier

Abstract

Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage the liver. Life-threatening outcomes, including liver failure and death, have been reported in severe DILI cases. DILI-related events are therefore strictly monitored for all approved drugs, and liver toxicity has become an important assessment for new drug candidates. DILI-related reports are documented in hospital records, in clinical trial results, and in research papers that contain preliminary in vitro and in vivo experiments. Conventionally, extracting data from previous publications relies heavily on resource-demanding manual labelling, which considerably decreases the efficiency of the information extraction process. Recent developments in artificial intelligence, particularly the rise of natural language processing (NLP) techniques, have enabled the automatic processing of biomedical texts. In this study, based on around 28,000 papers (titles and abstracts) provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, we benchmarked model performance on filtering DILI literature. Among four word vectorization techniques, the model using term frequency-inverse document frequency (TF-IDF) and logistic regression outperformed the others, with an accuracy of 0.957 on our in-house test set. Furthermore, an ensemble model with similar overall performance was implemented and fine-tuned to reduce false negatives, so that potential DILI reports are not overlooked. The ensemble model achieved an accuracy of 0.954 and an F1 score of 0.955 on the hold-out validation data provided by the CAMDA committee. Moreover, important words in positive and negative predictions were identified via model interpretation. Overall, the ensemble model reached satisfactory classification results and can be used by researchers to rapidly filter DILI-related literature.
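
To make the TF-IDF plus logistic regression idea concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the six-document corpus is invented for illustration, all function names (`fit_idf`, `tfidf`, `train_logreg`, `predict_proba`, `false_negatives`) are hypothetical, and the smoothed IDF formula and training settings are assumptions. The last helper illustrates one common way to trade precision for fewer false negatives, namely lowering the decision threshold; the abstract does not say whether the ensemble was tuned this way.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for each term in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(vec, weights, bias):
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))

def train_logreg(vectors, labels, epochs=300, lr=0.5):
    """Stochastic gradient descent on the logistic loss (assumed settings)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = score(vec, weights, bias) - y
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias

def predict_proba(doc, idf, weights, bias):
    return score(tfidf(doc, idf), weights, bias)

def false_negatives(probs, labels, threshold):
    """Count positive documents whose probability falls below the threshold."""
    return sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)

# Invented toy corpus: 1 = DILI-related, 0 = unrelated.
train_docs = [
    "acute liver injury after amoxicillin therapy",
    "hepatotoxicity and elevated transaminases induced by isoniazid",
    "drug induced liver failure following acetaminophen overdose",
    "gene expression profiling of breast tumor cells",
    "deep learning for retinal image segmentation",
    "randomized trial of exercise in heart failure patients",
]
labels = [1, 1, 1, 0, 0, 0]

idf = fit_idf(train_docs)
weights, bias = train_logreg([tfidf(d, idf) for d in train_docs], labels)

p_dili = predict_proba("liver injury induced by a new drug", idf, weights, bias)
p_other = predict_proba("image segmentation of tumor cells", idf, weights, bias)
train_probs = [predict_proba(d, idf, weights, bias) for d in train_docs]
```

Here `p_dili` should exceed 0.5 and `p_other` should fall below it, because the query terms occur only in positive or only in negative training documents. Calling `false_negatives(train_probs, labels, 0.3)` instead of using a threshold of 0.5 can never increase the count of missed positives, at the cost of admitting more false positives. In practice, a library such as scikit-learn (`TfidfVectorizer`, `LogisticRegression`) would normally replace the hand-written parts.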
