Paper title
Compound virtual screening by learning-to-rank with gradient boosting decision tree and enrichment-based cumulative gain
Paper authors
Abstract
Learning-to-rank, a machine learning technique widely used in information retrieval, has recently been applied to the problem of ligand-based virtual screening, to accelerate the early stages of new drug development. Ranking prediction models learn based on ordinal relationships, making them suitable for integrating assay data from various environments. Existing studies of rank prediction in compound screening have generally used a learning-to-rank method called RankSVM. However, they have not been compared with or validated against the gradient boosting decision tree (GBDT)-based learning-to-rank methods that have gained popularity recently. Furthermore, although the ranking metric called Normalized Discounted Cumulative Gain (NDCG) is widely used in information retrieval, it only determines whether the predictions are better than those of other models. In other words, NDCG is incapable of recognizing when a prediction model produces worse than random results. Nevertheless, NDCG is still used in the performance evaluation of compound screening using learning-to-rank. This study used the GBDT model with ranking loss functions, called lambdarank and lambdaloss, for ligand-based virtual screening; results were compared with existing RankSVM methods and GBDT models using regression. We also proposed a new ranking metric, Normalized Enrichment Discounted Cumulative Gain (NEDCG), which aims to properly evaluate the goodness of ranking predictions. Results showed that the GBDT model with learning-to-rank outperformed existing regression methods using GBDT and RankSVM on diverse datasets. Moreover, NEDCG showed that predictions by regression were comparable to random predictions in multi-assay, multi-family datasets, demonstrating its usefulness for a more direct assessment of compound screening performance.
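The abstract's point about NDCG can be illustrated with a small sketch. The code below computes the standard NDCG for a ranked list and estimates the average NDCG of random orderings. It assumes a linear gain and a log2 position discount, which is one common convention and not necessarily the one used in the paper; the activity values are hypothetical, and the sketch does not implement NEDCG, whose definition is not given in the abstract.

```python
import math
import random


def dcg(relevances):
    """Discounted cumulative gain of relevance values listed in ranked order.

    Assumption: linear gain and a log2(position + 1) discount.
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(true_relevance, predicted_scores):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    order = sorted(
        range(len(true_relevance)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked = [true_relevance[i] for i in order]
    ideal_dcg = dcg(sorted(true_relevance, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def random_baseline_ndcg(true_relevance, trials=2000, seed=0):
    """Monte Carlo estimate of the mean NDCG obtained from random scores."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        scores = [rng.random() for _ in true_relevance]
        total += ndcg(true_relevance, scores)
    return total / trials


# Hypothetical activity values for eight compounds (higher means more active).
activity = [3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

perfect = ndcg(activity, activity)                  # ideal ordering
reversed_ = ndcg(activity, [-a for a in activity])  # actives ranked last
baseline = random_baseline_ndcg(activity)           # average over random orderings

print(f"perfect={perfect:.3f} reversed={reversed_:.3f} random={baseline:.3f}")
```

In this toy setting, even the ordering that places every active compound last receives an NDCG well above zero, and it falls below the random baseline. An NDCG value read in isolation therefore does not show whether a model is better or worse than random, which is the gap the proposed NEDCG is meant to address.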