Paper title

Recommendations on test datasets for evaluating AI solutions in pathology

Authors

Homeyer, André; Geißler, Christian; Schwen, Lars Ole; Zakrzewski, Falk; Evans, Theodore; Strohmenger, Klaus; Westphal, Max; Bülow, Roman David; Kargl, Michaela; Karjauv, Aray; Munné-Bertran, Isidre; Retzlaff, Carl Orge; Romero-López, Adrià; Sołtysiński, Tomasz; Plass, Markus; Carvalho, Rita; Steinbach, Peter; Lan, Yu-Chia; Bouteldja, Nassim; Haber, David; Rojas-Carulla, Mateo; Sadr, Alireza Vafaei; Kraft, Matthias; Krüger, Daniel; Fick, Rutger; Lang, Tobias; Boor, Peter; Müller, Heimo; Hufnagl, Peter; Zerbe, Norman

Abstract

Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations for the collection of test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help regulatory agencies and end users verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.
