Paper title
On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets
Paper authors
Paper abstract
Machine learning based methods for diagnosis and progression prediction of COVID-19 from imaging data have gained significant attention in recent months, in particular through the use of deep learning models. In this context, hundreds of models were proposed, the majority of them trained on public datasets. Data scarcity, mismatch between training and target population, group imbalance, and lack of documentation are important sources of bias, hindering the applicability of these models to real-world clinical practice. Considering that datasets are an essential part of model building and evaluation, a deeper understanding of the current landscape is needed. This paper presents an overview of the currently publicly available COVID-19 chest X-ray datasets. Each dataset is briefly described, and potential strengths, limitations, and interactions between datasets are identified. In particular, some key properties of current datasets that could be potential sources of bias, impairing models trained on them, are pointed out. These descriptions are useful for building models on those datasets: they help in choosing the best dataset according to the model goal, in taking specific limitations into account to avoid reporting overconfident benchmark results, and in discussing their impact on generalisation capabilities in a specific clinical setting.