Paper Title

Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set

Authors

Daneshjou, Roxana; Vodrahalli, Kailas; Novoa, Roberto A; Jenkins, Melissa; Liang, Weixin; Rotemberg, Veronica; Ko, Justin; Swetter, Susan M; Bailey, Elizabeth E; Gevaert, Olivier; Mukherjee, Pritam; Phung, Michelle; Yekrang, Kiana; Fong, Bradley; Sahasrabudhe, Rachna; Allerup, Johan A. C.; Okata-Karigane, Utako; Zou, James; Chiou, Albert

Abstract

Access to dermatological care is a major issue, with an estimated 3 billion people lacking access to care globally. Artificial intelligence (AI) may aid in triaging skin diseases. However, most AI models have not been rigorously assessed on images of diverse skin tones or uncommon diseases. To ascertain potential biases in algorithm performance in this context, we curated the Diverse Dermatology Images (DDI) dataset, the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones. Using this dataset of 656 images, we show that state-of-the-art dermatology AI models perform substantially worse on DDI, with the area under the receiver operating characteristic curve (ROC-AUC) dropping by 27 to 36 percent compared to the models' original test results. All the models performed worse on dark skin tones and uncommon diseases, which are represented in the DDI dataset. Additionally, we find that dermatologists, who typically provide visual labels for AI training and test datasets, also perform worse on images of dark skin tones and uncommon diseases compared to ground truth biopsy annotations. Finally, fine-tuning AI models on the well-characterized and diverse DDI images closed the performance gap between light and dark skin tones. Moreover, algorithms fine-tuned on diverse skin tones outperformed dermatologists at identifying malignancy on images of dark skin tones. Our findings identify important weaknesses and biases in dermatology AI that need to be addressed to ensure reliable application to diverse patients and diseases.
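
The abstract compares ROC-AUC across skin-tone groups. As a rough illustration of that kind of subgroup comparison (not the authors' evaluation code), the sketch below computes ROC-AUC per group in plain Python. The function names `roc_auc` and `auc_by_group`, the group labels `"light"` and `"dark"`, and the toy records are all hypothetical and are not taken from the DDI dataset.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; tied scores count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(records):
    """Group (group, label, score) records by group and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in grouped.items()}


# Hypothetical toy data: (skin-tone group, malignant=1 / benign=0, model score).
records = [
    ("light", 1, 0.9), ("light", 0, 0.2), ("light", 1, 0.7), ("light", 0, 0.4),
    ("dark", 1, 0.6), ("dark", 0, 0.7), ("dark", 1, 0.8), ("dark", 0, 0.3),
]
per_group = auc_by_group(records)
print(per_group)
```

The pairwise definition is quadratic in the number of images, which is fine for a dataset of a few hundred images; a rank-based implementation would be preferable at larger scale.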
