Characteristics of publicly available skin cancer image datasets: a systematic review

Dr David Wen

Freely available (open access) datasets containing skin images are frequently used to develop deep learning algorithms for skin cancer diagnosis. As these algorithms are heavily influenced by the images that they are trained on, it is important that the composition and characteristics of datasets are outlined, such as which populations images are taken from. This information is often detailed in metadata accompanying individual datasets.


Prior to this review, the total number of publicly available datasets globally and their respective content had not been explored. We searched for open access skin image datasets used to develop machine learning algorithms for skin cancer diagnosis and systematically evaluated their characteristics, including associated metadata.


We identified 21 open access datasets containing 106,950 freely available images. With regard to their general characteristics:


Regarding metadata reporting for individual images in the open access datasets: 


Our review highlighted that better reporting of dataset characteristics and metadata is needed to produce more transparent skin image datasets. Quality standards outlining what should be reported may facilitate this by providing guidance for dataset curators. Such standards could also detail what constitutes a representative dataset and who should be included, so that datasets reflect the target populations in which any developed algorithms will be deployed, translating into more effective algorithms for all groups of people.

Bar charts showing proportion of datasets reporting dataset and metadata characteristics.

Figure 3 from the paper: Characteristic and metadata reporting of open access datasets and images

Source: Wen D, Khan S, Xu A, Ibrahim H, Smith L, Caballero J et al. Characteristics of publicly available skin cancer image datasets: a systematic review. The Lancet Digital Health. 2021.