The Data Gap



To build artificial intelligence (AI) healthcare technologies that benefit all patients, we need diverse and representative data to be accessible.

As part of the STANDING Together project, we reviewed datasets in different disease areas (COVID-19, breast cancer and heart failure) and found that many datasets do not report individuals’ attributes, and where they are reported, there is a significant lack of diversity.

Part of the problem is that reporting the decisions made during dataset curation and use is complex: What should be included? How should decisions be described? What should be defined?

The STANDING Together recommendations provide guidance for dataset documentation to encourage transparent reporting of ‘who’ is represented, ‘how’ people are represented and how data is used when developing AI technologies for healthcare.

[Image: Dots of different skin tones, with some arrows connecting the dots, but some arrows unable to connect to a dot because there are empty spaces where a dot should be]

The Reviews

Here we provide a summary of findings from each review, with a catalogue of the datasets identified.

The Breast Cancer Dataset Review

What did we find? We were able to access 11 mammography datasets. There were over 2.5 million mammograms in total, 2 million of them from a single UK dataset. The datasets were mainly from high-income settings and there was a lack of reporting of individuals’ attributes. Only three datasets reported race/ethnicity and none provided clear definitions for these categories within their dataset documentation. Ten of the datasets were open access; for the final dataset, we needed to submit a data access request (regulated access). For this regulated access dataset, we were able to ask questions and clarify when dataset documentation was unclear, meaning we could gain a greater understanding of the data.

What does this mean? Many mammography datasets do not transparently report 'who' is represented and 'how' they are represented, meaning developers of AI technologies using these datasets are less able to understand the limitations of the data. We found value in being able to contact dataset curators to clarify our understanding of the data, but there is work to be done in making large, diverse and representative mammography datasets accessible, so that AI technologies can be developed that are safe and effective for everyone.


Click here to see the full list of datasets identified as part of this review

The COVID-19 Dataset Review

What did we find? We identified 119 COVID-19 datasets, which represented individuals from 28 different countries. Seventeen datasets included data from different geographical areas. A large proportion of the datasets identified were combined (i.e. they were formed from several smaller datasets), and eight of these were derived from a chest x-ray dataset which contained images of children. Often, this was only noted through review of the images themselves, as the dataset documentation made limited reference to paediatric data. Most of the datasets included chest x-ray images and some included CT thorax images. One dataset included brain MRI images. There was a distinct lack of reporting of individuals' attributes: only 35% of datasets reported age, 31% reported sex/gender and 8% reported race/ethnicity of individuals included in the dataset.

What does this mean? The rapid development of datasets in response to the COVID-19 pandemic meant that curators focused on quantity rather than quality of data. Documentation for COVID-19 datasets is limited, which means AI developers are less able to understand the data they are using. A lack of dataset documentation is particularly problematic where datasets have been combined: there is a risk that data quality is compromised, yet users are unlikely to know where the issues lie if they are not documented. There is a need for clear documentation about how datasets are curated and who is included within them.


Click here to see the full list of datasets identified as part of this review

The Heart Failure Dataset Review

What did we find? Within the 28 accessible datasets identified, over 1,080,000 individuals are represented from 14 different countries. However, over 1,000,000 of those individuals are accounted for by the UK and USA alone. The data available is varied, the most common types being ECG data, health record data and echocardiography. Nine datasets (32%) reported race/ethnicity and six (21%) reported socioeconomic status. No datasets reported definitions for demographic terms except for socioeconomic status, where explanations were given for how the groups were stratified. The methods for determining socioeconomic status varied greatly across datasets: some used recognised national scoring systems while others used data about employment or education status.

What does this mean? Access to data about individuals' attributes can allow AI developers to understand how their technologies might perform for different groups. However, definitions for different attributes vary depending on context: for example, race and ethnicity mean different things in different places, and socioeconomic status can be summarised in many different ways. For data about attributes to be useful, definitions must be provided so that users of the data can understand what they are assessing.


Click here to see the full list of datasets identified as part of this review

Publications and Presentations

STANDARDS REVIEW AND SURVEY ANALYSIS

The value of standards for health datasets in artificial intelligence-based applications 

The STANDING Together Working Group, Nature Medicine, October 2023.

Summary by Dr Anmol Arora

This research article describes the outcome of our standards review and survey which led to the initial drafting of the recommendations. Click here to read a summary of the paper.


The full paper can be found at: https://doi.org/10.1038/s41591-023-02608-w

CORRESPONDENCE

The STANDING Together Working Group, Nature Medicine, September 2022.

To launch the first round of the STANDING Together Delphi Study, an announcement paper was published in Nature Medicine in September 2022. 


The full paper can be found at: https://doi.org/10.1038/s41591-022-01987-w

[Image: Close-up photo of an eye]

The availability of health datasets has accelerated digital health research. Ophthalmology has been one of the leading areas of innovation, where several public datasets for ophthalmic imaging have been used in machine learning research. Datasets are a critical component of machine learning algorithm development, hence they need careful scrutiny prior to use. Prior to our review, it was unknown how many ophthalmic datasets existed, their degree of accessibility...

[Image: Close-up photo of skin]

Freely available (open access) datasets containing skin images are frequently used to develop deep learning algorithms for skin cancer diagnosis. As these algorithms are heavily influenced by the images that they are trained on, it is important that the composition and characteristics of datasets are outlined, such as which populations images are taken from. This information is often...

PRESENTATION
STANDING Together: STANdards for Data Diversity, Inclusivity and Generalisability

Presented by Dr Joe Alderman at Machine Learning for Healthcare, 2022

Click here to view the poster

Click here to watch the presentation

MLHC Poster Reference list

PRESENTATION
Insights from the STANDING Together Consensus Study

Presented by Dr Elinor Laws at The Symposium on Artificial Intelligence for Learning Health Systems, 2023 

Click here to view the poster

25 April 2022 | NHS Transformation Directorate

21 April 2022 | STANDING Together

10 November 2021 | The Lancet Digital Health

20 October 2021 | NHSX