The Data Gap
To build artificial intelligence (AI) healthcare technologies that benefit all patients, we need diverse and representative data to be accessible.
As part of the STANDING Together project, we reviewed datasets in different disease areas (COVID-19, breast cancer and heart failure) and found that many datasets do not report individuals’ attributes, and where they are reported, there is a significant lack of diversity.
Part of the problem is that reporting the decisions made during dataset curation and use is complex: What should be included? How should decisions be described? What should be defined?
The STANDING Together recommendations provide guidance for dataset documentation to encourage transparent reporting of ‘who’ is represented, ‘how’ people are represented and how data is used when developing AI technologies for healthcare.
The Reviews
Here we provide a summary of findings from each review, with a catalogue of the datasets identified.
The Breast Cancer Dataset Review
What did we find? We were able to access 11 mammography datasets. There were over 2.5 million mammograms in total, 2 million of them from one UK dataset. The datasets were mainly from high-income settings and there was a lack of reporting of individuals’ attributes. Only three datasets reported race/ethnicity, and none provided a clear definition of these categories within their dataset documentation. Ten of the datasets were open access, whereas for the final dataset we needed to submit a data access request (regulated access). For this regulated access dataset, we were able to ask questions and seek clarification where the dataset documentation was unclear, meaning we could gain a greater understanding of the data.
What does this mean? Many mammography datasets do not transparently report 'who' is represented and 'how' they are represented, meaning developers of AI technologies using these datasets are less able to understand the limitations of the data. We found value in being able to contact dataset curators to clarify our understanding of the data, but there is work to be done in making large, diverse and representative mammography datasets accessible to allow for development of AI technologies that are safe and effective for everyone.
Click here to see the full list of datasets identified as part of this review
The COVID-19 Dataset Review
What did we find? We identified 119 COVID-19 datasets, which represented individuals from 28 different countries. Seventeen datasets included data from different geographical areas. A large proportion of the datasets identified were combined (i.e. they were formed from several smaller datasets), and eight of these were derived from a chest x-ray dataset which contained images of children. Often, this was noted only through review of the images within the dataset, as the dataset documentation made limited reference to paediatric data. Most of the datasets included chest x-ray images and some included CT thorax images. One dataset included brain MRI images. There was a distinct lack of reporting of individuals' attributes: only 35% reported age, 31% reported sex/gender and 8% reported race/ethnicity of individuals included in the dataset.
What does this mean? The rapid development of datasets in response to the COVID-19 pandemic meant that curators focused on quantity rather than quality of data. Documentation for COVID-19 datasets is limited, which means AI developers are less able to understand the data they are using. A lack of dataset documentation is particularly problematic where datasets have been combined: there is a risk that data quality is compromised, yet users are unlikely to know where the issues lie if they are not documented. There is a need for clear documentation about how datasets are curated and who is included within them.
Click here to see the full list of datasets identified as part of this review
The Heart Failure Dataset Review
What did we find? The 28 accessible datasets identified represent over 1,080,000 individuals from 14 different countries. However, over 1,000,000 of those individuals are accounted for by the UK and USA alone. The data available are varied, the most common types being ECG data, health record data and echocardiography. Nine datasets (32%) reported race/ethnicity and six (21%) reported socioeconomic status. No datasets reported definitions for demographic terms, except for socioeconomic status, where explanations were given for how the groups were stratified. The methods for determining socioeconomic status varied greatly across datasets: some used recognised national scoring systems while others used data about employment or education status.
What does this mean? Accessibility of data around individuals' attributes can allow AI developers to understand how their technologies might perform for different groups. However, definitions for different attributes vary depending on context. For example, race and ethnicity mean different things in different places, and socioeconomic status can be summarised in many different ways. For data around attributes to be useful, definitions must be provided so users of the data can understand what it is they are assessing.
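As a rough illustration of how the reporting percentages in the heart failure findings above can be reproduced from a dataset catalogue, here is a minimal Python sketch. The `DatasetRecord` fields, the `reporting_rate` helper and the placeholder catalogue are all assumptions made for this example; they are not the schema or tooling used in the review, and the entries are synthetic, built only to match the counts quoted above (28 datasets, nine reporting race/ethnicity, six reporting socioeconomic status).

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """One catalogue entry; field names are illustrative, not the project's schema."""
    name: str
    reports_race_ethnicity: bool
    reports_socioeconomic_status: bool


def reporting_rate(records, attribute):
    """Return (count, rounded percentage) of records whose `attribute` flag is True."""
    if not records:
        raise ValueError("catalogue is empty")
    count = sum(1 for r in records if getattr(r, attribute))
    return count, round(100 * count / len(records))


# Placeholder catalogue: 28 synthetic entries built only to match the quoted counts.
catalogue = [
    DatasetRecord(
        name=f"dataset_{i + 1}",
        reports_race_ethnicity=i < 9,
        reports_socioeconomic_status=i < 6,
    )
    for i in range(28)
]

race_count, race_pct = reporting_rate(catalogue, "reports_race_ethnicity")
ses_count, ses_pct = reporting_rate(catalogue, "reports_socioeconomic_status")
print(f"race/ethnicity: {race_count}/{len(catalogue)} ({race_pct}%)")
print(f"socioeconomic status: {ses_count}/{len(catalogue)} ({ses_pct}%)")
```

A count like this only shows whether an attribute is reported at all. It says nothing about whether the categories are defined, which is the gap the review highlights.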
Click here to see the full list of datasets identified as part of this review
The value of standards for health datasets in artificial intelligence-based applications
The STANDING Together Working Group, Nature Medicine, October 2023.
Summary by Dr Anmol Arora
This research article describes the outcome of our standards review and survey which led to the initial drafting of the recommendations. Click here to read a summary of the paper.
The full paper can be found at: https://doi.org/10.1038/s41591-023-02608-w
The STANDING Together Working Group, Nature Medicine, September 2022.
To launch the first round of the STANDING Together Delphi Study, an announcement paper was published in Nature Medicine in September 2022.
The full paper can be found at: https://doi.org/10.1038/s41591-022-01987-w
The availability of health datasets has accelerated digital health research. Ophthalmology has been one of the leading areas of innovation, where several public datasets for ophthalmic imaging have been used in machine learning research. Datasets are a critical component of machine learning algorithm development, so they need careful scrutiny prior to use. Prior to our review, it was unknown how many ophthalmic datasets existed, their degree of accessibility... Read more
Summary by Dr David Wen
Freely available (open access) datasets containing skin images are frequently used to develop deep learning algorithms for skin cancer diagnosis. As these algorithms are heavily influenced by the images that they are trained on, it is important that the composition and characteristics of datasets are outlined, such as which populations images are taken from. This information is often... Read more
STANDING Together: STANdards for Data Diversity, Inclusivity and Generalisability
Presented by Dr Joe Alderman at Machine Learning for Healthcare, 2022
Click here to view the poster
Click here to watch the presentation
MLHC Poster Reference list
Artificial Intelligence - how to get it right. NHSX. October 2019. Available from URL: https://transform.england.nhs.uk/media/documents/NHSX_AI_report.pdf
Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health. 2019 Oct 1;1(6):e271–97. Available online at: https://www.thelancet.com/journals/landig/article/PIIS2589-7500(19)30123-2/fulltext
Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019 Oct 25;366(6464):447–53. Available online at: https://www.science.org/doi/10.1126/science.aax2342
Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021 Dec;27(12):2176–82. Available online at: https://www.nature.com/articles/s41591-021-01595-0
The health of people from ethnic minority groups in England [Internet]. The King’s Fund. 2021 [cited 2022 Apr 12]. Available from: https://www.kingsfund.org.uk/publications/health-people-ethnic-minority-groups-england
Khan SM, Liu X, Nath S, Korot E, Faes L, Wagner SK, et al. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. The Lancet Digital Health. 2021 Jan 1;3(1):e51–66. Available online at: https://www.thelancet.com/journals/landig/article/PIIS2589-7500(20)30240-5/fulltext
Wen D, Khan SM, Xu AJ, Ibrahim H, Smith L, Caballero J, et al. Characteristics of publicly available skin cancer image datasets: a systematic review. The Lancet Digital Health. 2022 Jan 1;4(1):e64–74. Available online at: https://www.thelancet.com/journals/landig/article/PIIS2589-7500(21)00252-1/fulltext
Ibrahim H, Liu X, Zariffa N, Morris AD, Denniston AK. Health data poverty: an assailable barrier to equitable digital health care. The Lancet Digital Health. 2021 Apr 1;3(4):e260–5. Available online at: https://www.thelancet.com/journals/landig/article/PIIS2589-7500(20)30317-4/fulltext
Parikh RB, Teeple S, Navathe AS. Addressing Bias in Artificial Intelligence in Health Care. JAMA. 2019 Dec 24;322(24):2377–8. Available online at: https://jamanetwork.com/journals/jama/article-abstract/2756196
Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical Machine Learning in Healthcare. Annu Rev Biomed Data Sci. 2021 Jul;4:123–44. Available online at: https://www.annualreviews.org/doi/10.1146/annurev-biodatasci-092820-114757
Bennett JE, Pearson-Stuttard J, Kontis V, Capewell S, Wolfe I, Ezzati M. Contributions of diseases and injuries to widening life expectancy inequalities in England from 2001 to 2016: a population-based analysis of vital registration data. The Lancet Public Health. 2018 Dec 1;3(12):e586–97. Available online at: https://www.thelancet.com/journals/lanpub/article/PIIS2468-2667(18)30214-7/fulltext
Marmot M, Allen J, Boyce T, Goldblatt P, Morrison J. Health Equity in England: The Marmot Review 10 Years On - The Health Foundation [Internet]. London: Institute of Health Equity; 2020 [cited 2022 Apr 13]. Available online at: https://www.health.org.uk/publications/reports/the-marmot-review-10-years-on
Insights from the STANDING Together Consensus Study
Presented by Dr Elinor Laws at The Symposium on Artificial Intelligence for Learning Health Systems, 2023
Click here to view the poster
25 April 2022 | NHS Transformation Directorate
21 April 2022 | STANDING Together
10 November 2021 | National Cancer Research Institute
10 November 2021 | The Lancet Digital Health
22 October 2021 | University of Birmingham
20 October 2021 | The Guardian
20 October 2021 | NHSX