
2 - Dataset Use Recommendations

These recommendations are in two parts: the Dataset Documentation Standards (part 1) and the Dataset Use Recommendations (part 2, this section).

The primary purpose of the Dataset Use Recommendations is to promote best practice for how datasets should be selected and used throughout the AI and digital health technology lifecycle, including recording how risks to relevant subgroups have been identified and mitigated. They are aimed primarily at Data Users, as they require context and should be considered with a specific use case in mind.



2.1 - PROVIDE SUFFICIENT DATASET DOCUMENTATION

2.1a - Provide sufficient information about dataset(s) to allow traceability and auditability

Datasets used in the lifecycle of AI health technologies should be accompanied by documentation which conforms to Dataset Documentation Standards, enabling audit against these standards.

2.2 - EVALUATE IN THE CONTEXT OF RELEVANT SUBGROUPS

2.2a - Identify contextualised subgroups of interest who are particularly at risk of harm from the AI health technology under development

Data Users should identify contextualised subgroups of interest in advance: subgroups with shared attributes, identified as relevant and important for the use case, and known to have worse health outcomes or to be subject to other systems driving health inequity related to the use case. Contextualised subgroups of interest may be discovered via multiple sources, including literature review, evidence from the development or use of similar AI health technologies, and consultation with experts in health inequity and clinical practice.



2.2b - Use appropriate datasets to support the intended use population and intended purpose of the AI health technology

The intended use population should be adequately represented in the datasets used for the AI health technology. The contextualised subgroups of interest (see item 2.2a) should also be included where possible; if they are not included, this should be explicitly stated. Areas of under-representation should be identified and transparently reported.
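As an illustration, representation can be checked by comparing subgroup proportions in the dataset against the intended use population. The sketch below is a minimal example in Python, assuming a pandas DataFrame with an ethnicity column and externally sourced reference proportions; all column names, groups, and values are hypothetical.

```python
# Minimal sketch: flag under-representation relative to the intended
# use population. Column name, groups, and reference proportions are
# hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({"ethnicity": ["A", "A", "B", "A", "C", "A", "B", "A"]})

# Hypothetical intended-use-population proportions (e.g. census-derived).
reference = {"A": 0.50, "B": 0.30, "C": 0.20}

observed = df["ethnicity"].value_counts(normalize=True)
for group, expected in reference.items():
    actual = observed.get(group, 0.0)
    flag = "UNDER-REPRESENTED" if actual < expected else "ok"
    print(f"{group}: dataset {actual:.0%} vs population {expected:.0%} [{flag}]")
```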



2.2c - Identify discrepant performance of the AI health technology for contextualised subgroups of interest

Data Users should:

Report performance of the AI health technology for contextualised subgroups of interest identified in 2.2a.

Compare performance for contextualised subgroups of interest to aggregate performance in the overall study population.

Report performance of the AI health technology for the subgroup(s) with the best pre-existing health outcomes in this clinical area, and compare this to performance for contextualised subgroups of interest.
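A minimal sketch of this comparison, assuming a pandas DataFrame with binary labels y, model scores score, and a subgroup column (all names are illustrative, and AUROC is used only as an example metric; the recommendations do not prescribe one):

```python
# Minimal sketch: per-subgroup AUROC compared against aggregate AUROC.
# Column names and choice of metric are illustrative only.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_report(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    overall = roc_auc_score(df["y"], df["score"])
    rows = []
    for name, part in df.groupby(group_col):
        if part["y"].nunique() < 2:
            continue  # AUROC is undefined for single-class subgroups
        auc = roc_auc_score(part["y"], part["score"])
        rows.append({"subgroup": name, "n": len(part),
                     "auroc": auc, "delta_vs_overall": auc - overall})
    return pd.DataFrame(rows)
```

The comparison against the subgroup(s) with the best pre-existing health outcomes can be read from the same table by differencing the relevant rows.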



2.2d - Evaluate performance of the AI health technology for subgroups experiencing vulnerability

If not already addressed by 2.2c, Data Users should report evaluation results across certain attributes (including age, gender identity, sex, race, ethnicity, socioeconomic status and sexual orientation), due to known associations with health outcomes and interactions with wider social factors. This may not always be possible or appropriate, in which case the reasons for not doing so should be documented.
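Reusing the hypothetical subgroup_report() from the sketch in 2.2c, attribute-wise evaluation might look like the following, with a documented reason recorded wherever an attribute cannot be evaluated; the attribute column names are assumptions.

```python
# Minimal sketch: stratified reporting across the attributes named in
# 2.2d, reusing the hypothetical subgroup_report() from the 2.2c sketch.
ATTRIBUTES = ["age_band", "gender_identity", "sex", "race",
              "ethnicity", "socioeconomic_status", "sexual_orientation"]

for attr in ATTRIBUTES:
    if attr not in df.columns:
        # 2.2d: where evaluation is not possible, document the reason.
        print(f"{attr}: not evaluated (attribute not recorded in dataset)")
        continue
    print(subgroup_report(df, attr))
```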

2.3 - ACKNOWLEDGE KNOWN BIASES AND LIMITATIONS OF THE DATASET AND ANY IMPLICATIONS FOR THE INTENDED USE OF THE AI HEALTH TECHNOLOGY

2.3a - Report limitations of datasets used, and any implications for the AI health technology

Data Users should report limitations of datasets used, and the implications of these for the target AI health technology. Data Users should investigate whether limitations are systematically different across relevant population subgroups, including those categorised as ‘unknown’ or ‘other’, and report any differences which could result in worse performance of the AI health technology for some groups.
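One concrete check for systematically different limitations is comparing missingness rates across subgroups, including records categorised as ‘unknown’ or ‘other’. A minimal sketch, with all field and group names hypothetical:

```python
# Minimal sketch: per-subgroup missingness for one field, including
# 'unknown'/'other' codes. Field and group names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "ethnicity": ["A", "A", "B", "unknown", "B", "other", "A", "unknown"],
    "blood_pressure": [120, None, 135, None, None, 118, 122, None],
})

missing_by_group = (df.assign(missing=df["blood_pressure"].isna())
                      .groupby("ethnicity")["missing"].mean())
print(missing_by_group.sort_values(ascending=False))
```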



2.3b - Report differences between the intended purposes of the AI health technology and datasets used during development, including the implications of discordance

Data Users should report any intended purposes of datasets used (item 1.1c), and how these differ from the intended purpose of the AI health technology (item 2.2b). State the implications of any discordance and provide justification regarding the suitability of the dataset, including assumptions made and aspects of the dataset which are not directly applicable.



2.3c - Report level of uncertainty for performance in subgroups when sample size is insufficient

Where a sufficient sample size is not achieved in minority and/or intersectional subgroups, Data Users should report the level of uncertainty for performance in these subgroups (e.g. with confidence intervals). Where this suggests additional risk, describe whether mitigation plans are in place to avoid harm to these groups.
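A minimal sketch of one way to report this, using a Wilson score interval for sensitivity in a small subgroup (the counts are illustrative, and the recommendations do not mandate a particular interval method):

```python
# Minimal sketch: 95% Wilson confidence interval for sensitivity in a
# small subgroup. Counts are illustrative placeholders.
from statsmodels.stats.proportion import proportion_confint

true_positives, positives = 17, 22
sensitivity = true_positives / positives
low, high = proportion_confint(true_positives, positives,
                               alpha=0.05, method="wilson")
print(f"sensitivity = {sensitivity:.2f} "
      f"(95% CI {low:.2f}-{high:.2f}, n = {positives})")
```

A wide interval is itself a reportable finding: it signals that performance for the subgroup is not yet established.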



2.3d - Report findings of pre-existing dataset assessments

Data Users should review any pre-existing assessments of the datasets which are available (e.g. algorithmic impact assessments [1], equality impact assessments [2], data protection impact assessments [3], Datasheet [4], Healthsheet [5]) and report how the findings may translate to harm for subgroups within the intended use population.

2.4 - ADDRESS UNCERTAINTIES AND RISKS WITH MITIGATION PLANS

2.4a - Address uncertainties and risks with mitigation plans

Where Data Users have identified uncertainty or potentially variable performance in subgroups, any clinical implications resulting from these findings must be clearly stated and reported as risks. The Data User should document plans to monitor these risks as part of the post-market clinical follow-up and post-market surveillance.

Proposed additional item in Dataset Use Recommendations

Early feedback has suggested adding an additional item to the draft Dataset Use Recommendations. The suggested wording is below:


Report any statistical approaches (including ‘fairness methods/metrics’) used to intentionally modify performance across subgroups. 

Data Users should document any methods applied during development and evaluation of the AI health technology that are intended to make predictions more equitable across subgroups. Describe:

The rationale and goals for doing so.

The methods and metrics used.

How thresholds were set, including whether these vary between subgroups of people.
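As an illustration of the kind of documentation envisaged, the sketch below applies one simple approach (subgroup-specific decision thresholds chosen to reach a target sensitivity) and reports both the thresholds and the resulting rates. All data, names, and the target value are hypothetical, and this is one method among many.

```python
# Minimal sketch: subgroup-specific thresholds chosen to reach a target
# sensitivity, reported alongside the resulting rates so that the
# intervention is transparent. All data and names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subgroup": rng.choice(["A", "B"], size=200),
    "y": rng.integers(0, 2, size=200),
    "score": rng.random(200),
})

def threshold_for_sensitivity(y, score, target=0.85):
    """Lowest score threshold achieving at least the target sensitivity."""
    pos_scores = np.sort(score[y == 1])[::-1]  # positive-class scores, high to low
    k = int(np.ceil(target * len(pos_scores)))
    return pos_scores[k - 1]

for name, part in df.groupby("subgroup"):
    y, s = part["y"].to_numpy(), part["score"].to_numpy()
    t = threshold_for_sensitivity(y, s)
    pred = s >= t
    tpr = (pred & (y == 1)).sum() / (y == 1).sum()
    fpr = (pred & (y == 0)).sum() / (y == 0).sum()
    print(f"{name}: threshold={t:.2f}, sensitivity={tpr:.2f}, FPR={fpr:.2f}")
```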