Dataset Documentation Recommendations
1 - Dataset Documentation Recommendations
These recommendations are in two parts:
Dataset Documentation Recommendations (this page)
Dataset Use Recommendations
The primary purpose of the Dataset Documentation Recommendations is to promote reporting of aspects of a dataset which may introduce bias. The goal is to provide sufficient information so that users of the data, as well as third-party evaluators and auditors (such as regulators and policymakers), can appraise the datasets and make an informed decision on the risk of bias when applied to a specific use case. These are generic considerations and therefore all items below should be considered applicable to all datasets. Where certain information is not available or cannot be reported, this should be explicitly stated.
The draft recommendations are set out in the sections below.
1.1 - DATASET SUMMARY
1.1a - Dataset summary
Dataset curators should provide documentation for datasets they produce. This documentation should include a description of the contents, source and purpose of the dataset, and should be written in accessible language. The summary should help data users assess whether the dataset meets their needs.
1.1b - Dataset identity and access
Dataset documentation should include: dataset name, accessibility, date of release, version, licensing arrangements, and details of the data custodian(s). Where possible this documentation should adhere to FAIR principles [1].
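As an illustration only, the identity and access items in 1.1b could be captured in a small machine-readable record. The sketch below is not part of the recommendations: the class name, field names and example values are hypothetical. It assumes that any item which is not available is written out explicitly as "not reported" rather than silently omitted, in line with the introduction above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatasetIdentity:
    """Identity and access items from 1.1b (field names are illustrative)."""
    name: str
    accessibility: Optional[str] = None
    release_date: Optional[str] = None  # e.g. an ISO 8601 date string
    version: Optional[str] = None
    licence: Optional[str] = None
    custodians: List[str] = field(default_factory=list)


def to_record(identity: DatasetIdentity) -> dict:
    """Convert to a dict, stating explicitly which items are not reported."""
    return {
        key: ("not reported" if value is None or value == [] else value)
        for key, value in asdict(identity).items()
    }


# Hypothetical example values, not a real dataset.
example = DatasetIdentity(
    name="Example Cohort (hypothetical)",
    version="1.0",
    licence="CC BY 4.0",
)
print(json.dumps(to_record(example), indent=2))
```

A flat record of this kind is only one possible layout; curators following an existing metadata standard would map these items onto that standard's own fields instead.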
1.1c - Motivations for dataset creation and intended purpose(s)
Dataset documentation should include the reasons why this dataset was created, including any intended benefit(s), any purposes for which dataset use should be avoided, who created the dataset, and who funded it.
1.1d - Assumptions and preconceptions of the dataset curation team
Dataset documentation should describe how the curation team has considered the impact of their prior assumptions and preconceptions on biases in the dataset. This may include reflecting on the experiences of the dataset curators themselves, as well as any advice from governing and consultation groups (e.g. advisory boards, patient and public involvement and engagement groups).
1.1e - Origin and purpose of source data
Dataset documentation should describe the original source of data (e.g., patient records to provide clinical care, clinical trial, biobank) and what individuals were expecting to happen to their data (e.g., administrative action, participation in a research study).
1.1f - Data sampling, and aggregation from multiple sources
Dataset documentation should describe how data were sampled from the original data source, including an explanation of sampling strategies and their rationale. If the dataset has been compiled from multiple data sources, dataset documentation should describe how datasets were selected, and how decisions were made during data aggregation, particularly in the case of grouping populations and modification of demographic coding.
1.1g - Data shifts
For longitudinal datasets or datasets with versions, dataset documentation should describe any known or suspected changes over time relating to the population, medical practice, or how data were collected, which may contribute to data shifts.
1.2 - RECORDING OF ATTRIBUTES OF INDIVIDUALS
1.2a - Composition of populations
Dataset documentation should:
Include a summary of the populations present in the dataset. The choice of which populations to describe, and the choice of grouping/categorisation, should be explained.
Highlight any known missing groups within the dataset and any reason(s) for their missingness.
1.2b - Recording of attributes of individuals
Dataset documentation should:
Describe how and why attributes are provided in the dataset (self-reported by participants, imputed, linked from other datasets), and whether this information is available at the individual or aggregate level.
Explain whether attributes have been coded, condensed or modified, stating how and why this was done.
Highlight the proportion of attributes recorded as ‘unknown’ or ‘other’, and if possible explain the reasons why.
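The proportion of 'unknown' or 'other' entries mentioned above can be computed directly from the records. The following is a minimal sketch, assuming the data are held as a list of dictionaries and that the literal codes "unknown" and "other" are used (case-insensitively); real datasets may use different codes, and the attribute names here are hypothetical.

```python
from collections import Counter


def unspecified_proportions(records, attributes):
    """For each attribute, return the share of records coded 'unknown' and 'other'."""
    total = len(records)
    summary = {}
    for attr in attributes:
        # Normalise case and whitespace so 'Unknown' and 'unknown ' count together.
        counts = Counter(str(r.get(attr)).strip().lower() for r in records)
        summary[attr] = {
            code: (counts[code] / total if total else 0.0)
            for code in ("unknown", "other")
        }
    return summary


# Hypothetical records for illustration.
records = [
    {"ethnicity": "Unknown", "sex": "female"},
    {"ethnicity": "other", "sex": "male"},
    {"ethnicity": "Asian", "sex": "unknown"},
    {"ethnicity": "White", "sex": "female"},
]
print(unspecified_proportions(records, ["ethnicity", "sex"]))
```

Attributes that are absent from a record are not counted here; they would be reported separately as missing data (see 1.3c).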
1.2c - Groups at particular risk of harm
Dataset documentation should:
Always include data (when available) on certain attributes (including age, gender identity, sex, race, ethnicity, socioeconomic status, and sexual orientation), due to known associations with health outcomes and interactions with wider social factors. If including these data may place individuals at risk of identification or endanger them, these data should instead be provided at aggregate level for the whole dataset. If data on these particular attributes are missing, reasons for this should be stated.
Highlight the presence of any vulnerable population groups in this dataset, with consideration of both vulnerabilities that are universal (e.g., children, people with severe disabilities, displaced persons) and those that are specific to the site of data collection (e.g., marginalised religious or caste groups).
1.3 - SOURCES OF BIAS
1.3a - Modifications made to the data
Dataset documentation should describe whether any data items were modified from the original source, providing the rationale for doing so and any methods used. Examples include modifications made for anonymisation, to correct for imbalance, to correct errors or biases, or to enable mapping to existing data standards.
1.3b - Limitations of the dataset
Dataset documentation should identify known or suspected sources of bias, error or other factors that affect the dataset as a whole and may limit its generalisability or applicability to other uses.
1.3c - Missing data
Dataset documentation should describe the proportion, nature and causes of missing data (if known), particularly if there are systematic differences across relevant population groups. Documentation should also describe whether missing data have been identified and how they have been handled (e.g. imputation, correction).
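One way to check for systematic differences in missingness is to compare the proportion of missing values for a given field across population groups. The sketch below assumes records are dictionaries and that a value counts as missing when it is absent, None, or an empty string; the group and field names are hypothetical.

```python
from collections import defaultdict


def missingness_by_group(records, group_key, field_name):
    """Return, for each group, the proportion of records missing `field_name`."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record.get(group_key, "not recorded")
        totals[group] += 1
        value = record.get(field_name)
        if value is None or value == "":
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical records: one of two values is missing in the older age band.
records = [
    {"age_band": "18-64", "hba1c": 41},
    {"age_band": "18-64", "hba1c": 48},
    {"age_band": "65+", "hba1c": None},
    {"age_band": "65+", "hba1c": 53},
]
print(missingness_by_group(records, "age_band", "hba1c"))
```

A marked difference between groups does not by itself explain the cause, so the documentation would still need to describe the likely reasons where they are known.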
1.3d - Errors in the data
Dataset documentation should:
Describe how errors can be, or have been, identified in the data and how they have been handled (e.g. whether they have been removed, modified, corrected or left in the dataset).
Provide an estimate of the proportion of errors that are present and whether they are more prevalent in some population groups than others.
Provide possible reasons for any systematic differences in error rates across population groups within the dataset.
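As a rough illustration of the items above, a simple rule-based check can flag implausible values and then compare the flagged proportion across groups. This is a sketch under stated assumptions rather than a recommended method: the plausibility range, field name and grouping variable are all hypothetical, and a range check will only catch one kind of error.

```python
from collections import Counter


def flag_out_of_range(records, field_name, low, high):
    """Return the records whose value lies outside the range [low, high]."""
    return [
        r for r in records
        if r.get(field_name) is not None and not (low <= r[field_name] <= high)
    ]


def flagged_rate_by_group(records, flagged, group_key):
    """Return, for each group, the proportion of its records that were flagged."""
    totals = Counter(r[group_key] for r in records)
    errors = Counter(r[group_key] for r in flagged)
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical systolic blood pressure readings (mmHg) from two sites.
records = [
    {"site": "A", "systolic_bp": 120},
    {"site": "A", "systolic_bp": 999},  # implausible value, e.g. a placeholder code
    {"site": "B", "systolic_bp": 130},
    {"site": "B", "systolic_bp": 110},
]
flagged = flag_out_of_range(records, "systolic_bp", low=50, high=300)
print(flagged_rate_by_group(records, flagged, "site"))
```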
1.3e - Known or potential bias in data generation
Dataset documentation should:
Describe how bias may be introduced by the acquisition and processing of data within the dataset, for example through the use of devices, sensors and software.
Highlight any known or potential differences in data acquired across different population groups, or any uncertainty of measurements within population groups.
Describe any attempts to mitigate these biases.
1.3f - Known or potential bias in data collection
Dataset documentation should:
Identify areas where bias may have been introduced into the data collection process. For example: only collecting data from one geographical area, context regarding healthcare coverage and accessibility, only using questionnaires in English.
Describe any attempts to mitigate these biases.
1.3g - Known or potential bias in data labels
Dataset documentation should:
Provide a description of any data labels, including who decided what labels to include, what they were called, and how they were generated.
Highlight labels that are at high risk of bias. For example, where label generation was at the discretion of individuals, where biases in labelling behaviour have been evidenced previously, or where proxy variables are used (e.g., healthcare costs as a proxy for healthcare needs).
Describe any attempts to mitigate these biases.
1.4 - ETHICS AND DATA GOVERNANCE
1.4a - Ethics, governance, and quality assurance
Dataset curators should state in their documentation whether data protection laws specific to their jurisdiction have been adhered to. Dataset documentation should also:
Describe measures taken to protect the identities of individuals.
Describe permissions obtained to enable dataset curation, and details of the governance of the dataset.
Provide references to institutional review board/ethical committee review (or equivalent, as appropriate).
Reference standards (e.g. ISO [2], FAIR [1]) which have been adhered to.
1.4b - Patient and public involvement and engagement
Dataset documentation should:
Describe the role of any advisory boards and patient and public involvement and engagement groups in the dataset curation.
Provide information on any efforts to share data and findings with those who contributed to the dataset and any feedback gathered from participants that is relevant to data interpretation.
1.4c - Bias and impact assessments
If a formal assessment of bias, fairness or societal impact has been previously conducted on the dataset, dataset documentation should provide these assessments and results. This may include algorithmic impact assessments (AIAs) [3], data protection impact assessments (DPIAs) [4], equality impact assessments [5], documentation tools, risk of bias assessments or automated toolkits.
1. Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. See also: https://www.go-fair.org/fair-principles/
2. International Organization for Standardization (ISO) [internet]. Available from: https://www.iso.org/standards.html
3. Goves L, Brennan J, Peppin A, Strait A. Algorithmic impact assessment: a case study in healthcare [internet]. Ada Lovelace Institute, UK; 2022. Available from: https://www.adalovelaceinstitute.org/report/algorithmic-impact-assessment-case-study-healthcare/
4. Data protection impact assessments [internet]. Information Commissioner’s Office, UK; 2022. Available from: https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/accountability-and-governance/data-protection-impact-assessments/
5. Pyper D. The Public Sector Equality Duty and Equality Impact Assessments. House of Commons Library, UK; 2020. Available from: https://commonslibrary.parliament.uk/research-briefings/sn06591/