Dataset Documentation Recommendations

1 - Dataset Documentation Recommendations

These recommendations are in two parts: 


The primary purpose of the Dataset Documentation Recommendations is to promote reporting of aspects of a dataset which may introduce bias. The goal is to provide sufficient information so that users of the data, as well as third-party evaluators and auditors (such as regulators and policymakers), can appraise the datasets and make an informed decision on the risk of bias when applied to a specific use case. These are generic considerations and therefore all items below should be considered applicable to all datasets. Where certain information is not available or cannot be reported, this should be explicitly stated.


The draft recommendations are set out in the sections below.

1.1 - DATASET SUMMARY

1.1a - Dataset summary

Dataset curators should provide documentation for datasets they produce. This documentation should include a description of the contents, source and purpose of the dataset, and should be written in accessible language. The summary should help data users assess whether the dataset meets their needs.



1.1b - Dataset identity and access

Dataset documentation should include: dataset name, accessibility, date of release, version, licensing arrangements, and details of the data custodian(s). Where possible, this documentation should adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) principles [1].
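
As a rough illustration, the identity and access items above could be held in a small structured record so that unreported items are easy to spot and state explicitly. This is a minimal sketch, not a prescribed schema: the class name `DatasetIdentity`, its field names, and the helper `unreported_items` are all hypothetical, and the example values are invented.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatasetIdentity:
    """Hypothetical record of the identity and access items (not an official schema)."""

    name: Optional[str] = None
    accessibility: Optional[str] = None  # e.g. "open" or "on request" (assumed values)
    release_date: Optional[str] = None   # assumed to be an ISO 8601 date string
    version: Optional[str] = None
    licence: Optional[str] = None
    custodians: List[str] = field(default_factory=list)


def unreported_items(record: DatasetIdentity) -> List[str]:
    """Return the names of items left empty, so the gaps can be stated explicitly."""
    return [key for key, value in asdict(record).items() if not value]


# Invented example: three of the six items have not been reported.
example = DatasetIdentity(
    name="Example cohort",
    version="1.0",
    custodians=["Example Hospital"],
)
print(unreported_items(example))
```

Listing the empty fields, rather than silently omitting them, mirrors the expectation that unavailable information is explicitly declared.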



1.1c - Motivations for dataset creation and intended purpose(s)

Dataset documentation should include the reasons why this dataset was created, including any intended benefit(s), any purposes for which dataset use should be avoided, who created the dataset, and who funded it.



1.1d - Assumptions and preconceptions of the dataset curation team

Dataset documentation should describe how the curation team has considered the impact of their prior assumptions and preconceptions on biases in the dataset. This may include reflecting on the experiences of the dataset curators themselves, as well as any advice from governing and consultation groups (e.g. advisory boards, patient and public involvement and engagement groups).



1.1e - Origin and purpose of source data

Dataset documentation should describe the original source of the data (e.g., patient records created to provide clinical care, a clinical trial, a biobank) and what individuals expected to happen to their data (e.g., administrative action, participation in a research study).



1.1f - Data sampling, and aggregation from multiple sources

Dataset documentation should describe how data were sampled from the original data source, including an explanation of sampling strategies and their rationale. If the dataset has been compiled from multiple data sources, dataset documentation should describe how datasets were selected, and how decisions were made during data aggregation, particularly in the case of grouping populations and modification of demographic coding.



1.1g - Data shifts

For longitudinal datasets or datasets with versions, dataset documentation should describe any known or suspected changes over time relating to the population, medical practice, or how data were collected, which may contribute to data shifts.
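
One simple way to surface a possible population shift between two dataset versions is to compare the proportion of records in each category of a recorded attribute. The sketch below assumes the attribute is available as a plain list of category labels per version; the function names and the age-band values are hypothetical, and a change in proportions only flags something to investigate and document, it does not establish a cause.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def category_proportions(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Proportion of records in each category (empty input gives an empty dict)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def proportion_changes(old_values, new_values) -> Dict[Hashable, float]:
    """Change in each category's proportion from the old version to the new one."""
    old = category_proportions(old_values)
    new = category_proportions(new_values)
    categories = sorted(set(old) | set(new))
    return {c: new.get(c, 0.0) - old.get(c, 0.0) for c in categories}


# Invented age-band labels for two versions of a dataset.
version_1 = ["18-39", "18-39", "40-64", "40-64"]
version_2 = ["18-39", "40-64", "40-64", "40-64"]
print(proportion_changes(version_1, version_2))
```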

1.2 - RECORDING OF ATTRIBUTES OF INDIVIDUALS

1.2a - Composition of populations

Dataset documentation should:



1.2b - Recording of attributes of individuals

Dataset documentation should:



1.2c - Groups at particular risk of harm

Dataset documentation should:

1.3 - SOURCES OF BIAS

1.3a - Modifications made to the data

Dataset documentation should describe whether any data items were modified from the original source, providing the rationale for doing so and any methods used. Examples include modifications made for anonymisation, to correct for imbalance, to correct errors or biases, or to enable mapping to existing data standards.



1.3b - Limitations of the dataset

Dataset documentation should identify known or suspected sources of bias, error or other factors that affect the dataset as a whole and that may limit its generalisability or applicability for other uses.



1.3c - Missing data

Dataset documentation should describe the proportion, nature and causes of missing data (if known), particularly if there are systematic differences across relevant population groups. Documentation should also describe how missing data were identified and how they were handled (e.g. imputation, correction).
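
The proportion of missing data per population group can be tabulated directly from the records. The following is a minimal sketch under stated assumptions: records are dictionaries, a missing value is represented as `None`, and the function name, the `sex` and `hba1c` keys, and the example rows are hypothetical. Records whose group attribute is itself missing are counted under an "unrecorded" label rather than dropped.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def missing_proportion_by_group(
    records: Iterable[Mapping[str, Any]],
    group_key: str,
    field_name: str,
) -> Dict[str, float]:
    """Proportion of records in each group where `field_name` is missing (None)."""
    totals: Dict[str, int] = defaultdict(int)
    missing: Dict[str, int] = defaultdict(int)
    for row in records:
        group = row.get(group_key)
        # Keep records with no recorded group visible instead of discarding them.
        label = "unrecorded" if group is None else str(group)
        totals[label] += 1
        if row.get(field_name) is None:
            missing[label] += 1
    return {label: missing[label] / totals[label] for label in totals}


# Invented example rows.
rows = [
    {"sex": "female", "hba1c": 41.0},
    {"sex": "female", "hba1c": None},
    {"sex": "male", "hba1c": 38.0},
    {"sex": "male", "hba1c": 45.0},
    {"sex": None, "hba1c": None},
]
print(missing_proportion_by_group(rows, "sex", "hba1c"))
```

Large differences between groups in such a table are the kind of systematic pattern the documentation is expected to report, together with any known cause.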



1.3d - Errors in the data

Dataset documentation should: 



1.3e - Known or potential bias in data generation

Dataset documentation should:



1.3f - Known or potential bias in data collection

Dataset documentation should:



1.3g - Known or potential bias in data labels

Dataset documentation should:

1.4 - ETHICS AND DATA GOVERNANCE

1.4a - Ethics, governance, and quality assurance

Dataset curators should state in their documentation whether data protection laws specific to their jurisdiction have been adhered to. Dataset documentation should also:



1.4b - Patient and public involvement and engagement

Dataset documentation should:



1.4c - Bias and impact assessments

If a formal assessment of bias, fairness or societal impact has previously been conducted on the dataset, dataset documentation should provide these assessments and their results. This may include algorithmic impact assessments (AIAs) [3], data protection impact assessments (DPIAs) [4], equality impact assessments [5], documentation tools, risk of bias assessments or automated toolkits.