Datasheets for Datasets

Datasheets for datasets is a tool for documenting the datasets used for training and evaluating machine learning models. The aim of datasheets is to increase dataset transparency and facilitate better communication between dataset creators and dataset consumers (e.g., those using datasets to train machine learning models). Datasheets encourage dataset creators to reflect carefully on the dataset creation process, helping them uncover possible sources of bias in their data or unintentional assumptions that they’ve made. For dataset consumers, the information in a datasheet can help determine whether the dataset is the right choice for the task at hand. Datasheets can optionally be shared with end users for increased transparency and trust. Datasheets contain questions about dataset motivation, composition, collection, pre-processing, labeling, intended uses, distribution, and maintenance. Crucially, and unlike other tools for metadata extraction, datasheets are not automated; they are intended to capture information that is known only to the dataset creators and is often lost or forgotten over time.
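As a rough illustration of the structure described above, the sketch below lays out an empty datasheet with one entry per topic area. It is not the official template: the class name `Datasheet`, the `unanswered` and `render` helpers, and the placeholder text are all hypothetical, and only the eight section names come from the description above. The answers are deliberately left blank, since they are meant to be written by the dataset creators rather than generated.

```python
from dataclasses import dataclass, field

# Section names follow the topic areas listed in the description above.
# Everything else here (class name, helpers, placeholder text) is a
# hypothetical illustration, not the official datasheet template.
SECTIONS = [
    "Motivation",
    "Composition",
    "Collection",
    "Pre-processing",
    "Labeling",
    "Intended uses",
    "Distribution",
    "Maintenance",
]

PLACEHOLDER = "(to be completed by the dataset creators)"


@dataclass
class Datasheet:
    dataset_name: str
    # One free-text answer per section, empty until a person fills it in.
    answers: dict = field(default_factory=lambda: {s: "" for s in SECTIONS})

    def unanswered(self):
        """Return the sections that still have no answer."""
        return [s for s in SECTIONS if not self.answers.get(s, "").strip()]

    def render(self):
        """Render the datasheet as plain text, one block per section."""
        lines = [f"Datasheet for {self.dataset_name}", ""]
        for section in SECTIONS:
            lines.append(section)
            lines.append(self.answers.get(section, "").strip() or PLACEHOLDER)
            lines.append("")
        return "\n".join(lines)
```

A creator might instantiate `Datasheet("my-dataset")`, fill in `answers["Motivation"]` and the other sections by hand, and call `unanswered()` before release to see which sections still need attention.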

Get the Datasheet template >