Data is central to the development and evaluation of machine learning models. Many responsible AI harms can be traced back to characteristics of datasets. For example, lack of appropriate representation of different groups of people can lead to models that exhibit performance disparities. Spurious correlations and other unanticipated anomalies in training datasets can result in models that fail to generalize. Subjectivity in dataset labels and inaccurate notions of ground truth can result in models with misleading outputs.
Documenting datasets helps promote more deliberate reflection and transparency about how these datasets might affect machine learning models. For dataset creators, documenting your data can help you think through underlying assumptions, potential risks, and implications of use. It can also help dataset consumers—those who will use a dataset to develop or evaluate their models—make informed decisions about whether specific datasets meet their needs, and what limitations they need to consider. For these reasons, good data documentation practices are an essential component of responsible AI.
In 2018, we introduced datasheets for datasets, a tool for documenting the datasets used for training and evaluating machine learning models. Datasheets contain questions about dataset motivation, composition, collection, pre-processing, labeling, intended uses, distribution, and maintenance. Crucially, and unlike tools for automated metadata extraction, datasheets are not automated; they are intended to capture information known only to the dataset creators, which is often lost or forgotten over time.
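To make the shape of a datasheet concrete, the sections listed above can be sketched as a simple structured record. This is an illustrative sketch only, not the official template: the field names and the `unanswered` helper are hypothetical, and a real datasheet poses multiple free-text questions per section.

```python
# Hypothetical sketch of a datasheet as a structured record.
# Field names mirror the section headings named in the text; the
# unanswered() helper is an illustrative addition, not part of the template.
from dataclasses import dataclass


@dataclass
class Datasheet:
    motivation: str
    composition: str
    collection: str
    preprocessing: str
    labeling: str
    intended_uses: str
    distribution: str
    maintenance: str

    def unanswered(self) -> list[str]:
        """Return the names of sections left blank, flagging incomplete documentation."""
        return [name for name, value in vars(self).items() if not value.strip()]


# Example (all content below is invented for illustration):
sheet = Datasheet(
    motivation="Created to benchmark sentiment models on product reviews.",
    composition="50k English reviews; one label per review.",
    collection="Collected from a public forum with permission, 2020-2021.",
    preprocessing="Lowercased; HTML stripped.",
    labeling="Three crowd annotators per item; majority vote.",
    intended_uses="Evaluation only; not for training deployed systems.",
    distribution="CC BY 4.0, hosted on the project site.",
    maintenance="",  # left blank to show the completeness check
)
print(sheet.unanswered())  # ['maintenance']
```

Treating each section as a required field makes gaps visible early, which echoes the point above: the value of a datasheet comes from deliberate reflection by the dataset creators, not from automation.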
To meet the needs of machine learning practitioners in industry, we have continued to pilot and iterate on data documentation frameworks through Microsoft’s Aether Transparency Working Group, resulting in the Aether Data Documentation Template. See our CSCW 2022 paper for more information on studies we’ve run, and download our templates below.