Machine learning research has traditionally been model-centric, focusing on architectures, parameter optimization, and model transfer. Much less attention has been given to the datasets on which these models are trained, which are often assumed to be fixed or subject to extrinsic and inevitable change. However, successfully applying ML in practice often requires substantial effort in dataset preprocessing and manipulation, such as augmenting, merging, mixing, or reducing datasets.
In this talk, I will present some of our recent work that seeks to formalize and automate these and other flavors of dataset manipulation under a unified approach. First, I will introduce the Optimal Transport Dataset Distance, which provides a fundamental theoretical building block: a formal notion of similarity between labeled datasets. In the second part of the talk, I will discuss how this notion of distance can be used to formulate a general framework for dataset optimization by means of gradient flows in probability space. I will end by presenting various exciting potential applications of this dataset optimization framework.
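As a rough illustration of what a distance between labeled datasets can look like, the following Python sketch computes an exact optimal transport cost between two tiny datasets. It is a toy under stated assumptions, not the method presented in the talk: the function names are hypothetical, both datasets are assumed to have the same size with uniform weights (so optimal transport reduces to an assignment problem, solved here by brute force), and the label term is simplified to the squared distance between class means.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def class_means(dataset):
    """Map each label to the mean feature vector of its class."""
    sums, counts = {}, {}
    for x, y in dataset:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def toy_dataset_distance(ds_a, ds_b):
    """Exact OT cost between two equal-size, uniformly weighted labeled datasets.

    Each dataset is a list of (features, label) pairs. Brute force over
    permutations, so this is only usable for a handful of points.
    """
    if not ds_a or len(ds_a) != len(ds_b):
        raise ValueError("this sketch assumes non-empty, equal-size datasets")

    means_a, means_b = class_means(ds_a), class_means(ds_b)

    # Ground cost between labeled points: feature distance plus a label term.
    # Assumption: labels are compared through their class means only.
    cost = [
        [sq_dist(xa, xb) + sq_dist(means_a[ya], means_b[yb]) for xb, yb in ds_b]
        for xa, ya in ds_a
    ]

    # With uniform weights and equal sizes, an optimal plan is a permutation.
    n = len(ds_a)
    best = min(
        sum(cost[i][perm[i]] for i in range(n)) for perm in permutations(range(n))
    )
    return best / n


source = [((0.0, 0.0), "a"), ((1.0, 0.0), "b")]
shifted = [((3.0, 0.0), "a"), ((4.0, 0.0), "b")]
same = toy_dataset_distance(source, source)
moved = toy_dataset_distance(source, shifted)
```

Because labels enter only through the features of their classes, the two datasets do not need to share a label set, which is the property that makes a distance of this kind useful for comparing unrelated datasets. At realistic sizes, the brute-force search would have to be replaced by a proper optimal transport solver.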
Learn more about the 2020-2021 Directions in ML: AutoML and Automating Algorithms virtual speaker series: https://aka.ms/diml