The goal of Project Fiddle is to build systems infrastructure that systematically speeds up distributed deep neural network (DNN) training while eking out the most from the resources used. Specifically, we are aiming for 100x more efficient training. To achieve this goal, we take a broad view of training: from a single GPU, to multiple GPUs on a machine, all the way to multiple machines in a cluster. Our innovations cut across the systems stack, from the memory subsystem, to the structuring of parallel computation, to the interconnects between GPUs and machines. Our work has generated interest and led to collaborations with product groups such as Cognitive Toolkit and Cloud Server Infrastructure.
Amar Phanishayee is a Ph.D. candidate at Carnegie Mellon’s Computer Science Department. The goal of his research is to enable the creation of high-performance, efficient networked systems for large-scale data-intensive computing. His research so far has addressed problems across the distributed systems stack: from new hardware to techniques for using it efficiently, and from network protocols to distributed systems and consistency protocols. Amar was awarded an IBM Research Fellowship (2009, 2010), a ThinkSwiss Research Scholarship, and a SOSP Best Paper Award in 2009.