Our ability to produce data continues to grow exponentially, and so does the computational power available worldwide. By 2021, zettabytes of data are expected to be stored in the cloud, and the DOE is expected to have deployed heterogeneous exascale clusters with millions of CPU cores and tens of thousands of nodes. Resources at this scale have the potential to accelerate scientific breakthroughs, but only if we build scalable software systems that use them efficiently.
Today, these clusters confine applications to a non-privileged environment that provides no feedback. I will discuss our two-pronged effort to improve operational efficiency in this setting by (a) allowing applications to control mechanisms and observe information traditionally accessible only to the operating system, and (b) using machine learning models to inform users of expected job behavior. This work is a collaboration with Los Alamos and Argonne National Laboratories.
George is an Assistant Research Professor at Carnegie Mellon University with a primary appointment in Electrical and Computer Engineering and a courtesy appointment in the Computer Science Department. He is also a member of the CMU Parallel Data Lab (PDL). His research focuses on distributed systems, high-performance computing, and data analysis. He co-teaches courses on cloud computing and storage systems. He received his Ph.D. from the University of Toronto.
Principal Research Manager