Network Topologies for Large-scale Datacenters: It’s the Diameter, Stupid!

  • Torsten Hoefler | Scalable Parallel Computing Laboratory (SPCL) at D-INFK ETH Zurich

We discuss the history and design tradeoffs of large-scale topologies in high-performance computing. We observe that datacenters are slowly following suit, driven by the growing demand for low latency and high throughput at the lowest possible cost. We then introduce a high-performance, cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks.

Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centers, as well as a detailed cost and power model. Slim Fly enables the construction of cost-effective and highly resilient datacenter and HPC networks that offer low latency and high bandwidth under diverse HPC workloads such as stencil or graph computations.
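The theoretical optimum referenced above is the Moore bound; as a brief sketch in standard graph notation (not part of the talk abstract itself), a topology whose routers each devote $d$ ports to inter-router links and whose diameter is $k$ can contain at most

$$ N_{\text{Moore}}(d, k) \;=\; 1 + d \sum_{i=0}^{k-1} (d-1)^{i}, \qquad N_{\text{Moore}}(d, 2) \;=\; d^{2} + 1 $$

routers. Slim Fly fixes the diameter at two and uses graphs that come close to this bound, which is the source of its scale-versus-cost advantage over higher-diameter topologies of the same router radix.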

Speaker Details

Torsten Hoefler directs the Scalable Parallel Computing Laboratory (SPCL) at D-INFK ETH Zurich. He received his PhD from Indiana University in 2007 and took up his first faculty appointment in 2011 at the University of Illinois at Urbana-Champaign.

Torsten has served as the lead for performance modeling and analysis in the US NSF Blue Waters project at NCSA/UIUC. Since 2013, he has been a professor of computer science at ETH Zurich and has held visiting positions at Argonne National Laboratory, Sandia National Laboratories, and Microsoft Research Redmond (Station Q).

Dr. Hoefler’s research aims at understanding the performance of parallel computing systems, ranging from parallel computer architecture through parallel programming to parallel algorithms. He is also active in the application areas of weather and climate simulation and machine learning, with a focus on distributed deep learning. In these areas, he has coordinated dozens of funded projects and holds an ERC Starting Grant on Data-Centric Parallel Programming.

He has been chair of the Hot Interconnects conference and technical program chair of the Supercomputing and ACM PASC conferences. He is an associate editor of the IEEE Transactions on Parallel and Distributed Systems (TPDS) and the Parallel Computing journal (PARCO), and a key member of the Message Passing Interface (MPI) Forum.

He has published more than 200 papers in peer-reviewed international conferences and journals and co-authored the latest versions of the MPI specification. He has received best paper awards at the ACM/IEEE Supercomputing Conference in 2010, 2013, and 2014 (SC10, SC13, SC14), EuroMPI 2013, IPDPS’15, ACM HPDC’15 and HPDC’16, ACM OOPSLA’16, and other conferences. Torsten received ETH Zurich’s Latsis Prize in 2015, the SIAM SIAG/Supercomputing Junior Scientist Prize in 2012, the IEEE TCSC Young Achievers in Scalable Computing Award in 2013, the Young Alumni Award 2014 from Indiana University, and the best student award of Chemnitz University of Technology in 2005. Torsten was elected to the first steering committee of ACM’s SIGHPC in 2013 and re-elected in 2016. His Erdős number is two (via Amnon Barak), and he is an academic descendant of Hermann von Helmholtz.