Flex: High-Availability Datacenters With Zero Reserved Power

  • Chaojie Zhang
  • Ioannis Manousakis
  • Deli Zhang
  • Rod Assis
  • Kyle Woolcock
  • Nithish Mahalingam
  • Brijesh Warrier
  • David Gauthier
  • Lalu Kunnath
  • Steve Solomon
  • Osvaldo Morales
  • Marcus Fontoura

Proceedings of the International Symposium on Computer Architecture (ISCA'21)

Cloud providers, such as Amazon and Microsoft, must guarantee high availability for a large fraction of their workloads. For this reason, they build datacenters with redundant infrastructure for power delivery and cooling. Typically, the redundant resources are reserved for use only during infrastructure failure or maintenance events, so that workload performance and availability do not suffer. Unfortunately, the reserved resources also lower power utilization and, consequently, require more datacenters to be built. To address these problems, in this paper we propose “zero-reserved-power” datacenters and the Flex system to ensure that workloads still receive their desired performance and availability. Flex leverages the existence of software-redundant workloads that can tolerate lower infrastructure availability, while imposing minimal (if any) performance degradation on those that require high infrastructure availability. Flex mainly comprises (1) a new offline workload placement policy that reduces stranded power while ensuring safety during failure or maintenance events, and (2) a distributed system that monitors for failures and, when it detects one, quickly reduces the power draw while respecting the workloads’ requirements. Our evaluation shows that Flex produces less than 5% stranded power and increases the number of deployed servers by up to 33%, which translates to hundreds of millions of dollars in construction cost savings per datacenter site. We end the paper with lessons from our experience bringing Flex to production in Microsoft’s datacenters.
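
To make the relationship between reserved power and server count concrete, the following is a minimal back-of-the-envelope sketch, not the paper's evaluation method. The function name `extra_server_fraction` and the 25% reserve are illustrative assumptions; the abstract does not state the reserve size. The sketch also assumes that every server draws the same power and that all provisioned power becomes usable, so it ignores stranded power.

```python
def extra_server_fraction(reserved_fraction: float) -> float:
    """Relative increase in deployable servers if reserved power is allocated.

    Assumption (not from the abstract): servers draw identical power and all
    provisioned power becomes usable, i.e. stranded power is ignored.
    """
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0, 1)")
    # Usable power grows from (1 - r) to 1, a relative gain of r / (1 - r).
    return reserved_fraction / (1.0 - reserved_fraction)


# Hypothetical reserve of 25% of provisioned power (an illustrative value).
gain = extra_server_fraction(0.25)
print(f"{gain:.1%}")  # prints 33.3%
```

Under these assumptions, a reserve of one quarter of provisioned power corresponds to roughly one third more servers, which is arithmetically consistent with the "up to 33%" figure reported above.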