Hyperscale cloud reliability and the art of organic collaboration
What does it take to build one of the most reliable hyperscale clouds on the planet? It clearly requires astronomical investments and a vast organization that operates at global scale in near seamless coordination. Yet the breakthroughs that fuel this story emerged organically, from a combination of innovation and mentoring relationships that grew into close friendships, and initiative taking—recognizing and seizing opportunities regardless of where they existed in an organization. We will take a look at the relationship between creativity and technology along with the happy coincidences that seed the organic collaboration at Microsoft making a breakthrough possible in a relatively new space, that of network verification and hyperscale cloud reliability in Azure.
This story goes back several years and includes seasoned researchers, fresh talent, and research players across industry and academia. First, some background on the reality of networks in the present day.
It can be said today that the internet is finally fast, something a lot of people worked hard to realize. As the world transitioned to services, everything people do has wound up in the cloud—that global sprawl consisting of millions upon millions of affordable servers, all connected by a network—and the need arose to make networks super reliable. “This was the inflection point,” said George Varghese, Chancellor’s Professor of Computer Science at UCLA, formerly Principal Researcher at Microsoft Research, who played a large role in this story. “Networks went from being a ‘nice to have’ to a must-have, like electricity. You’ve got to have them working.”
Against this backdrop hummed an open culture around the product and research divisions at Microsoft and a community-wide drive in the active area of network verification. But this story is also a personal journey, and one that happens to involve applying symbolic tools to the Azure hyperscale network environment.
“Community leaders with open mindsets have promoted adopting methods traditionally developed in program analysis and verification and applying them to networks. They propose that the entire network can be thought of as a software program, asking, ‘Can we use program analysis to speak to these networks?’” explained Principal Researcher Nikolaj Bjørner of Microsoft Research.
Which brings us to intent-based networking.
Intent-based networking is the idea of automatically driving networks to a desired state from high-level policies. The emergence of hyper-scalable clouds and software-defined network orchestration has brought acute relevance to this area. In recent years, networking professionals have increasingly been embracing network verification to empower complex yet reliable networks.
“The time is ripe for software-defined networking, inspired by design methodologies in software and hardware engineering.” – Nikolaj Bjørner
Network verification tools such as Microsoft open-source Z3 take an abstract view of networks as artifacts that can be analyzed. This may be accomplished through symbolic verification, simulation, or emulation on virtual machines. A group of collaborators at Microsoft decided to apply symbolic tools to use within the Azure hyperscale network environment.
When can you start?
Karthick Jayaraman joined the Microsoft Azure team in 2011 fresh out of graduate school at Syracuse University. During his graduate study, he created the Mohawk symbolic model checker for role-based access control (RBAC) policies. One of his collaborators, professor Vijay Ganesh from the University of Waterloo, was visiting Microsoft Research to deliver an update on his latest research. Wandering over to Microsoft Building 99—Microsoft Research home turf—to greet an old friend, Jayaraman wound up making a new connection; Ganesh introduced Jayaraman to Nikolaj Bjørner of Microsoft Research. It would prove to be a serendipitous encounter. Curious about what Jayaraman’s work at Azure entailed, Bjørner plied him with questions. Jayaraman’s curiosity in return was focused on finding out the best tools that might exist to help him solve challenges in Azure networking.
Jayaraman’s work in Azure exposed him to the challenge of maintaining network ACLs—policies that determine which traffic may flow in and out of a network. Network ACLs are ubiquitous in network management, used to protect private and corporate networks. They are used to limit what devices can talk to each other. Policies are captured as rules that operate at the level of bits that are set in packet headers, such as source and destination IP addresses. When a new set of servers is deployed with new IP addresses, network policies are updated to account for the new endpoints. But with policies described as a sequence of bit patterns, the complexity of manually updating thousands of devices is elusive to even the grandest master of complexity.
“Microsoft is a fantastic place to work! A few months into my job, I was working with Nikolaj, an eminent researcher, on applying software verification techniques for an important operational problem of managing firewall policies in the cloud. I couldn’t have asked for more as a college hire.” – Karthick Jayaraman
Jayaraman sensed that tools designed with verifying software in mind could be applied to the software-defined networks of Azure. He began to include colleagues from Microsoft Research in processes traditionally used for managing networks. A lasting professional partnership and camaraderie was formed.
“Nikolaj took a personal interest in mentoring me,” recalled Jayaraman. “During my years at Microsoft, much has changed; but what hasn’t changed is my relationship with Nikolaj and the kinds of problems we solve together, and this speaks volumes about Microsoft culture in general and highlights how these organic relationships and opportunity moments are the common threads.”
Connecting network policies to logical formulas that capture their intent has been a recurring topic in networking research, but prior tools were ad hoc, written for a specialized format and hardly extensible. Microsoft Research’s satisfiability modulo theories solver Z3 is a state-of-the-art theorem prover that is specifically tailored to capture domains that are found commonly in software and hardware descriptions. It is used prominently in software verification, testing, and analysis. By 2012, network verification was an area of nascent interest, and the Azure network was going through early stages of build-out. It quickly became evident that ACLs could be expressed directly as logical formulas and that the machinery that reasons about such formulas was well-suited and sufficiently efficient for checking properties of ACLs. Jayaraman and Bjørner developed the SecGuru tool, replacing manual what-if policy reviews with automated analysis for ACL updates. Despite their ostensible job titles and roles, the two developed a habit of meeting Sunday afternoons at cafés in nearby Woodinville or in the Commons on campus to exchange technical ideas. SecGuru rolled out, first as a service to catch update errors in existing firewall policies in Azure datacenters and later as a part of the massive rollout process for new Azure containers. It identified a sweet spot in network verification—the tool provided a library of routines to ingest and maintain security policies. Because policies are managed at several different levels—within datacenters, on customer virtual networks, and on all traffic that enters and leaves corporate networks—SecGuru found a multitude of uses. Over the years, it adapted to existing scenarios and new versions were developed. With SecGuru as an inspiration, their colleague Andrew Helwer developed a self-contained tool for Windows firewalls, now open source as part of the Z3 GitHub repository. It is a good example of an advanced network verification tool one can develop using Z3.
“If you view a whole network as a program, can you use program analysis to tweak these networks?” – George Varghese
Meanwhile, George Varghese had recently joined the Networking Research group at Microsoft when he reached out to peers from software engineering. Varghese, along with collaborators from Stanford University, had developed a method for verifying flow properties in networks by using header space data structures that could efficiently simulate networks on a massive number of flows. It was an exciting development that would form the basis of the Palo Alto-based Forward Networks intent-based networking startup and represented a clear example of what could be achieved by exploiting confluences between techniques from design automation and software-defined networking. Varghese created awareness of this insight with peers in the networking and verification communities; this led to several collaborations, including a close partnership with Bjørner. Also drawn to the potential of the research was Nuno Lopes, an intern at the time and now a researcher at Microsoft Research Cambridge in the U.K. Lopes’ internship project had created a general-purpose engine that could perform efficient header space operations inspired by Varghese’s earlier work. The engine involved Cambridge-based researcher Andrey Rybalchenko and professor Gordon Plotkin of The University of Edinburgh, well-known for seminal work on programming language semantics. Plotkin instigated exploiting symmetries for scaling network verification.
Symmetries are a basis for process calculi used to model concurrency, and they are inherent in cloud networks built with a redundant set of routers. The networking research community continues to embrace network verification with a welcoming culture, whether for data-plane validation, control plane verification, verifying programs that perform packet processing—noteworthy in the P4 programming language—or verifying networking stacks. Several startups in the network verification space have emerged, including Veriflow and Intentionet, addressing needs of enterprises for administrating their global networks, and other major cloud providers are now also embracing network verification.
“The joy is, first of all, it’s useful; it helps Azure avoid bugs and keep going. But it’s also intellectually stimulating because it’s not simply using verification tools that already exist, but it’s inventing new ones specialized for networking. And Nikolaj and Nuno and Andrey and Gordon and I have all enjoyed that a lot. We contributed not just to engineering but to science,” said Varghese.
“George has inspired and mentored me in so many respects. He made the point that throughout history, when somewhat unrelated areas of fields are brought together, a new starting point is created,” said Bjørner.
Varghese used the concept of abstract art as an analogy, pointing out how the development of photography influenced the painters of the day in that they began to find the accurate capturing of nature less appealing, giving rise to the new genre. Such stories also exist in technology, said Bjørner, and the arrival of a new technology will see two parties benefiting from such confluences. “The networking world benefits from verification methodologies to understand what it means to build these reliable networks with guarantees. The verification community in turn is learning tricks of the trade and domain-specific trades that will teach us what it means to build efficient tools or create efficient algorithms for these networks.”
Meanwhile, SecGuru had been extended to verifying network configurations beyond ACLs, checking reachability properties in Azure datacenters. SecGuru scales these checks to 100Ks routers, involving trillions of network paths. The checks are done on demand and in a matter of minutes for all Azure routers. It owes the efficiency to a domain insight Jayaraman developed when reviewing the datacenter architecture of Azure. The architecture uses a highly regular structure in which each router has a fixed role for a configurable set of address ranges. It suffices verifying that the roles are enforced on each router and global reachability properties follow as consequences of overall network properties. The issues that the verifier unearths include configuration bugs, software bugs, faulty hardware, and errors introduced during troubleshooting incidents. The sources for intent drift are varied and beyond the control of any single tool, but network verification can catch such drift regardless of its root cause.
“Assuring all changes and that the reality of the network always preserves our intent is critical for reliability. Reliability is critical for earning customer trust.” – Karthick Jayaraman
Azure cloud services exist in an ecosystem of constant change. Daunting amounts of new capacity are constantly deployed along with constantly emerging new technologies and infrastructure. It is an exciting time for networking research and synergies with formal methods. New uses for formal methods applied to network and configuration verification emerge. As a fresh hire, Jayaraman forged collaborations with researchers from different backgrounds and helped fuel cloud-wide innovation. It’s a testament to a culture at Microsoft that fosters collaboration and encourages creativity. Stay tuned.