The Autopilot Team

Darryn Dieken, Lijiang Fang, Dmitry Kachan, Pavel Kadach, Randy Kern, Paul Klemond, Dave Maltz, Peter Oosterhof, Darren Shakib, J.J. Stuckey

2013 Outstanding Technical Achievement

By optimizing the deployment and management of servers on a massive scale, the Autopilot team has quietly transformed Internet services at Microsoft.

It’s one of life’s ironies that, when you do things well, everything can run so smoothly that hardly anybody notices. Such has been the fate of Autopilot, a critical yet low-profile service that has quietly and effectively enabled high-profile Bing to enjoy exponential growth. But with the awarding of the 2013 Outstanding Technical Achievement Award, Autopilot finally steps into the spotlight.

“With the Autopilot team, so much of what gets done is deep in the system, and it’s not fully visible,” explains Technical Fellow Darren Shakib. “So it’s great for the team to be recognized for the hard work, and for making a lot of other things possible in the system.”

Sharing Autopilot’s well-deserved recognition with Shakib are fellow team members Randy Kern, Darryn Dieken, Lijiang Fang, Dmitry Kachan, Pavel Kadach, Paul Klemond, Dave Maltz, Peter Oosterhof, and J.J. Stuckey. Together they are responsible for a service that now manages hundreds of thousands of servers across worldwide datacenters.

To appreciate the enormity of the team’s achievement, we have to go way, way back into the dark ages of server management—which is to say, 2003. Microsoft was expanding Search, with Bing somewhere on the horizon. At the time, there were only a few hundred servers to worry about and their management was labor intensive. Installing a 32-bit OS meant a technician carrying a disc from machine to machine, and the network required manual configuration—with frequent outages and live site issues. On the existing infrastructure, application lifecycle was problematic: deployments were done with error-prone scripts and robocopy; bottlenecked operations teams had to troubleshoot and restart everything after failures. Simple upgrades took weeks to succeed.

And yet, Microsoft’s new plans for Search called for a then-unfathomable 5,000 servers to be installed in a single year. Under existing conditions, it was technically unfeasible. “We knew that to do Search, we were going to have to operate at a really high scale in order to get enough servers and get the software to operate on top of that,” says Shakib. In short, something needed to be done.

That something was Autopilot.

Flash forward to today, and Autopilot manages hundreds of thousands of servers. The then-unfathomable prospect of bringing up 5,000 servers in a year is something Autopilot can now exceed in a single week. In 2010 it nearly doubled the number of servers to support the partnership with Yahoo! It topped that increase in 2011, while also decommissioning nearly as many servers—the latter a major feat in and of itself. Hands-on, labor-intensive management is a thing of the past: Autopilot’s application teams are capable of performing hundreds of code deployments and tens of thousands of data deployments per day.

But of course, we do an injustice to the team’s dedication by skipping over a decade of hard work. So let’s look at the origins.

“The company, when we first started out, did not have a lot of service muscle and was not used to working at this scale. A lot of what we did in the early days was trying to figure out what Autopilot needed to do,” recalls Shakib. “We spent a lot of time talking to other people inside the company, did a lot of reading, and learning what other people were doing in the industry. And to be quite honest, even with all that research, the first iteration of Autopilot was pretty crude and had lots of issues. We made some good decisions, we made some bad decisions.

“I think the biggest thing that happened was that we were able to keep iterating rapidly. It gave us the opportunity to have a tight feedback loop and learn rapidly what was working and what wasn’t. It needed to be cheap. It needed to be efficient. It needed to operate at scale. It needed to be agile. Those are the things that kept us going and brought us to where we are today. As we’ve ramped up over the years, the number of servers we need, Autopilot has been on the front line, making sure the hardware was available, that it worked, and that it was as cost-effective as possible.”

The Autopilot team had to embrace many unproven insights to meet its challenges. For example, reliable services had to be built to run on relatively inexpensive commodity servers. Prior to Autopilot, Web properties purchased their own hardware, optimized for their own requirements. This led to a lack of fungibility—teams couldn’t exchange hardware, plans couldn’t change.

A hidden byproduct of working with commodity servers was coming to terms with the matter of failure rate. Shakib notes that many have become accustomed to modern hardware being fairly reliable. But the fact is, when you have a huge volume of inexpensive servers, a certain percentage of failures is inevitable. “Getting people used to planning and working on that was a big cultural shift,” says Shakib.

This, says Partner Development Manager Randy Kern, was much of what drove the imperative to automate and remove people from the loop. “An aspect of planning for failure is that you really need to solve these types of problems with software,” he explains. “You can’t expect human operations to handle these things because it is such a constant state of failure mode. Everybody knows it will fail, but when it does, your reaction has to be, OK, this broke, that means we need to go write some more software.”

As the Autopilot nomination statement notes, the team’s innovations span the technology stack. They brought all datacenter hardware specification in-house, with a special focus on power efficiency. Combined with close work with the datacenter operations team, this reduced datacenter costs by two-thirds. Standardization brought economies of scale and also fungibility, allowing dynamic allocation between services and thereby driving higher utilization and agility. Autopilot software now automates the entire server operational lifecycle, from power-on and OS installation, to fault detection and repair, to power cycling and vendor RMA.

When asked about these accomplishments, Shakib and Kern are quick to credit the search engine that fostered Autopilot in the first place. “I can’t stress enough how important Bing has been,” says Kern. “A lot of the ideas we had in Autopilot have only been possible coming from support through the rest of Bing. It would have been very hard for Autopilot to get to where it is today without that support and iteration and learning cycle being there.”

Shakib concurs. “I think the biggest piece by which I would measure our success is the success of the pieces that are running on top of it,” he explains. “Bing has been able to get to the scale and scope that we are, in a very competitive space, because of a lot of what we have been able to do with Autopilot.” Shakib goes on to note that this required an impressive level of trust on Bing’s part. “I can’t necessarily say that Bing is successful because of Autopilot, but Autopilot could definitely have taken Bing down if it had not worked.”

Today, of course, Autopilot extends beyond Bing, and now provides support inside and outside the Online Services Division. “Most of the major groups have used bits and pieces of Autopilot over the years,” says Shakib. “The Windows Live team, the Office team, the Xbox team, the SQL Server team.” The pieces those groups have picked up include watchdogs, commodity hardware, and of course, designing for failure.

Autopilot also brought to Microsoft the idea of planning power capacity around the power that applications actually consume, rather than the rated power capacity of each machine. Because actual consumption is typically below the rated figure, more equipment can run in the same facility. This, in turn, allows Microsoft to defer construction of new facilities and make better use of each limited power footprint.
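To make the arithmetic behind that idea concrete, here is a minimal sketch in Python. It is not Autopilot code, and every number in it (the facility budget, the rated draw, and the measured draw) is a hypothetical value chosen only for illustration. It simply compares how many servers fit in a fixed power budget when you plan on the nameplate rating versus the draw measured under the real application load.

```python
def servers_supported(facility_watts: int, per_server_watts: int) -> int:
    """Return how many servers a fixed power budget can carry."""
    return facility_watts // per_server_watts


# All figures below are hypothetical, for illustration only.
FACILITY_BUDGET_W = 1_000_000  # power available to servers in one facility
RATED_W = 500                  # nameplate (rated) draw per server
MEASURED_W = 320               # draw observed under the real application load

by_rating = servers_supported(FACILITY_BUDGET_W, RATED_W)
by_usage = servers_supported(FACILITY_BUDGET_W, MEASURED_W)

print(f"Planned on rated power:    {by_rating} servers")
print(f"Planned on measured power: {by_usage} servers")
print(f"Additional servers in the same footprint: {by_usage - by_rating}")
```

With these made-up figures, planning on measured consumption fits 3,125 servers where planning on the rating fits 2,000. A real plan would presumably also hold back headroom for load spikes, which this sketch leaves out.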

After years of operating behind the scenes, Autopilot finally steps into the spotlight with the 2013 Outstanding Technical Achievement Award. The team can take a bow for a quietly effective operation that has profoundly transformed Internet-scale services at Microsoft.

Links to more information: Autopilot: Automatic Data Center Management