How Microsoft’s engineers learned the language of DevOps

Jun 25, 2020

In 2017, the structures, systems, and processes that governed Microsoft Core Services Engineering and Operations (CSEO) were starting to show their age.

The time for change had come.

CSEO, Microsoft’s IT and Operations division, had just begun a transition to modern engineering practices, and adopting DevOps was central to those plans. But implementing DevOps, which replaces long, intermittent software delivery cycles with much shorter cycles and continuous delivery, isn’t like flipping a switch.

It’s more like learning a new language.

“We had very disciplined, specific boundaries, and the measures of success were built around that discipline,” says James Gagnon, a principal engineer in CSEO. “Testers were measured on the number of bugs in production, developers were measured on the number of features they delivered.”

Those boundaries provided clarity on expectations for specific engineering roles. But the blended engineering model of DevOps stresses end-to-end accountability, which made those boundaries an obstacle.

“As we move to DevOps, we break down those boundaries,” Gagnon says. “We’re less concerned about who does what and more about how things get done. Engineers are now responsible for their entire service, whereas before they were responsible for performing a specific function.”

But unlearning organizational habits instilled and refined over decades doesn’t just happen. Before CSEO leadership could plant the seeds of DevOps, they needed to till the soil. Culture change was the only way forward.

[Learn how Microsoft is modernizing its internal engineering practices. Unpack Microsoft’s cloud-centric approach to transforming its internal architecture. Watch Microsoft’s session at Build on modernizing Microsoft’s engineering practices. Learn how rotating the DevOps role is helping Microsoft improve engineering service quality.]

Creating alignment around vision

The wider transformation had created the impetus to unify CSEO behind a central vision. Codifying that vision would accelerate the transition to DevOps, make the culture change explicit, and get everyone rowing in the same direction.

To that end, the leadership function was consolidated into a single role.

“We no longer had a senior leader for service engineering, security, or tests. We had a senior leader for engineering,” Gagnon says. “That was really key to getting everyone aligned.”

With that accomplished, CSEO could start taking specific steps to ease into the transition to DevOps.

“We started looking at basic metrics to drive the transformation: security hygiene and compliance metrics,” says Mary McHale, a CSEO principal group program manager. “We created a set of metrics around incident response that would transition us to a live site culture where more emphasis was placed on user impact. That included the time it took to acknowledge an incident, the time it took to mitigate an incident, and root cause analyses and post-mortems.”
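As a rough illustration of the incident response metrics McHale describes, the sketch below computes mean time to acknowledge and mean time to mitigate from a list of incident records. The `Incident` fields, the function names, and the sample timestamps are all hypothetical, and measuring both intervals from the moment an incident is opened is an assumption, not a description of CSEO's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    opened: datetime
    acknowledged: datetime
    mitigated: datetime


def mean_minutes(deltas: List[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents: List[Incident]) -> Dict[str, float]:
    """Assumption: both intervals are measured from when the incident opened."""
    return {
        "mean_time_to_acknowledge_min": mean_minutes(
            [i.acknowledged - i.opened for i in incidents]
        ),
        "mean_time_to_mitigate_min": mean_minutes(
            [i.mitigated - i.opened for i in incidents]
        ),
    }


# Made-up sample data, for demonstration only.
SAMPLE = [
    Incident(
        opened=datetime(2020, 1, 6, 9, 0),
        acknowledged=datetime(2020, 1, 6, 9, 5),
        mitigated=datetime(2020, 1, 6, 9, 45),
    ),
    Incident(
        opened=datetime(2020, 1, 7, 14, 0),
        acknowledged=datetime(2020, 1, 7, 14, 15),
        mitigated=datetime(2020, 1, 7, 15, 15),
    ),
]

if __name__ == "__main__":
    print(incident_metrics(SAMPLE))
```

Reviewing numbers like these on a fixed cadence is one way a team could track progress against such metrics over time.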

Once the metrics had been decided, a roadmap was built around them.

“Then—and this is key to our transformation—we met with our leaders every two weeks to talk about our progress toward those metrics,” McHale says. “That level of leaning in was very important for this journey and our cultural change. It was both a bottom-up and top-down movement. Leadership reviewed our progress at an operational level every two weeks and at a strategic level every three months.”

Tackling technical debt strategically

Of course, CSEO wasn’t starting with a blank slate. With so many apps and services under its care, a mountain of technical debt stood between the organization and the realization of the vision. Left in place, that debt would obstruct future projects and cost the organization agility.

“Our goal was to shift left, which meant we invested in areas that inhibited our ability to move quicker with greater levels of automation, optics, and serviceability, enabling us to fix forward as opposed to rolling back when there was an issue,” Gagnon says.

The organization prioritized debt that affected security, compliance, and productivity, in that order. “We wanted to nail the fundamentals, so we’d have a solid foundation to build on,” McHale says. “We built a pipeline to capture our progress against each of those three areas.”

Most teams funneled 20 to 40 percent of their resources into chipping away at the debt. Some went all-in, such as when the debt affected security. Occasionally, Gagnon says, the debt was so significant that starting from scratch was more efficient, so they retired some legacy platforms.

Creating a culture of autonomy

While metrics of success were well-defined at an organizational level, teams were granted a high degree of autonomy in reaching those goals, Gagnon says.

“We were aligned on the vision. We were aligned on the key outcomes we wanted to deliver on. But teams decided for themselves how they wanted to get where they needed to go,” he says. “They got to create their own identity.”

Each team also set their own deadlines.

“By allowing them to set their own deadlines, we were met with far less resistance,” Gagnon says. “We wanted teams to be empowered, and I think that helped. It created a startup culture mentality.”

Balancing autonomy with accountability

While autonomy increased velocity, accountability provided the necessary guardrails.

“DevOps is all about productivity for developers,” says Martin O’Flaherty, CSEO Azure DevOps owner. “To increase the quality and velocity of our output, we needed to drive accountability.”

A governed service catalog organizes all the services and applications CSEO builds into a hierarchy. That gives CSEO the ability to link production services and code repositories in Microsoft Azure DevOps.

“We spent significant effort cataloging our portfolio of services and components with mandatory metadata such as service owner and PII,” O’Flaherty says. “We ensure there is a hard mapping between our service catalog’s structure and the structure within Azure DevOps. Every area path in Azure DevOps has a team and associated set of repositories giving us a clear link from a service in production to the repository it’s built from in Azure DevOps. This gives a line of accountability both for remediation of compliance scans and incident response.”
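The mapping O’Flaherty describes, from a cataloged service to an area path, its team, and its repositories, can be pictured as a small lookup structure. The sketch below is only a model of that idea: the class names, fields, and sample entries are hypothetical, and it does not call the Azure DevOps API or reflect CSEO's real catalog schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Service:
    """Hypothetical catalog entry carrying the mandatory metadata."""
    name: str
    owner: str
    handles_pii: bool
    area_path: str  # assumed key linking the catalog to the work-tracking tool


@dataclass
class AreaPath:
    """Hypothetical area path record: one team plus its repositories."""
    team: str
    repositories: List[str] = field(default_factory=list)


def resolve_accountability(service: Service, area_paths: Dict[str, AreaPath]) -> dict:
    """Follow the link from a production service to its team and repositories."""
    area = area_paths.get(service.area_path)
    if area is None:
        raise LookupError(f"No area path mapped for service {service.name!r}")
    return {
        "service": service.name,
        "owner": service.owner,
        "team": area.team,
        "repositories": list(area.repositories),
    }


def unmapped_services(catalog: List[Service], area_paths: Dict[str, AreaPath]) -> List[str]:
    """List services that break the mapping: no area path, or no repositories."""
    return [
        s.name
        for s in catalog
        if s.area_path not in area_paths or not area_paths[s.area_path].repositories
    ]


# Made-up sample data, for demonstration only.
AREA_PATHS = {
    "Finance\\Invoicing": AreaPath(team="Invoicing Team", repositories=["invoicing-api"]),
}
CATALOG = [
    Service("Invoicing", "owner-a", True, "Finance\\Invoicing"),
    Service("Legacy Reports", "owner-b", False, "Finance\\Reports"),
]

if __name__ == "__main__":
    print(resolve_accountability(CATALOG[0], AREA_PATHS))
    print(unmapped_services(CATALOG, AREA_PATHS))
```

A check like `unmapped_services` is one way a strict mapping could be audited, so that a compliance finding or an incident always has a team and a repository to route to.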

At the same time, CSEO engineers were trying to acclimate to a completely new way of working: operations and development were combined into a single function.

“We don’t have designated operators that own production,” says Damon Gray, CSEO software engineering manager. “Instead, the development team owns production and does everything Site Reliability Engineering (SRE) or a service engineering team traditionally handles. The idea is to put the responsibility of owning production from deployment to incidents to root cause analysis as close to the developers and the subject matter experts as possible.”

All this added up to the desired effect: a transformed culture centered around empowerment and accountability.

Accelerating innovation by shifting left

Accountability also accelerated innovation.

“This all coincided with our move to the cloud,” Gray says. “And one of the key changes that really sped up our efforts was accountability for deployment landing on the engineering team.”

One of the benefits of the move to the cloud was robust automation capabilities. “Once we could automate deployment in full, we started looking at infrastructure as code,” Gray says. “Everything became observable and reviewable through a change management pipeline. That accelerated our ability to deliver value incrementally.”

As a result, engineers were empowered to experiment. Because of continuous deployment and active monitoring, Gagnon says, “We can fix errors in production very quickly, often before a customer experiences it. Before, we would have to do a rollback because we didn’t have the automation in place, which was an inhibitor to experimentation.”

The change also made engineers more willing to share their mistakes and what they learned from them.

“We celebrated engineers for sharing what they learned from their mistakes,” Gagnon says. “We acknowledged that behavior publicly. People feel safer when they see vulnerability in practice, and freer to seek out the clarity they need to be successful.”

That made a difference, Gray says.

“Failure is going to happen. It’s how we respond that’s been important, and because of that shift we’ve been able to try new solutions, to innovate, knowing that we have the support of leadership,” he says.

