It was time to modernize monitoring as part of Microsoft’s transition to a DevOps engineering model, and it was Dana Baxter who got the nod to lead the effort.
“At Microsoft, we want to move to native Azure services,” says Baxter, a senior service engineer in Microsoft Core Services Engineering and Operations (CSEO). Baxter’s team is centralized and provides monitoring and management operations to support infrastructure for the internal applications at Microsoft.
It was January of 2017 when Baxter got the call. “Our goal was to move to a native and fully managed solution on Azure that would free us up from managing our infrastructure,” she says. “To do this, we started by analyzing our System Center Operations Manager (SCOM) environment.”
The team had built SCOM alerts based on standard performance and common events known to create outages or issues with user experience. But there were also many custom alerts that had piled up over the years as the team dealt with unexpected situations.
“It was the kind of thing where, six years ago, an outage happened, and a manager would insist we had to build alerts so this wouldn’t happen again,” Baxter says. “Those alerts build up over many years and you have a garbage dump of alerts that never get used again, and now you have to sift through all of that technical debt.”
Baxter’s initial step was to focus on cleaning up the alerts without adding scope or new alerts. The scope of the infrastructure is massive; the team supports monitoring of more than 16,000 virtual machines and more than 750 Azure subscriptions.
“We had to simplify to move forward,” Baxter says. “We needed to clearly define that we were moving alerts from here to there and then figure out what we wanted to do next.”
To assess the old alerts, the team spent a lot of time reviewing what had led to the original tickets being created.
“We had to look at the tickets, and ask ourselves, ‘How many alerts did we get? What was the severity? How was it resolved?’” Baxter says. “Did we have an alert once six months ago and it caused an outage, or did we get it 300 times in the last week and close the ticket? All of this was to remove noise and simplify our migration.”
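The questions Baxter describes (how many times an alert fired, how severe it was, and how it was resolved) can be sketched as a simple triage pass over ticket history. This is only an illustration, not the team’s actual tooling: the `Ticket` fields, the severity scale, the alert names, and the “keep” versus “noise” rule are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    alert_name: str      # hypothetical alert identifier
    severity: int        # assumed scale: 1 is the most severe
    caused_outage: bool  # did this occurrence lead to an outage?
    action_taken: bool   # False means the ticket was closed with no fix


def triage_alerts(tickets):
    """Group tickets by alert and label each alert 'keep' or 'noise'."""
    by_alert = defaultdict(list)
    for ticket in tickets:
        by_alert[ticket.alert_name].append(ticket)

    summary = {}
    for name, group in by_alert.items():
        outages = sum(t.caused_outage for t in group)
        actioned = sum(t.action_taken for t in group)
        # Assumed rule: an alert is worth migrating if it ever signaled an
        # outage or ever required real work; otherwise treat it as noise.
        verdict = "keep" if outages or actioned else "noise"
        summary[name] = {
            "count": len(group),
            "worst_severity": min(t.severity for t in group),
            "outages": outages,
            "verdict": verdict,
        }
    return summary


# Illustrative data only: one rare but serious alert, one frequent but idle one.
tickets = [Ticket("disk_full", 1, True, True)]
tickets += [Ticket("cpu_blip", 3, False, False)] * 300

result = triage_alerts(tickets)
print(result["disk_full"]["verdict"])
print(result["cpu_blip"]["count"], result["cpu_blip"]["verdict"])
```

Under this assumed rule, an alert that fired once but caused an outage is kept, while one that fired 300 times and was always closed without action is flagged as noise, which mirrors the contrast in Baxter’s example.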
After a lot of work, the results were magical, and much more manageable.
“In the end, we went from over 100 alerts in SCOM to 15 in Azure Monitoring,” Baxter says. “This was very important; we knew the noise created by excessive alerts would frustrate and confuse our application teams.”
Migrating from SCOM to Azure Alerts was just part of the challenge.
“This is not just a technology change, but a culture change,” Baxter says. “It wasn’t only that we would remove SCOM central monitoring, but we had to tell our application teams, ‘Now you’re going to manage alerts.’”
It comes down to changing how the team works. “This is the ‘Ops’ part of DevOps,” she says.
Baxter found that a high level of support would be required to make this successful across a large organization.
“‘I don’t know how to do that,’ was a common response from engineers when we started this process,” Baxter says.
To make the transition easier, the team built a toolkit, created “how to” documentation, and offered training sessions for the application engineering teams that had previously only monitored application performance (as they had always counted on Baxter’s team for infrastructure support).
“We had to make it easy for the teams, and part of that was changing the discussion,” she says.
Baxter had to engage with the engineers in a different way, telling them, “We are going to enable you to monitor your operations,” rather than, “We aren’t monitoring your stuff anymore.”
Baxter’s team had to change as well.
The democratization of the monitoring function required central infrastructure teams to let go of ownership and responsibility for something they had managed for many years.
“We developed the concept of guardrails: just enough control to ensure application teams can be agile, while not running off the road,” Baxter says. “This goes against the very fiber of traditional IT operations and our risk-averse culture.”
While the SCOM infrastructure was retired in July of 2018, the teams’ work is not complete.
“We are continuously working to improve the toolkits and support we offer to application teams,” Baxter says. “In addition, we are the first and best customer of our own solutions at Microsoft—we work closely with the product engineering teams to ensure that these solutions exceed the needs of our customers.”