Automating Microsoft Azure incident and change management on Microsoft’s move to the cloud

Jun 10, 2021

Microsoft’s move to the cloud has certainly been an adventure.

New technology has enabled us to transform many of our IT processes, and in some cases make them entirely disappear. It’s also compelled us to reevaluate our operational health and ability to stay on pace with evolving operational functions such as monitoring and patching, architectures, and change management.

As we’ve moved to the cloud, we have been focusing on aligning the company’s IT services with the needs of the business under an operational model formally known as Information Technology Infrastructure Library (ITIL).

Historically, we would create one- to two-year architectures and be fine! Now, we’re evaluating exciting new features at least on a quarterly basis. Our team has had to learn to be agile—both literally and metaphorically.

– Pete Apple, cloud services engineer, Microsoft Digital

You may be surprised (and perhaps a bit relieved) to learn that, from the point of view of a services engineer, our design and management functions have probably evolved the least during Microsoft’s move to the cloud. There’s certainly new technology to understand and incorporate into our architectural designs, but the team doing that work has basically remained the same. It’s been a great opportunity to learn about Microsoft Azure and how it handles compute, storage, data, and networks.

[Read the rest of Microsoft’s move to the cloud series: The learnings, pitfalls, and compromises of Microsoft’s expedition to the cloud and Managing Microsoft Azure solutions on Microsoft’s expedition to the cloud.]

One thing that has certainly kept us on our toes is the constant pace of architectural change in the cloud. The Microsoft Azure team releases new features far more frequently than the traditional release cycles of the past. Historically, we would create one- to two-year architectures and be fine! Now, we’re evaluating exciting new features at least on a quarterly basis. Our team has had to learn to be agile, both in the everyday sense and in the sense of the Agile methodology.

Microsoft Azure enabled our operations to evolve and become more productive, with a faster service turnaround time. A good example is our change management discipline.

Over four years ago, we had many standard change requests from our internal customers. I was running the private cloud at the time, and you can imagine the number and variety of requests that came across my desk: “Create a virtual machine,” “Install SQL,” “Rebuild the operating system,” and so on. Each request was a change record in our system that was immediately assigned to a system engineer to do the work with a pressing service-level agreement (SLA) of 72 hours.

Sound familiar?

As we progressed further in Microsoft’s move to the cloud, we took a hard look at every change type in the internal catalog and automated everything that could be automated.

We reviewed the number and variety of change orders coming through and realized that with some scripting advances, System Center Orchestrator, Azure Resource Manager (ARM) templates, and Azure Automation, we could start automating many of these change activities. This enabled us to cut back on human error, improve the SLA, and in many cases offer a self-service approach, so that internal customers can deploy changes themselves instead of waiting on my team to implement them manually.
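To make the idea concrete, here is a minimal sketch of how a change catalog might separate automatable standard changes from those that still need an engineer. It is an illustration only, not our actual tooling: the `ChangeRequest` type, the handler functions, and the change type names are all hypothetical, and the handlers return a description instead of calling any real Azure service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ChangeRequest:
    """A hypothetical standard change request from an internal customer."""
    change_type: str
    target: str


# Hypothetical handlers. In practice each would trigger a template
# deployment or a runbook; here they only describe the action taken.
def create_vm(req: ChangeRequest) -> str:
    return f"deployed VM {req.target} from template"


def rebuild_os(req: ChangeRequest) -> str:
    return f"rebuilt operating system on {req.target}"


# The catalog: every change type that has been automated so far.
AUTOMATED_HANDLERS: Dict[str, Callable[[ChangeRequest], str]] = {
    "create_vm": create_vm,
    "rebuild_os": rebuild_os,
}


def handle_change(req: ChangeRequest) -> Tuple[str, str]:
    """Run the automated handler if one exists, else fall back to a ticket."""
    handler = AUTOMATED_HANDLERS.get(req.change_type)
    if handler is None:
        # Not yet automated: this still goes to an engineer's queue.
        return ("manual", f"ticket opened for {req.change_type} on {req.target}")
    return ("automated", handler(req))


if __name__ == "__main__":
    print(handle_change(ChangeRequest("create_vm", "web01")))
    print(handle_change(ChangeRequest("install_sql", "db01")))
```

The design choice worth noting is the fallback: anything not in the catalog still lands in a manual queue, so automation can grow one change type at a time without breaking the existing process.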

Today, Microsoft Azure services are enabling Microsoft internal teams to self-service their own changes and skip the dreaded “open a ticket” model.

On the incident side, we also found similar ways to be more efficient.


As our Microsoft Azure migrations increased, we found that our customer application developers wanted to have direct access to their Azure subscriptions to do more rapid DevOps-type deployments. This meant in many cases that they were finding and discovering issues or incidents almost instantaneously. They didn’t need to have a central team fronting incident management as much as they used to.

In response, we transitioned our incident management into a hybrid model, where the application teams can choose to have Azure Monitor and Application Insights alerts sent directly to them, while infrastructure alerts and outages still get forwarded to our centralized team. Some application teams have had to build new skills to handle service reliability activities themselves, but those same teams now see faster time to resolution and faster bug fixes. What we’ve maintained is our centralized “escalation management” function that can help manage a major incident (or in the new nomenclature, a “LiveSite”).
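The hybrid model described above can be sketched as a small routing rule. This is a simplified illustration under stated assumptions, not our production logic: the `Alert` type, the `major` flag, and the queue names `central-it` and `escalation-management` are hypothetical, and real alerts would arrive through the monitoring platform’s own notification mechanisms.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Alert:
    """A hypothetical alert record."""
    source: str          # "application" or "infrastructure"
    app_team: str        # team that owns the affected application
    major: bool = False  # assumed flag for a major incident ("LiveSite")


def route_alert(alert: Alert, opted_in_teams: Set[str]) -> str:
    """Return the queue that should receive the alert."""
    # Assumption: major incidents always go to centralized escalation management.
    if alert.major:
        return "escalation-management"
    # Application alerts go straight to teams that chose direct delivery.
    if alert.source == "application" and alert.app_team in opted_in_teams:
        return alert.app_team
    # Infrastructure alerts, and teams that have not opted in, stay central.
    return "central-it"


if __name__ == "__main__":
    teams = {"payments"}
    print(route_alert(Alert("application", "payments"), teams))
    print(route_alert(Alert("infrastructure", "payments"), teams))
    print(route_alert(Alert("application", "payments", major=True), teams))
```

Keeping the opt-in list separate from the routing rule reflects the point that direct delivery is a choice each application team makes, not a default.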

Automating incident and change management through optimized architecture may sound a bit scary, but it’s been a real benefit to our organization. Removing some of the overhead in change management has cut costs in some cases by 30 to 40 percent and increased the speed of results for customers. I used to have a 48- to 72-hour SLA for building out a customer virtual machine. Now customers can spin one up in Microsoft Azure themselves in under 30 minutes!

Enabling teams to choose to receive alerts and incidents directly into their Microsoft Azure DevOps teams and escalate to central IT only when required empowers them to resolve items that impact their business more rapidly.

Unleashing Microsoft Azure and incorporating cloud patterns into architecture designs can really save time and costs for change management efforts, while improving the SLA and customer experience. But what does it mean for subscriptions and service over time? Check back with us soon as we continue the “Operationalizing the cloud” blog series and share insights and learnings from Microsoft’s move to the cloud.

Learn how Microsoft Azure services help configure and automate operational tasks across a hybrid environment, use ARM template documentation for efficient management, and provide a framework to manage the next generation of business apps and infrastructure.
