Operationalizing the cloud

Azure unleashed: automating incident and change management through optimized architecture

Mar 13, 2018

The journey to the cloud at Microsoft has certainly been an adventure. New technology has enabled us to transform many of our IT processes and, in some cases, make them disappear entirely. It’s also compelled us to reevaluate our operational health and our ability to keep pace with evolving operational functions such as monitoring, patching, architecture, and change management.

For the last few years, like many other enterprise IT shops around the world, Microsoft has been focusing on aligning IT services with the needs of the business under an operational model formally known as the Information Technology Infrastructure Library (ITIL).

You may be surprised (and perhaps a bit relieved) to learn that, from my perspective as a services engineer, our design and management functions have probably evolved the least on our expedition to the cloud. There’s certainly new technology to understand and incorporate into our architectural designs, but the team doing that work has basically remained the same. It’s been a great opportunity to learn about Azure and how it handles compute, storage, data, and networks.

One thing that has certainly kept us on our toes has been the constant architectural change that happens in the cloud. The Azure team releases new features far more frequently than the traditional release cycles of the past. Historically, we would create one- to two-year architectures; now we’re evaluating exciting new features at least quarterly. Our team has had to learn to be agile, both in the everyday sense of the word and in the sense of the Agile methodology.

Azure enabled our operations to evolve and become more productive, with a faster service turnaround time. A good example is our change management discipline. Over four years ago we had many standard change requests from our internal customers. I was running the private cloud at the time, and you can imagine the number and variety of requests that came across my desk: “Create a Virtual Machine”, “Install SQL”, “Rebuild the OS”, etc. Each request was a change record in our system that was immediately assigned to a system engineer to do the work with a pressing service-level agreement (SLA) of 72 hours. Sound familiar?

As we trekked further into the cloud, we took a hard look at every change type in the internal catalogue and automated anything that could be automated. We reviewed the number and variety of change orders coming through and realized that with some scripting advances, System Center Orchestrator, Azure Templates, and Azure Automation, we could start automating many of these change activities. This enabled us to cut back on human error, improve SLA performance, and, in many cases, implement a self-service approach that lets internal customers deploy changes themselves instead of waiting on my team to execute them manually. Today, Azure services are enabling Microsoft internal teams to self-service their own changes and skip the dreaded “open a ticket” model.

On the incident side, we found similar ways to be more efficient. As our Azure migrations increased, we found that our end customer application developers wanted direct access to their Azure subscriptions so they could do more rapid DevOps-type deployments. In many cases, this meant they were discovering issues or incidents almost instantaneously. They didn’t need a central team acting as the front line as much as they used to. So we transitioned our incident management to a hybrid model, where the application teams can choose to have Azure Monitoring and Application Insights alerts sent directly to them, while infrastructure alerts and outages are still forwarded to our centralized team. This has required some of the application teams to build the skills to handle service reliability activities themselves, and it has also improved time to resolution and bug-fix turnaround for those same teams. What we’ve maintained is our centralized “escalation management” function that can help manage a major incident (or in the new nomenclature, a “LiveSite”).
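To make the hybrid model a little more concrete, here is a minimal Python sketch of the routing decision described above. It is an illustration only, not our actual tooling: the `Alert` type, the team names, the `opted_in_teams` set, and the `major` flag are all hypothetical, and treating every major incident as an automatic hand-off to escalation management is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class AlertKind(Enum):
    APPLICATION = "application"        # e.g. an application-level alert
    INFRASTRUCTURE = "infrastructure"  # e.g. a platform outage


@dataclass(frozen=True)
class Alert:
    kind: AlertKind
    app_team: str        # hypothetical name of the owning application team
    summary: str
    major: bool = False  # assumption: major incidents are flagged upstream


# Hypothetical recipient names, used only for this illustration.
CENTRAL_TEAM = "central-it"
ESCALATION_TEAM = "escalation-management"


def route_alert(alert, opted_in_teams):
    """Return the recipients of an alert under the hybrid model."""
    recipients = []
    if alert.kind is AlertKind.APPLICATION and alert.app_team in opted_in_teams:
        # Application teams that opted in receive their own alerts directly.
        recipients.append(alert.app_team)
    else:
        # Infrastructure alerts, and alerts for teams that did not opt in,
        # still go to the centralized team.
        recipients.append(CENTRAL_TEAM)
    if alert.major:
        # Assumption: escalation management is looped in on major incidents.
        recipients.append(ESCALATION_TEAM)
    return recipients
```

With a team such as `"payments"` opted in, an application alert for that team is routed only to `"payments"`, while an infrastructure alert for the same team still lands with the central team.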

Automating incident and change management through optimized architecture may sound a bit scary, but it’s been a real benefit to our organization. Removing some of the overhead in change management has cut costs in some cases by 30 to 40 percent and sped up delivery for end customers. I used to have a 48- to 72-hour SLA for building out a customer virtual machine. Now customers can spin one up in Azure themselves in under 30 minutes! Letting teams choose to receive alerts and incidents directly in their DevOps teams, and escalate to central IT only when required, empowers them to more rapidly resolve items that impact their business.

Unleashing Azure and incorporating cloud patterns into architecture designs can really save time and costs for change management efforts, while improving service-level agreement (SLA) performance and customer experience. But what does it mean for subscriptions and service over time? Check back with us soon as we continue the Operationalizing the cloud blog series and share insights and lessons from our own journey at Microsoft.

Learn how Azure services help configure and automate operational tasks across a hybrid environment, automate cloud infrastructure for efficient management, and provide the framework to manage the next generation of business apps and infrastructure.
