The popular Silicon Valley cliché of having to “ship it, fix it, and ship it again” was all too familiar to my team as we focused our efforts on moving, managing, and monitoring solutions on Microsoft’s expedition to the cloud.
Hello again, and welcome back to our blog series on how our team helped move most of Microsoft’s internal workloads to the cloud on Microsoft Azure. My team in Microsoft Digital, the organization that powers, protects, and transforms Microsoft, is the company’s primary horizontal infrastructure group. We’re responsible for ensuring our internal customers have servers, storage, and databases, all the hard, crunchy bits of hosting, to run the critical applications that make the company operate internally.
It became clear we were going to have to hybridize our management solution if we were going to get Microsoft’s expedition to the cloud right.
– Pete Apple, cloud services engineer, Microsoft Digital
- The learnings, pitfalls, and compromises of Microsoft’s expedition to the cloud
- Managing Microsoft Azure solutions on Microsoft’s expedition to the cloud (this story)
- Automating Microsoft Azure incident and change management on Microsoft’s move to the cloud
- The awesome ugly truth about decentralizing operations at Microsoft with a DevOps model
- Mapping Microsoft’s expedition to the cloud with good cartography
- Microsoft uses a scream test to silence its unused servers
In this blog post I want to share what it took for us to effectively migrate solutions from on-premises to the cloud while managing and monitoring them for day-to-day operations. Go here to read the first blog in our series: The learnings, pitfalls, and compromises of Microsoft’s expedition to the cloud.
When I was running the hosting environment on-premises, our physical and virtual machine (VM) footprint was spread across multiple geographic datacenters in two primary security zones: “corporate” and “DMZ.” Corporate refers to our internally facing services that our own employees use day to day for their jobs, while the DMZ holds our partner-facing services that interact with the outside world. You might have a similar environment.
We used Microsoft System Center Operations Manager (SCOM) for monitoring and Microsoft System Center Configuration Manager (SCCM) for patching (this set of tools has been combined into Microsoft Endpoint Configuration Manager). As we started to look at moving solutions over to Microsoft Azure, it became clear we were going to have to hybridize our management solution if we were going to get Microsoft’s expedition to the cloud right.
Microsoft Azure ExpressRoute let us “lift and shift” many of our on-premises VMs to the cloud as-is, so we could operate them unchanged without disrupting our users. As more and more hosts moved from on-premises into Microsoft Azure, we eventually lifted and shifted the Microsoft System Center servers themselves, so they were also operating out of a Microsoft Azure datacenter. Fair warning: depending on the size of your environment and how quickly you’re moving VMs into the cloud, there’s a tipping point once you get more than 50 percent of your hosts into the cloud, so think about it ahead of time.
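That 50 percent rule of thumb can be sketched as a simple check. This is purely illustrative: the function names and the exact threshold are our assumptions for the sketch, not part of any actual migration tooling.

```python
# Hypothetical sketch of the "tipping point" rule of thumb described above:
# once more than half the fleet runs in the cloud, it may be time to lift
# and shift the management servers themselves. Names are illustrative.

def cloud_fraction(on_prem_vms: int, cloud_vms: int) -> float:
    """Return the fraction of the VM fleet already running in the cloud."""
    total = on_prem_vms + cloud_vms
    if total == 0:
        raise ValueError("fleet is empty")
    return cloud_vms / total

def past_tipping_point(on_prem_vms: int, cloud_vms: int,
                       threshold: float = 0.5) -> bool:
    """True once more than `threshold` of the fleet is in the cloud."""
    return cloud_fraction(on_prem_vms, cloud_vms) > threshold

# Example: 400 of 1,000 VMs migrated is below the tipping point;
# 600 of 1,000 is past it.
print(past_tipping_point(on_prem_vms=600, cloud_vms=400))  # False
print(past_tipping_point(on_prem_vms=400, cloud_vms=600))  # True
```

In practice the right threshold depends on how quickly you’re moving and how much longer the on-premises management footprint must stay alive, which is why planning for the crossover ahead of time matters.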
Along the way, we learned that, in many cases, a cloud transition coincides nicely with shifting your application team to a DevOps model of deployment and management. We realized this early, which allowed us to change our technology and site reliability engineering practices in unison. For the DMZ and other internet-facing solutions, we took a different approach: we made sure the VMs in our internet-facing environment were enrolled in Microsoft Azure Update Management, so they stayed up to date and monitored.
For teams looking to move to a modern cloud solution like PaaS or SaaS, we encourage exploring new options rather than trying to duplicate past solutions. If an application was being refactored into a cloud-native service without an operating system (and thus no SCOM/SCCM agent), we used modern monitoring solutions like Microsoft Azure Application Insights and Microsoft Azure Monitor.
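To make the shift from agent-based OS monitoring to app-level telemetry concrete, here is a minimal, standard-library-only sketch of the kind of structured event an application might emit. The event shape and every name here are stand-ins of our own; the real Application Insights SDKs define their own clients and schema.

```python
# Illustrative sketch only: a cloud-native service without an OS agent
# reports its own structured telemetry events instead of being scraped
# by SCOM/SCCM. Field names are hypothetical, not an SDK schema.
import json
import time
import uuid

def make_telemetry_event(name: str, properties: dict) -> str:
    """Build an app-level telemetry event as a JSON string."""
    event = {
        "id": str(uuid.uuid4()),          # unique event id
        "name": name,                      # e.g., "checkout.completed"
        "timestamp": time.time(),          # epoch seconds
        "properties": properties,          # custom dimensions
    }
    return json.dumps(event)

payload = make_telemetry_event(
    "checkout.completed",
    {"durationMs": 182, "region": "westus2"},
)
print(payload)
```

The design point is that the application itself describes what healthy looks like, rather than an external agent inferring it from the operating system.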
When I look back at Microsoft’s expedition to the cloud, it’s clear that we built the plane while flying it.
The evolution of moving to the cloud
Today, we in Microsoft Digital, Microsoft’s IT division, still operate a small Microsoft Endpoint Configuration Manager environment in corporate, which some teams continue to use for on-premises resources. All our Microsoft Azure resources have shifted to Azure-native management tools like Azure Monitor and Azure Update Management.
We had to learn to be flexible about management solutions because there are more options than just the simple “OS patch/monitor” world that we lived with for years.
– Pete Apple, cloud services engineer, Microsoft Digital
One pivotal lesson we learned early on was to share best practices across both our team and the company, so that no one had to make the same mistake twice. This helped us make sure we used the most current monitoring solutions and thinking each time we deployed a new application. For example, when one team started using Azure-native management, we were able to share what they learned, including how they used the Azure Update Management and Log Analytics features to improve their operations.
Additionally, once we became a hybrid operation, we had to learn to be flexible about management solutions because there are more options than just the simple “OS patch/monitor” world that we lived with for years. This transition also changed the way we handle traditional Information Technology Infrastructure Library (ITIL) change and incident management, a new set of challenges as we trekked further into the cloud, which I’ll go into next time.