How an internal cloud migration is boosting Microsoft Azure

May 30, 2019

When Microsoft set out to move its massive internal workload of 60,000 on-premises servers to the cloud and to shutter its handful of sprawling datacenters, there was just one order from company leaders looking to go all-in on Microsoft Azure—do it quickly.

However, it was 2014, still the early days of moving large, deeply rooted enterprises like Microsoft to the cloud. And the IT pros in charge of making it happen had few tools to do it and little guidance on how to go about it.

“We had a lot to learn,” says Pete Apple, a principal service engineer in Microsoft Core Services Engineering and Operations (CSEO). “We started with a few Azure subscriptions—we were kicking the tires, figuring things out, assessing how much work we had to do.”

As it turns out, quite a bit of work. More on that in a moment.

[Go here to learn more about Microsoft’s internal migration to Azure. Click here to learn more about modern engineering leadership at Microsoft, click here to learn about Microsoft’s journey to the cloud, and click here to learn about Microsoft’s cloud strategy.]

Now, five years later, the company’s migration is 95 percent complete and the list of lessons learned is long. Five company datacenters are no more and there are fewer than 800 on-prem servers left to migrate. And that massive workload of 60,000 servers? By using modern engineering to redesign the company’s applications and by pruning unused workloads, the team has reduced that number. Microsoft is now running on 7,474 virtual machines in Azure and 1,567 virtual machines on-premises.

“What we’ve learned along the way has been rolled into the product,” Apple says. “We did go through some fits and starts, but it’s very smooth now. Our bumpy experience is now helping other companies have an easier time of it (with their own migrations).”

Vishal Mehrotra agrees. A principal program manager on the Microsoft Azure product team, Mehrotra rattled off a list of tools and services that have been added to Azure. They got their start from the work done and lessons learned during Microsoft’s internal migration. The highlight of that process was the creation of Azure Migrate and several services within it, including tools for migrating servers, databases, and third-party applications.

“We turned what Pete and team learned into several new products and services,” Mehrotra says. “Our migration tools are very much grounded in the challenges big enterprises are facing when they look at moving to the cloud, and we wouldn’t be where we are today without the work that our CSEO cloud migration team has done.”

The beauty of a decision framework

It didn’t start out that way, but the process of migrating a workload to Azure inside Microsoft is super smooth now, Apple says. He explains that everything started working better when they began using a decision tree like the one shown here.

CSEO cloud migration decision tree

A flow-chart graphic that takes the reader through the decisions the CSEO cloud migration team had to make each time it proposed moving an internal Microsoft workload to the cloud.
The cloud migration team used this decision tree to guide it through migrating the company’s 60,000 on-premises servers to the cloud. (Graphic by Marissa Stout | Showcase)

First, the CSEO migration team members asked themselves, “Are we building an entirely new experience?” If the answer was “yes,” then the decision was easy: Build a modern application that takes full advantage of all the benefits of building natively in the cloud.

If the answer was “no, we need to move an existing application to the cloud,” the decision tree got a bit more complex and required the team to answer a couple of tough questions: Do you want to take the Platform as a Service (PaaS) approach and rebuild your experience from the ground up to take full advantage of the cloud (not everyone has the time or the budget to do this)? Or do you want to take the Infrastructure as a Service (IaaS) approach and lift and shift, with a plan to rebuild in the future when it makes more sense to start fresh?

Tied to this question were two kinds of applications: those built for Microsoft by third-party vendors, and those built by CSEO or another team in Microsoft.

On the third-party side, flexibility was limited—the team would either take a PaaS approach and start fresh, or it would lift and shift to Azure IaaS.

“We had more choices with the internal applications,” Apple says, explaining that the team divvied those up between mission-critical and noncritical apps.

For the critical apps, the team first sought money and engineering time to start fresh and modernize. “That was the ideal scenario,” Apple says. If money wasn’t available, the team took an IaaS approach with a plan to modernize when feasible.

Noncritical projects were lifted and shifted and left as-is until they were no longer needed. The idea was that they would either be shut down once something new could be built to absorb their tasks, or die on the vine when they became irrelevant.

“In a lot of cases, we didn’t have the expertise to keep our noncritical apps going,” Apple says. “Many of the engineers who worked on them moved on to other teams and other projects. Our thinking was, if there is some part of the experience that became important again, we would build something new around that.”
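
The decision tree described in this section can be sketched as a short function. This is only an illustration of the branching logic as the article describes it, not CSEO's actual tooling. The class, field, and function names are hypothetical, and the `rebuild_funded` flag is an assumption standing in for "money and engineering time were available." The article does not say what decided between PaaS and IaaS for third-party applications, so using the same flag there is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Approach(Enum):
    BUILD_CLOUD_NATIVE = "build a modern application natively in the cloud"
    REBUILD_ON_PAAS = "start fresh and rebuild on PaaS"
    IAAS_MODERNIZE_LATER = "lift and shift to IaaS, modernize when feasible"
    IAAS_UNTIL_RETIRED = "lift and shift to IaaS, leave as-is until retired"


@dataclass
class Workload:
    # All fields are hypothetical labels for the questions in the decision tree.
    name: str
    is_new_experience: bool = False
    is_third_party: bool = False
    is_mission_critical: bool = False
    rebuild_funded: bool = False  # assumption: money and engineering time exist


def choose_approach(workload: Workload) -> Approach:
    """Walk the decision tree and return a migration approach."""
    # Question 1: is this an entirely new experience?
    if workload.is_new_experience:
        return Approach.BUILD_CLOUD_NATIVE

    # Existing third-party apps: limited flexibility, PaaS rebuild or IaaS.
    # Using rebuild_funded as the deciding factor here is an assumption.
    if workload.is_third_party:
        if workload.rebuild_funded:
            return Approach.REBUILD_ON_PAAS
        return Approach.IAAS_MODERNIZE_LATER

    # Existing internal apps: split by criticality.
    if workload.is_mission_critical:
        if workload.rebuild_funded:
            return Approach.REBUILD_ON_PAAS
        return Approach.IAAS_MODERNIZE_LATER

    # Noncritical internal apps are moved as-is and left until retired.
    return Approach.IAAS_UNTIL_RETIRED


if __name__ == "__main__":
    examples = [
        Workload("new-portal", is_new_experience=True),
        Workload("payroll", is_mission_critical=True, rebuild_funded=True),
        Workload("old-report-tool"),
    ]
    for w in examples:
        print(f"{w.name}: {choose_approach(w).value}")
```

Encoding the tree this way makes each branch explicit, which is roughly what the team gained from the diagram: the same questions asked in the same order for every workload.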

Getting migration right

Apple says the CSEO migration team initially thought the migration would be as simple as implementing one big lift-and-shift operation. It was a common mindset at the time: Take all your workloads and move them to the cloud as-is and figure out the rest later.

“That wasn’t the best way, for a number of reasons,” he says, adding that there was a myriad of interconnections and embedded systems to sort out first. “We quickly realized the migration was going to be far more complex than we thought.”

After a lot of rushing around, the team realized it needed to step back and think more holistically.

The first step was to figure out exactly what they had on their hands—literally. Microsoft had workloads spread across more than 10 datacenters, and no one was tracking who owned all of them or what they were being used for (or if they were being used at all).

Longtime Microsoft culture dictated that you provision whatever you thought you might need, and that you go big to make sure you covered your worst-case scenario. Once the upfront cost was covered, teams would often forget about how much it cost to keep all those servers running. With teams spinning up production, development, and test environments, the amount of untracked capacity was large and always growing.

“Sometimes, they didn’t even know what servers they were using,” Apple says. “We found people who were using test environments to run their main services.”

And figuring out who was paying for what? Good luck.

“There was a little bit of cost understanding, of what folks were thinking they had versus what they were paying for, that we had to go through,” Apple says. “Once you move to Azure, every cost is accounted for—there is complete clarity around everything that you’re paying for.”

There were some surprising discoveries.

“Why are we running an entire Exchange Server with only eight people using it? That should be on Office 365,” Apple says. “There were a lot of ‘let’s find an alternative and just retire it’ situations that we were able to work through. It was like when you open your storage facility from three years ago and suddenly realize you don’t need all the stuff you thought you needed.”

Moving to the cloud created opportunities to do many things over.

“We were able to clean up many of our long-running sins and misdemeanors,” Apple says. “We were able to fix the way firewalls were set up, lock down our ExpressRoute networks, and (we) tightened up access to our Corpnet. Moving to the cloud allowed us to tighten up our security in a big way.”

Essentially, it was a greenfield do-over opportunity.

“We didn’t do it enough, but when we did it the right way, it was very powerful,” says Heather Pfluger, a principal group manager on CSEO’s Core Platform Engineering Team, who had a front-row seat during the migration.

She says many mistakes were made along the way, which makes sense because the team was trying to both learn a new technology and change decades of ingrained thinking.

“We did dumb things,” Pfluger says. “We definitely lifted and shifted into some financial challenges. We didn’t redesign as we should have. We didn’t optimize as we should have.”

All those were learning moments, she says, pointing, as an example, to how the team now uses an optimization dashboard to buy only what it needs. It’s a change that’s saving CSEO millions of dollars.

Apple says those new understandings are making a big difference all over the company.

“We had to get people into the mindset that moving to the cloud creates new ways to do things,” he says. “We’re resetting how we run things in a lot of ways, and it’s changing how we run our businesses.”

He rattled off a long list of things the team is doing differently, including:

  • Sending events and alerts straight to DevOps teams versus to central IT operations
  • Spinning up resources in minutes for just the time needed (versus having to plan for long racking times or VMs that used to take a week to manually build out)
  • Dynamically scaling resources up and down based upon load
  • Resizing month-to-month or week-to-week based upon cyclical business rhythms versus using the old “continually running” model
  • Having some solution costs drop to zero or near zero when idle
  • Moving away from custom Windows operating system images for builds to using Azure gallery images and Azure Automation to update images
  • Creating software-defined networking configurations in the cloud versus physical network and firewall configurations that required many manual steps
  • Managing on-premises environments with Azure tools
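
Two of the items above, scaling with load and letting costs fall to zero when idle, can be illustrated with a small sketch. This is not Azure's autoscale API and not how CSEO configured its services. The function name, the per-instance capacity, and the instance cap are hypothetical values chosen only to show the idea.

```python
import math


def desired_instances(
    requests_per_second: float,
    capacity_per_instance: float = 100.0,  # hypothetical capacity of one instance
    max_instances: int = 10,               # hypothetical upper bound
) -> int:
    """Return how many instances to run for the current load."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if requests_per_second <= 0:
        # Idle: scale to zero so the workload costs nothing while unused.
        return 0
    # Round up so the load never exceeds the total capacity, then apply the cap.
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return min(needed, max_instances)


if __name__ == "__main__":
    for load in (0, 40, 250, 5000):
        print(f"{load} req/s -> {desired_instances(load)} instance(s)")
```

The contrast with the old “continually running” model is that capacity follows demand at each evaluation, rather than being provisioned once for the worst case.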

Pfluger’s team builds the telemetry tools Microsoft employees use every day.

“There is so much more we can do now,” she says, explaining that the goal is always to improve satisfaction. “We don’t want our internal users to find problems with our reporting—we want to find them ourselves and fix them so fast that our employee users never notice anything was wrong.”

And it’s starting to work.

“We’ve gotten to the point where our employee users discovering a problem is becoming more rare,” Pfluger says. “We’re getting better, but we still have a long way to go.”

Apple hopes everyone continues to learn, adjust, and find better ways to do things.

“All of our investments and innovations are now all occurring in the cloud,” he says. “The opportunity to do new and more powerful things is just immense. I’m looking forward to seeing where we go next.”

