The anatomy of shutting down a datacenter

Dec 11, 2017

The humming servers fizzled to stillness, the odor of warm, always-on circuits wafted away for the last time, and a datacenter was turned off for good.

Effectively, it died.

Marked by little fanfare, the end of an era came when Microsoft Core Services Engineering (CSE, formerly Microsoft IT) turned off its first datacenter in 2013.

Flipping that last switch to OFF and yanking the last power cord out of the wall marked the beginning of the Microsoft journey to shut down 60,000 on-premises servers it had built up over the years to support the many activities of the company’s 124,000 employees.

There were 8,000 servers in that first datacenter, a large, nondescript building in south Seattle. The datacenter was known as TK3. It was the first candidate for closure because it mainly supported test workloads.

The transformation was bumpy.

“We were doing very, very little around cloud at the time,” says Rob Beddard, a senior service engineer in CSE. “It was like we were poking at (the cloud) with a stick from a distance. Nothing majorly had moved by that time.”

Beddard says the team was aware of the company’s move to the cloud and the focus on Microsoft Azure, but to the team it felt like something someone else was going through. The team built a conservative plan for moving to the cloud, one built around testing and lots of gradual steps.

That plan was unceremoniously rejected.

“Our leadership took those plans to Jim DuBois, who was the CIO at the time,” Beddard says. “We said, ‘here’s our burndown plan—we’ll do it very carefully.’ He basically turned around and said, ‘move it all to the cloud right now – we’re not going to wait for anything.’”

Moving to the cloud

The leadership team went to Beddard and several others and said, “You’re going to help us move to the cloud. We’re going to call your work Stratus.”

It was a let’s-throw-some-bodies-at-this situation. “My boss at the time came to us and said, ‘someone needs to get their hands around this,’” Beddard says. “‘Everyone in IT is going to the cloud, and we can’t do it in a haphazard way.’”

Team Stratus was told to maintain security, build out a good governance plan, and finish yesterday.

“The first thing we asked ourselves was, ‘How do you operate in the cloud?’ We didn’t even know what we needed to move to the cloud,” he says.

The team first assessed which applications were cloud ready.

“We had to know if it would work in the cloud,” Beddard says. “Because we were targeting our dev test workloads, it was easy for us to learn, iterate rapidly, and fail fast. If we were going to fail, we wanted to fail on something that wasn’t mission critical, and then find something that worked better.”

They built a framework to look over the Microsoft IT portfolio. “We drilled all the way down to our VMs and physicals,” Beddard says. “That’s the lowest common denominator, and we had to go there if we were going to figure out how to do it right.”

Typically, a cluster of virtual machines supports a given application. Because the machines in a cluster all supported the same process, they all had to move together, which meant that if one was blocked, the entire application was blocked. “That meant one blocker could prevent us from moving several VMs,” Beddard says.

That made it impossible to move everything at once. “There was no way we were going to get to 100 percent accuracy on our analysis, so we started moving servers once we got to 70-80 percent,” Beddard says.

The team set up traffic light categories: Green meant the application was ready to migrate, those flagged yellow could be moved after modest prep, and those that got the red light were blocked—they needed major work before they could move.

“We got started moving the green right away,” he says. “If it was yellow, we would ask ourselves, ‘can we remediate it? Maybe it’s just using an old version of Windows – can we update it and go?’”

As for those clusters of VMs with a blocker? Those were given a red stoplight until the blocker could be knocked down (or until a decision was made to leave the application on-premises).
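The triage rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as the article describes it, not the Stratus team’s actual tooling: the `Light`, `VM`, `application_light`, and `triage` names, along with the sample applications, are hypothetical. The one idea it encodes is that an application’s VMs move together, so the application takes the worst status of any VM in its cluster.

```python
from dataclasses import dataclass
from enum import IntEnum


class Light(IntEnum):
    """Hypothetical traffic-light categories; higher means harder to move."""
    GREEN = 0   # ready to migrate
    YELLOW = 1  # can move after modest prep
    RED = 2     # blocked, needs major work first


@dataclass
class VM:
    name: str
    light: Light


def application_light(vms):
    """All VMs of an application move together, so the worst status wins."""
    if not vms:
        raise ValueError("an application needs at least one VM")
    return max(vm.light for vm in vms)


def triage(portfolio):
    """Group application names by their overall traffic light."""
    buckets = {light: [] for light in Light}
    for app_name, vms in portfolio.items():
        buckets[application_light(vms)].append(app_name)
    return buckets


# Made-up sample portfolio, for illustration only.
portfolio = {
    "build-lab": [VM("bl-01", Light.GREEN), VM("bl-02", Light.GREEN)],
    "test-portal": [VM("tp-01", Light.GREEN), VM("tp-02", Light.YELLOW)],
    "legacy-tool": [VM("lt-01", Light.GREEN), VM("lt-02", Light.RED)],
}
result = triage(portfolio)
```

In this sketch, a single red VM in `legacy-tool` holds back its green sibling as well, which is the “one blocker could prevent us from moving several VMs” effect Beddard describes.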

“The vast majority of the platform moved fairly easily,” he says. “We probably moved 70 percent of that stuff to Azure, and 30 percent stayed on premises at that time.”

If the project started today, everything would move to Azure right away, but Azure itself wasn’t ready at the time. “Today it is,” he says. “Over the last year and a half, we’ve proved that out.”

Culture of cloud: Getting your people onboard

As many companies moving their compute to the cloud know all too well, you need a leap of faith from the rank and file to make it work. “Typically, people think, ‘no, I can’t do this because I’m scared,’” Beddard says. “IT is pretty good at coming up with reasons why we shouldn’t do it.”

The Stratus team found that taking a data-driven approach was the best way to get employees to embrace the change. “That takes the fear out of it,” he says. “You have to show them the numbers, how it’ll be cheaper, faster, and more efficient, before they’ll sign up.”

The team was learning as it went.

“We were still thinking in an on-premises way, but that fundamentally doesn’t work in the cloud,” Beddard says. “We needed to start thinking in a new way, like ‘we don’t need to build for the worst-case scenario anymore, we just have to be ready to buy more compute when we need it, and then scale back when the big crush of work drops back to normal.’”

Another lesson the team learned is to use the right words.

“Language can dictate behavior,” Beddard says. “Blockers. If I had the ability to go back and tweak things, that might be one of the fundamental things I’d change. Calling things ‘blockers’ hurt us. It’s a negative word, and people found it really easy to throw their hands in the air, wail, and go ‘I’m blocked.’”

The story changed as the language changed.

“As we matured the process, we transitioned to more of an agile methodology, and our blockers became user stories,” he says. “That got us out of the mire, and let us actually start talking about requirements.”

Now, four years later, almost everything is in the cloud, no one is afraid, and Core Services Engineering is continuing to shut down its datacenters.

Watch Brad Wright, one of the leaders driving change at Microsoft, explain how the shift to the cloud changed everything he knew to be true about IT.

