Microsoft uses a scream test to silence its unused servers

Oct 20, 2021

Microsoft Digital Perspectives

Do you have unused servers on your hands? Don’t be alarmed if I scream about it. It’ll be for a good reason (and not just because it’s almost Halloween)!

I talked previously about our efforts here in Microsoft Digital to inventory our internal-to-Microsoft on-premises environments to determine application relationships (mapping Microsoft’s expedition to the cloud with good cartography) as well as look at performance info for each system (the awesome ugly truth about decentralizing operations at Microsoft with a DevOps model).

With this info, it was time to begin making plans to move to the cloud. Looking at the data, our overall CPU usage for on-premises systems was far lower than we thought—averaging around six percent! We realized this was so low due to many underutilized systems. First things first, what to do with the systems that were “frozen,” or not being used, based upon the 0-2 percent CPU they were utilizing 24/7?
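A minimal sketch of how the "frozen" check could be expressed, assuming each server's CPU readings are available as a list of percentages. The function names and data shape are illustrative assumptions, not the tooling used at the time.

```python
from statistics import mean

# Upper edge of the "0-2 percent CPU ... 24/7" band described above.
FROZEN_MAX_CPU = 2.0


def is_frozen(cpu_samples: list[float]) -> bool:
    """Return True when every CPU sample stayed within the frozen band."""
    if not cpu_samples:
        raise ValueError("need at least one CPU sample")
    return max(cpu_samples) <= FROZEN_MAX_CPU


def fleet_average(samples_by_server: dict[str, list[float]]) -> float:
    """Average of each server's mean CPU usage, in percent."""
    return mean(mean(samples) for samples in samples_by_server.values())
```

Checking the maximum rather than the mean is a deliberate choice here: a server that idles most of the month but spikes once is not frozen, and it would belong to a different bucket.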

We created a plan to closely examine those assets, with the goal of moving as few of them as possible. We used our home-built configuration management database (CMDB) to check whether there was a recorded owner. In some cases, we were able to work with that owner and retire the system.

Before we turned even one server off, we had to be sure it wasn’t being used. (If a server is turned off and no one is there to see it, does it make a sound?)

[Read the rest of the series on Microsoft’s move to the cloud: Mapping Microsoft’s expedition to the cloud with good cartography, Automating Microsoft Azure incident and change management on Microsoft’s move to the cloud, The learnings, pitfalls, and compromises of Microsoft’s expedition to the cloud, Managing Microsoft Azure solutions on Microsoft’s expedition to the cloud, and The awesome ugly truth about decentralizing operations at Microsoft with a DevOps model.]

Developing a scream test

But what if the owner information was wrong? Or what if that person had moved on? For those, we created a new process: the Scream Test. (Bwahahahahaaaa!)

What’s the Scream Test? Well, in our case it was a multistep process:

  1. Display the message “Hey, is this your server? Contact us.” on the sign-in splash page for two weeks.
  2. Restart the server once each day for two weeks to see whether someone opens a ticket (in other words, screams).
  3. Shut down the server for two weeks and see whether someone opens a ticket. (Again, whether they scream.)
  4. Retire the server, retaining the storage for a period, just in case.
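The four steps above can be read as a small state machine. The sketch below is one hypothetical way to model it: the stage names, the `ScreamTest` record, and the `advance` function are assumptions made for illustration, and the only facts taken from the process are the stage order and the two-week duration of each stage.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Stage(Enum):
    BANNER = "banner"                # step 1: message on the sign-in page
    DAILY_RESTART = "daily_restart"  # step 2: restart once each day
    SHUTDOWN = "shutdown"            # step 3: server powered off
    RETIRED = "retired"              # step 4: retired, storage retained
    CLAIMED = "claimed"              # someone screamed (opened a ticket)


STAGE_LENGTH = timedelta(weeks=2)
ORDER = [Stage.BANNER, Stage.DAILY_RESTART, Stage.SHUTDOWN, Stage.RETIRED]


@dataclass(frozen=True)
class ScreamTest:
    server: str
    stage: Stage
    stage_started: date


def advance(test: ScreamTest, today: date, ticket_opened: bool) -> ScreamTest:
    """Move a server to its next stage, or stop the test if an owner appears."""
    if test.stage in (Stage.RETIRED, Stage.CLAIMED):
        return test  # terminal states
    if ticket_opened:
        return ScreamTest(test.server, Stage.CLAIMED, today)
    if today - test.stage_started >= STAGE_LENGTH:
        next_stage = ORDER[ORDER.index(test.stage) + 1]
        return ScreamTest(test.server, next_stage, today)
    return test
```

Running `advance` once a day per server would be enough to drive the process. A ticket at any stage short-circuits the test, so a server is retired only after six quiet weeks.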

With this effort, we were able to retire far more unused servers—around 15 percent—than we had expected, without worrying about moving them to the cloud. Winning! We also were able to reclaim more resources on some of the Hyper-V hosts that were slated to continue running on-premises. And as a final benefit, we cleaned up our CMDB a bit!

In parallel, we initiated an effort to look at some of the systems that were infrequently used or used a very low level of CPU (less than 10 percent, or “Cold”). From that, we had two outcomes that proved critical for our successful migration to the cloud.

The first was to identify the systems in our on-premises environments that were oversized. People had purchased physical machines or sized virtual machines according to what they thought the load would be, and either that estimate was incorrect or the load diminished over time. We took this data and created a set of recommended Azure VM sizes for every on-premises system to use for migration. In other words, we downsized on the way to the cloud versus after the fact.
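To make the right-sizing idea concrete, here is a rough sketch that picks the smallest size covering a system's observed peak load plus some headroom. Everything specific in it is an assumption: the three-entry catalog is a hypothetical stand-in for real Azure VM sizes, and the 30 percent headroom is an arbitrary illustrative value.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    name: str
    vcpus: int
    memory_gib: int


# Hypothetical catalog, ordered smallest first (not real Azure SKUs).
CATALOG = [Size("small", 2, 8), Size("medium", 4, 16), Size("large", 8, 32)]
HEADROOM = 1.3  # assumed safety margin over the observed peak


def recommend(current_vcpus: int, peak_cpu_percent: float,
              peak_memory_gib: float) -> Size:
    """Pick the smallest catalog size that covers the observed peak load."""
    needed_vcpus = math.ceil(current_vcpus * peak_cpu_percent / 100 * HEADROOM)
    needed_memory = peak_memory_gib * HEADROOM
    for size in CATALOG:
        if size.vcpus >= needed_vcpus and size.memory_gib >= needed_memory:
            return size
    raise ValueError("no catalog size is large enough")
```

For example, a 16-core machine that peaks at 10 percent CPU and 5 GiB of memory maps to the "medium" entry in this sketch, a quarter of its on-premises core count.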

At the time, we did much of this work by hand because we were early adopters. Microsoft now has a number of great products that assist with this inventory and review of your on-premises environment. To learn more, check out the Azure Migrate documentation.

Another statistic that the data revealed was the number of systems that were used for only a few days or a week out of each month. Development machines, test/QA machines, and user acceptance testing machines reserved for final verification before moving code to production were used for only short periods. The machines were on continuously in the datacenter, mind you, but they were actually being used for only short periods each month.
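A minimal sketch of how such part-time systems might be spotted, assuming one peak-CPU reading per server per day. The 10 percent activity threshold and the seven-day cutoff are illustrative assumptions chosen to match "a few days or a week out of each month".

```python
from collections import defaultdict
from datetime import date

ACTIVE_CPU = 10.0    # assumed: a day counts as "used" at or above this peak
MAX_ACTIVE_DAYS = 7  # assumed: at most about a week of use per month


def active_days_per_month(
    daily_peak_cpu: dict[date, float],
) -> dict[tuple[int, int], int]:
    """Count the days in each (year, month) on which the server was used."""
    counts: dict[tuple[int, int], int] = defaultdict(int)
    for day, peak in daily_peak_cpu.items():
        counts[(day.year, day.month)] += 1 if peak >= ACTIVE_CPU else 0
    return dict(counts)


def is_intermittent(daily_peak_cpu: dict[date, float]) -> bool:
    """True when the server is used, but only a few days in every month."""
    counts = active_days_per_month(daily_peak_cpu).values()
    return any(n > 0 for n in counts) and all(n <= MAX_ACTIVE_DAYS for n in counts)
```

Servers flagged this way are candidates for running only on demand, rather than around the clock.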

For these, we investigated ways to run those systems only when required, investing in two technologies: Azure Resource Manager templates and Azure Automation. But that is a story for next time. Until then, happy Halloween!
