In my last post, I talked about a gap in our inventory of on-premises resources that made it tougher to move to the cloud. Once we had worked through the data gaps and mapped out how our systems were interconnected, we were ready to go, right? Well, we may have underestimated our preparation just a little bit.
Before helping move things to the cloud, I owned the compute service, which manages all the physical servers and virtual machines in Core Services Engineering & Operations (CSEO). We looked for systems that were exceeding their expected performance envelope and being overtaxed. We would analyze those systems from a performance perspective and run problem management activities to get them back under control and humming along.
As part of that effort, we created a set of System Center Operations Manager collection rules to gather CPU, disk, and memory counters across our environment at five-minute intervals. We would then roll those five-minute samples up into a daily P95 value for CPU and disk for each system, and look at the trends over a 30-day period. With that data, we could identify performance trends. Was a system running hot all the time? Was it running hot only on certain days, hardly running warm at all, or maybe even sitting idle?
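The daily P95 roll-up works like this, sketched minimally in Python. The sample layout and the nearest-rank percentile method are assumptions for illustration, not the actual SCOM collection rules:

```python
import math
from datetime import datetime, timedelta

def p95(values):
    """95th percentile of a list of samples (nearest-rank method)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def daily_p95(samples):
    """Roll five-minute (timestamp, value) counter samples up into
    one P95 value per calendar day."""
    by_day = {}
    for ts, value in samples:
        by_day.setdefault(ts.date(), []).append(value)
    return {day: p95(vals) for day, vals in sorted(by_day.items())}
```

Over a 30-day window, the per-day values from `daily_p95` become the trend line used to spot systems running hot every day versus only occasionally.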
This data made it possible for us to broadly see whether any systems were starting to trend toward being “On Fire” for large periods of each month. Systems that were “On Fire” needed to be examined for performance bottlenecks, and perhaps given more CPU or storage capacity.
A side benefit of this data was that we could also recognize systems that were hardly being used. In fact, we found some systems that were just plain idle! We created a reporting table and categorized systems into one of five categories:
- On Fire
This categorization gave us an easy-to-understand language for capacity planning. When we were running our own datacenters, we could load-balance virtual machine workloads across physical hosts, taking advantage of all the hardware we had. That practice doesn’t transfer to the public cloud, where we pay for each virtual machine: spinning up virtual machines that aren’t doing much work gets expensive fast. We even noted some systems whose only CPU spike came on patch day, when they were patched and rebooted!
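A classification along these lines can be sketched as follows. Only “On Fire” is a category name taken from the reporting table described above; the other four labels and all of the thresholds here are hypothetical placeholders:

```python
def categorize(daily_p95_values):
    """Bucket a system by its daily P95 CPU over a ~30-day window.

    "On Fire" comes from the post; the other labels and every
    threshold below are hypothetical, chosen only to illustrate
    the hot / warm / idle spectrum the post describes.
    """
    days = len(daily_p95_values)
    hot_days = sum(1 for v in daily_p95_values if v >= 90.0)
    busy_days = sum(1 for v in daily_p95_values if v >= 50.0)
    if hot_days >= days * 0.5:
        return "On Fire"           # hot for large periods of the month
    if hot_days > 0:
        return "Hot on some days"  # hypothetical label
    if busy_days > 0:
        return "Warm"              # hypothetical label
    if max(daily_p95_values) >= 5.0:
        return "Mostly idle"       # hypothetical label
    return "Idle"                  # hypothetical label
```

A system that spikes only on patch day would land in the “Hot on some days” bucket here, while a truly idle machine, the kind that gets expensive in the cloud, falls to the bottom.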
Over time, we’ve moved this collection from System Center Operations Manager into OMS and Log Analytics, so we now have broad coverage across all our systems, both on-premises and in Azure IaaS. We’ve also started to incorporate the recommendations from Azure Advisor, which provides guidance on system usage and potential savings.
So — once we realized we were going to start moving unused systems into the cloud, we looked at this data and scratched our heads a bit. Did it make sense to migrate a system that was hardly used? We needed to be smart about this. More on that next time.
Learn more about our cloud journey here: