Machine Learning for Systems and Tiered AIOps

Established: September 1, 2016

In this project, we investigate the use of Machine Learning (ML) to improve computer systems, and cloud platforms in particular, rather than to mimic human behavior. As a first step in this direction, we built Resource Central, a general ML and prediction-serving system that we have deployed in all Azure Compute clusters worldwide. It trains ML models offline and uses them to produce predictions online. Other Azure components can use these predictions to improve resource, performance, and availability management. For example, the server defragmentation engine and the VM scheduler are two of the platform components that already use predictions from Resource Central (e.g., VM lifetime, VM migration blackout/brownout times) in production.

We have recently expanded the project’s scope to AIOps, and in particular to the notion of Tiered AIOps. The goal is to create systems support for ML-driven management of cloud platforms and for manual intervention by non-experts when the ML fails in any way. As part of the Tiered AIOps effort, we have been exploring the use of large language models.

This project is a close collaboration between Azure and E+D.