{"id":931554,"date":"2023-04-10T09:00:00","date_gmt":"2023-04-10T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=931554"},"modified":"2023-08-29T09:48:47","modified_gmt":"2023-08-29T16:48:47","slug":"building-toward-more-autonomous-and-proactive-cloud-technologies-with-ai","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/building-toward-more-autonomous-and-proactive-cloud-technologies-with-ai\/","title":{"rendered":"Building toward more autonomous and proactive cloud technologies with AI"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1.jpg\" alt=\"Vision of AIOps Research with four quadrants (starting in the top left and proceeding clockwise): Autonomous, Proactive, Manageable, Comprehensive\" class=\"wp-image-931677\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-240x135.jpg 240w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/cloud-intelligence-aiops-infusing-ai-into-cloud-computing-systems\/\">Read part 1<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"cloud-intelligence-aiops-blog-series\">Cloud Intelligence\/AIOps blog series<\/h2>\n\n\n\n<p>In the first blog post in this series, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/cloud-intelligence-aiops-infusing-ai-into-cloud-computing-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">Cloud Intelligence\/AIOps \u2013 Infusing AI into Cloud Computing Systems<\/a>, we presented a brief overview of Microsoft\u2019s research on Cloud Intelligence\/AIOps (AIOps), which innovates AI and machine learning (ML) technologies to help design, build, and operate complex cloud platforms and services effectively and efficiently at scale. As cloud computing platforms have continued to emerge as one of the most fundamental infrastructures of our world, both their scale and complexity have grown considerably. In our previous blog post, we discussed the three major pillars of AIOps research: AI for Systems, AI for Customers, and AI for DevOps, as well as the four major research areas that constitute the AIOps problem space: detection, diagnosis, prediction, and optimization. 
We also envisioned the AIOps research roadmap as building toward creating more autonomous, proactive, manageable, and comprehensive cloud platforms.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"vision-of-aiops-research\">Vision of AIOps Research<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Autonomous<\/strong><\/td><td><strong>Proactive<\/strong><\/td><td><strong>Manageable<\/strong><\/td><td><strong>Comprehensive<\/strong><\/td><\/tr><tr><td>Fully automate the operation of cloud systems to minimize system downtime and reduce manual efforts.<\/td><td>Predict future cloud status, support proactive decision-making, and prevent bad things from happening.<\/td><td>Introduce the notion of tiered autonomy for infusing autonomous routine operations and deep human expertise.&nbsp;<\/td><td>Span AIOps to the full cloud stack for global optimization\/management and extend to multi-cloud environments.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Starting with this blog post, we will take a deeper dive into Microsoft\u2019s vision for AIOps research and the ongoing efforts to realize that vision. This blog post will focus on how our researchers leveraged state-of-the-art AIOps research to help make cloud technologies more autonomous and proactive. We will discuss our work to make the cloud more manageable and comprehensive in future blog posts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"autonomous-cloud\">Autonomous cloud<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"motivation\">Motivation<\/h3>\n\n\n\n<p>Cloud platforms require numerous actions and decisions every second to ensure that computing resources are properly managed and failures are promptly addressed. In practice, those actions and decisions are either generated by rule-based systems constructed upon expert knowledge or made manually by experienced engineers. 
Still, as cloud platforms continue to grow in both scale and complexity, it is apparent that such solutions will be insufficient for the future cloud system. On one hand, rigid rule-based systems, while being knowledge empowered, often involve huge numbers of rules and require frequent maintenance for better coverage and adaptability. Still, in practice, it is often unrealistic to keep such systems up to date as cloud systems expand in both size and complexity, and even more difficult to guarantee consistency and avoid conflicts between all the rules. On the other hand, engineering efforts are very time-consuming, prone to errors, and difficult to scale.<\/p>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>To break the constraints on the coverage and scalability of the existing solutions and improve the adaptability and manageability of the decision-making systems, cloud platforms must shift toward a more autonomous management paradigm. Instead of relying solely on expert knowledge, we need suitable AI\/ML models to fuse operational data and expert knowledge together to enable efficient, reliable, and autonomous management decisions.&nbsp;Still, it will take many research and engineering efforts to overcome various barriers for developing and deploying autonomous solutions to cloud platforms.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"toward-an-autonomous-cloud\">Toward an autonomous cloud<\/h3>\n\n\n\n<p>In the journey towards an autonomous cloud, there are two major challenges. The first challenge lies in the heterogeneity of cloud data. In practice, cloud platforms deploy a huge number of monitors to collect data in various formats, including telemetry signals, machine-generated log files, and human input from engineers and users. 
The patterns and distributions of those data are generally highly diverse and subject to change over time. To ensure that the adopted AIOps solutions can function autonomously in such an environment, it is essential to empower the management system with robust and extendable AI\/ML models capable of learning useful information from heterogeneous data sources and drawing the right conclusions in various scenarios.<\/p>\n\n\n\n<p>The complex interaction between different components and services presents another major challenge in deploying autonomous solutions. While it can be easy to implement autonomous features for one or a few components\/services, constructing end-to-end systems capable of automatically navigating the complex dependencies in cloud systems is the true challenge for both researchers and engineers. To address this challenge, it is important to leverage both domain knowledge and data to optimize the automation paths in application scenarios. Researchers and engineers should also implement reliable decision-making algorithms in every decision stage to improve the efficiency and stability of the whole end-to-end decision-making process.<\/p>\n\n\n\n<p>Over the past few years, Microsoft research groups have developed many new models and methods for overcoming those challenges and improving the level of automation in various cloud application scenarios across the AIOps problem spaces. 
Notable examples include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Detection: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/an-intelligent-end-to-end-analytics-service-for-safe-deployment-in-large-scale-cloud-infrastructure\/\" target=\"_blank\" rel=\"noreferrer noopener\">Gandalf<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/cross-dataset-time-series-anomaly-detection-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">ATAD<\/a> for the early detection of problematic deployments; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/halo-hierarchy-aware-fault-localization-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">HALO<\/a> for hierarchical fault localization; and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/onion-identifying-incident-indicating-logs-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">Onion<\/a> for detecting incident-indicating logs.<\/li>\n\n\n\n<li>Diagnosis: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/spine-a-scalable-log-parser-with-feedback-guidance\/\" target=\"_blank\" rel=\"noreferrer noopener\">SPINE<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/uniparser-a-unified-log-parser-for-heterogeneous-log-data\/\" target=\"_blank\" rel=\"noreferrer noopener\">UniParser<\/a> for log parsing; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/logic-log-intelli\" target=\"_blank\" rel=\"noreferrer noopener\">Logic<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/fighting-the-fog-of-war-automated-incident-detection-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">Warden<\/a> for regression and incident diagnosis; and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/conan-diagnosing-batch-failures-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">CONAN<\/a> for batch failure diagnosis.<\/li>\n\n\n\n<li>Prediction: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/how-long-will-it-take-to-mitigate-this-incident-for-online-service-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">TTMPred<\/a> for predicting time to mitigate incidents; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/effective-low-capacity-status-prediction-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">LCS<\/a> for predicting the low-capacity status in cloud servers; and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/spot-virtual-machine-eviction-prediction-in-microsoft-cloud-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Eviction Prediction<\/a> for predicting the eviction of spot virtual machines.<\/li>\n\n\n\n<li>Optimization: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/intelligent-container-reallocation-at-microsoft-365\/\" target=\"_blank\" rel=\"noreferrer noopener\">MLPS<\/a> for optimizing the reallocation of containers; and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/resin-a-holistic-service-for-dealing-with-memory-leaks-in-production-cloud-infrastructure\/\" target=\"_blank\" rel=\"noreferrer noopener\">RESIN<\/a> for the management of memory leaks in cloud infrastructure.<\/li>\n<\/ul>\n\n\n\n<p>These solutions not only improve service efficiency and reduce management time with more autonomous design, but also result in higher performance and reliability with fewer human errors. 
As an illustration of our work toward a more autonomous cloud, we discuss below our exploration of automatic safe deployment services.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"exemplary-scenario-automatic-safe-deployment\">Exemplary scenario: Automatic safe deployment<\/h3>\n\n\n\n<p>In online services, the continuous integration and continuous deployment (CI\/CD) of new patches and builds are critical for the timely delivery of bug fixes and feature updates. Because new deployments with undetected bugs or incompatibility issues can cause severe service outages and create significant customer impact, cloud platforms enforce strict safe-deployment procedures before releasing each new deployment to the production environments. Such procedures typically involve multi-stage testing and verification in a sequence of canary environments with increasing scopes. When a deployment-related anomaly is identified in one of these stages, the responsible deployment is rolled back for further diagnosis and fixing. Owing to the challenges of identifying deployment-related anomalies with heterogeneous patterns and managing a huge number of deployments, manually administered safe-deployment systems can be extremely costly and error prone.<\/p>\n\n\n\n<p>To support automatic and reliable anomaly detection in safe deployment, we proposed a general methodology named <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/cross-dataset-time-series-anomaly-detection-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">ATAD<\/a> for the effective detection of deployment-related anomalies in time-series signals. This method addresses the challenges of capturing changes with various patterns in time-series signals and the lack of labeled anomaly samples due to the heavy cost of labeling. 
Specifically, this method combines ideas from both transfer learning and active learning to make good use of the temporal information in the input signal and reduce the number of labeled samples required for model training. Our experiments have shown that ATAD can outperform other state-of-the-art anomaly detection approaches, even with only 1%-5% of labeled data.<\/p>\n\n\n\n<p>At the same time, we collaborated with product teams in Azure to develop and deploy <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/an-intelligent-end-to-end-analytics-service-for-safe-deployment-in-large-scale-cloud-infrastructure\/\" target=\"_blank\" rel=\"noreferrer noopener\">Gandalf<\/a>, an end-to-end automatic safe deployment system that reduces deployment time and increases the accuracy of detecting bad deployments in Azure. As a data-driven system, Gandalf monitors a large array of information, including performance metrics, failure signals, and deployment records. It also detects anomalies in various patterns throughout the entire safe-deployment process. After detecting anomalies, Gandalf applies a vote-veto mechanism to reliably determine whether each detected anomaly is caused by a specific new deployment. Gandalf then automatically decides whether the relevant new deployment should be stopped for a fix or whether it\u2019s safe enough to proceed to the next stage. 
After rolling out in Azure, Gandalf has been effective at helping to capture bad deployments, achieving more than 90% precision and near 100% recall in production over a period of 18 months.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Flow of Automatic Safe Deployment System\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"650\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-1.png\" alt=\"Flow of Automatic Safe Deployment System\" class=\"wp-image-932511\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-1.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-1-300x139.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-1-1024x475.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-1-768x357.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-1-240x111.png 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/a><figcaption class=\"wp-element-caption\">Flow of Automatic Safe Deployment System<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"proactive-cloud\">Proactive cloud<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"motivation\">Motivation<\/h3>\n\n\n\n<p>Traditional decision-making in the cloud focuses on optimizing immediate resource usage and addressing emerging issues. While this reactive design is not unreasonable in a relatively static system, it can lead to short-sighted decisions in a dynamic environment. 
In cloud platforms, both the demand and utilization of computing resources are undergoing constant changes, including regular periodical patterns, unexpected spikes, and gradual shifts in both temporal and spatial dimensions. To improve the long-term efficiency and reliability of cloud platforms, it is critical to adopt a proactive design that takes the future status of the system into account in the decision-making process.<\/p>\n\n\n\n<p>A proactive design leverages data-driven models to predict the future status of cloud platforms and enable downstream proactive decision-making. Conceptually, a typical proactive decision-making system consists of two modules: a prediction module and a decision-making module, as displayed in the following diagram.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Cloud Platform Prediction Module\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"748\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2.png\" alt=\"Cloud Platform Prediction Module\" class=\"wp-image-932514\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2-300x160.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2-1024x547.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2-768x410.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2-710x380.png 710w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-2-240x128.png 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/a><\/figure>\n\n\n\n<p>In the 
prediction module, historical data are collected and processed for training and fine-tuning the prediction model for deployment. The deployed prediction model takes in the online data stream and generates prediction results in real time. In the decision-making module, both the current system status and the predicted system status, along with other information such as domain knowledge and past decision history, are considered for making decisions that balance both present and future benefits.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"toward-proactive-design\">Toward proactive design<\/h3>\n\n\n\n<p>Proactive design, while creating new opportunities for improving the long-term efficiency and reliability of cloud systems, does expose the decision-making process to additional risks. On one hand, owing to the inherent randomness in the daily operation of cloud platforms, proactive decisions are always subject to uncertainty risk from the stochastic elements in both running systems and their environments. On the other hand, the reliability of prediction models adds another layer of risk in making proactive decisions. Therefore, to guarantee the performance of proactive design, engineers must put mechanisms in place to address those risks.<\/p>\n\n\n\n<p>To manage uncertainty risk, engineers need to reformulate the decision-making in proactive design to account for the uncertainty elements. They can often use methodological frameworks, such as prediction+optimization and optimization under chance constraints, to incorporate uncertainties into the target functions of optimization problems. Well-designed ML\/AI models can also learn uncertainty from data for improving proactive decisions against uncertainty elements. As for risks associated with the prediction model, modules for improving data quality, including quality-aware feature engineering, robust data imputation, and data rebalancing, should be applied to reduce prediction errors. 
Engineers should also make continuous efforts to improve and update the robustness of prediction models. Moreover, safeguarding mechanisms are essential to prevent decisions that may cause harm to the cloud system.<\/p>\n\n\n\n<p>Microsoft\u2019s AIOps research has pioneered the transition from reactive decision-making to proactive decision-making, especially in problem spaces of prediction and optimization. Our efforts not only lead to significant improvement in many application scenarios traditionally supported by reactive decision-making, but also create many new opportunities. Notable proactive design solutions include <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/predictive-and-adaptive-failure-mitigation-to-avert-production-cloud-vm-interruptions-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Narya<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/nenya-cascade-reinforcement-learning-for-cost-aware-failure-mitigation-at-microsoft-365\/\" target=\"_blank\" rel=\"noreferrer noopener\">Nenya<\/a> for hardware failure mitigation, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/intelligent-virtual-machine-provisioning-in-cloud-computing\/\" target=\"_blank\" rel=\"noreferrer noopener\">UAHS<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/cahs-an-effective-correlation-aware-heuristic-search-approach-for-intelligent-virtual-machine-provisioning\/\" target=\"_blank\" rel=\"noreferrer noopener\">CAHS<\/a> for the intelligent virtual machine provisioning, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/predictive-job-scheduling-under-uncertain-constraints-in-cloud-computing\/\" target=\"_blank\" rel=\"noreferrer noopener\">CUC<\/a> for the predictive scheduling of workloads, and <a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/solving-the-batch-stochastic-bin-packing-problem-in-cloud-a-chance-constrained-optimization-approach\/\" target=\"_blank\" rel=\"noreferrer noopener\">UCaC<\/a> for bin packing optimization under chance constraints. In the discussion below, we will use hardware failure mitigation as an example to illustrate how proactive design can be applied in cloud scenarios.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"exemplary-scenario-proactive-hardware-failure-mitigation\">Exemplary scenario: Proactive hardware failure mitigation<\/h3>\n\n\n\n<p>A key threat to cloud platforms is hardware failure, which can cause interruptions to the hosted services and significantly impact the customer experience. Traditionally, hardware failures are resolved only reactively, after the failure occurs, which typically involves temporary interruptions of hosted virtual machines and the repair or replacement of impacted hardware. Such a solution provides limited help in reducing negative customer experiences.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/predictive-and-adaptive-failure-mitigation-to-avert-production-cloud-vm-interruptions-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Narya<\/a> is a proactive disk-failure mitigation service capable of taking mitigation actions before failures occur. Specifically, Narya leverages ML models to predict potential disk failures and then makes decisions accordingly. To control risks associated with uncertainty, Narya evaluates candidate mitigation actions based on the estimated impacts to customers and chooses actions with minimum impact. A feedback loop also exists for collecting follow-up assessments to improve prediction and decision modules.<\/p>\n\n\n\n<p>Hardware failures in cloud systems are often highly interdependent. 
Therefore, to reduce the impact of prediction errors, Narya introduces a novel dependency-aware model to encode the dependency relationship between nodes to improve the failure prediction model. Narya also implements an adaptive approach that uses A\/B testing and bandit modeling to improve the ability to estimate the impacts of actions. Several safeguarding mechanisms in different stages of Narya are also in place to eliminate the chance of taking unsafe mitigation actions. Implementation of Narya in Azure\u2019s production environment has reduced the node hardware interruption rate for virtual machines by more than 26%.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-3.png\" alt=\"Narya's Feedback loop\" class=\"wp-image-932517\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-3.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-3-300x116.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-3-1024x395.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-3-768x296.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/04\/AIOpsP2_Plot-3-240x93.png 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<p>Our recent work, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/nenya-cascade-reinforcement-learning-for-cost-aware-failure-mitigation-at-microsoft-365\/\" target=\"_blank\" rel=\"noreferrer noopener\">Nenya<\/a>, is another example of proactive failure mitigation. Under a reinforcement learning framework, Nenya fuses prediction and decision-making modules into an end-to-end proactive decision-making system. 
It can weigh both mitigation costs and failure rates to better prioritize cost-effective mitigation actions against uncertainty. Moreover, traditional failure mitigation methods usually suffer from data imbalance issues; failure cases make up only a very small portion of all cases, most of which are healthy. Such data imbalance would introduce bias to both the prediction and decision-making process. To address this problem, Nenya adopts a cascading framework to ensure that mitigation decisions are not made with heavy costs. Experiments with Microsoft 365 data sets on database failure have shown that Nenya can reduce both mitigation costs and database failure rates compared with existing methods.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"future-work\">Future work<\/h2>\n\n\n\n<p>As management systems become more automated and proactive, it is important to pay special attention to both the safety of cloud systems and the responsibility to cloud customers. The autonomous and proactive decision system will depend heavily on advanced AI\/ML models with little manual effort. How to ensure that the decisions made by those approaches are both safe and responsible is an essential question that future work should answer.<\/p>\n\n\n\n<p>The autonomous and proactive cloud relies on effective data usage and feedback loops across all stages in the management and operation of cloud platforms. On one hand, high-quality data on the status of cloud systems are needed to enable downstream autonomous and proactive decision-making systems. On the other hand, it is important to monitor and analyze the impact of each decision on the entire cloud platform in order to improve the management system. Such feedback loops can exist simultaneously for many related application scenarios. 
Therefore, to better support an autonomous and proactive cloud, a unified data plane responsible for the processing and feedback loop can take a central role in the whole system design and should be a key area of investment.<\/p>\n\n\n\n<p>As such, the future of cloud relies not only on adopting more autonomous and proactive solutions, but also on improving the manageability of cloud systems and the comprehensive infusion of AIOps technologies over all stacks of cloud systems. In future blog posts, we will discuss how to work toward a more manageable and comprehensive cloud.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/automatic-post-deployment-management-of-cloud-applications\/\">Read part 3<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-ai-for-tiered-cloud-platform-operation\/\">Read part 4<\/a><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In the first blog post in this series, Cloud Intelligence\/AIOps \u2013 Infusing AI into Cloud Computing Systems, we presented a brief overview of Microsoft\u2019s research on Cloud Intelligence\/AIOps (AIOps), which innovates AI and machine learning (ML) technologies to help design, build, and operate complex cloud platforms and services effectively and efficiently at scale. 
As cloud [&hellip;]<\/p>\n","protected":false},"author":42183,"featured_media":931677,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13561,13556,13563,13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-931554","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-algorithms","msr-research-area-artificial-intelligence","msr-research-area-data-platform-analytics","msr-research-area-systems-and-networking","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Vision of AIOps Research with four quadrants (starting in the top left and proceeding clockwise): Autonomous, Proactive, Manageable, Comprehensive\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-300x169.jpg 300w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/03\/AIOps-2-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"April 10, 2023","formattedExcerpt":"In the first blog post in this series, Cloud Intelligence\/AIOps \u2013 Infusing AI into Cloud Computing Systems, we presented a brief overview of Microsoft\u2019s research on Cloud Intelligence\/AIOps (AIOps), which innovates AI and machine learning (ML) technologies to help design, build, and operate complex 
cloud&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/931554","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42183"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=931554"}],"version-history":[{"count":17,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/931554\/revisions"}],"predecessor-version":[{"id":964521,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/931554\/revisions\/964521"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/931677"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=931554"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=931554"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=931554"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=931554"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=931554"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=931554"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\
/msr-locale?post=931554"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=931554"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=931554"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=931554"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=931554"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}