{"id":896259,"date":"2022-11-10T09:00:00","date_gmt":"2022-11-10T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=896259"},"modified":"2023-08-29T09:47:28","modified_gmt":"2023-08-29T16:47:28","slug":"cloud-intelligence-aiops-infusing-ai-into-cloud-computing-systems","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/cloud-intelligence-aiops-infusing-ai-into-cloud-computing-systems\/","title":{"rendered":"Cloud Intelligence\/AIOps \u2013 Infusing AI into Cloud Computing Systems"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1441\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-scaled.jpg\" alt=\"The image has two circles side-by-side, each divided into three equal segments. An arrow between the two circles points from left to right to show the evolution from Microsoft\u2019s previous Software Analytics research to today\u2019s Cloud Intelligence\/AIOps.\" class=\"wp-image-897300\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-scaled.jpg 2560w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1066x600.jpg 1066w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><\/figure>\n\n\n\n<p>When legendary computer scientist Jim Gray accepted the Turing Award in 1999, he laid out a dozen long-range information technology research goals. One of those goals called for the creation of trouble-free server systems or, in Gray\u2019s words, to \u201cbuild a system used by millions of people each day and yet administered and managed by a single part-time person.\u201d&nbsp;&nbsp;<\/p>\n\n\n\n<p>Gray envisioned a self-organizing \u201cserver in the sky\u201d that would store massive amounts of data, and refresh or download data as needed. 
Today, with the emergence and rapid advancement of artificial intelligence (AI), machine learning (ML) and cloud computing, and Microsoft\u2019s development of Cloud Intelligence\/AIOps, we are closer than we have ever been to realizing that vision\u2014and moving beyond it.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Over the past fifteen years, the most significant paradigm shift in the computing industry has been the migration to cloud computing, which has created unprecedented digital transformation opportunities and benefits for business, society, and human life.&nbsp;&nbsp;<\/p>\n\n\n\n<p>The implication is profound: cloud computing platforms have become part of the world\u2019s basic infrastructure. As a result, the non-functional properties of cloud computing platforms, including availability, reliability, performance, efficiency, security, and sustainability, have become immensely important. Yet the distributed nature, massive scale, and high complexity of cloud computing platforms\u2014ranging from storage to networking, computing and beyond\u2014present huge challenges to building and operating such systems.&nbsp;&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-cloud-intelligence-aiops\">What is Cloud Intelligence\/AIOps?<\/h2>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--left\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/video\/keynote-cloud-intelligence-infusing-ai-into-cloud-computing-systems\/\" target=\"_self\" aria-label=\"Keynote: Cloud Intelligence: Infusing AI into cloud computing systems\" data-bi-type=\"annotated-link\" data-bi-cN=\"Keynote: Cloud Intelligence: Infusing AI into cloud computing systems\" class=\"annotations__list-thumbnail\" >\n\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"172\" height=\"96\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-240x135.jpg\" class=\"mb-2\" alt=\"graphical user interface, application\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/10\/F0qEla88cXo.jpg 1280w\" sizes=\"auto, (max-width: 172px) 100vw, 172px\" \/>\t\t\t\t<\/a>\n\t\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">VIDEO<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/video\/keynote-cloud-intelligence-infusing-ai-into-cloud-computing-systems\/\" data-bi-cN=\"Keynote: Cloud Intelligence: Infusing AI into cloud computing systems\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Keynote: Cloud Intelligence: Infusing AI into cloud computing systems<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" 
aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/video\/keynote-cloud-intelligence-infusing-ai-into-cloud-computing-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">Cloud Intelligence\/AIOps<\/a> (\u201cAIOps\u201d for brevity) aims to innovate AI\/ML technologies to help design, build, and operate complex cloud platforms and services at scale\u2014effectively and efficiently. &nbsp;<\/p>\n\n\n\n<p>AIOps has three pillars, each with its own goal:&nbsp;&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AI for Systems to make intelligence a built-in capability to achieve high quality, high efficiency, self-control, and self-adaptation with less human intervention.&nbsp;&nbsp;<\/li>\n\n\n\n<li>AI for Customers to leverage AI\/ML to create unparalleled user experiences and achieve exceptional user satisfaction using cloud services.&nbsp;&nbsp;<\/li>\n\n\n\n<li>AI for DevOps to infuse AI\/ML into the entire software development lifecycle to achieve high productivity.&nbsp;&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>Where did the research on AIOps begin?&nbsp;&nbsp;<\/p>\n\n\n\n<p>Gartner, a leading industry analyst firm, first coined the term <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.gartner.com\/en\/information-technology\/glossary\/aiops-artificial-intelligence-operations\" target=\"_blank\" rel=\"noopener noreferrer\">AIOps<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (Artificial Intelligence for IT Operations) in 2017. 
According to Gartner, AIOps is the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.gartner.com\/en\/documents\/3772124\" target=\"_blank\" rel=\"noopener noreferrer\">application of machine learning and data science to IT operation problems.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> While Gartner\u2019s AIOps concept focuses only on DevOps, Microsoft\u2019s Cloud Intelligence\/AIOps research has a much broader scope, including AI for Systems and AI for Customers. &nbsp;<\/p>\n\n\n\n<p>The broader scope of Microsoft\u2019s Cloud Intelligence\/AIOps stems from the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/software-analytics\/\" target=\"_blank\" rel=\"noreferrer noopener\">Software Analytics<\/a> research we proposed in 2009, which seeks to enable software practitioners to explore and analyze data to obtain insightful and actionable information for data-driven tasks related to software and services. We started to focus our Software Analytics research on cloud computing in 2014 and named this new topic Cloud Intelligence (Figure 1). In retrospect, Software Analytics is about the digital transformation of the software industry itself, such as empowering practitioners to use data-driven approaches and technologies to develop software, operate software systems, and engage with customers. &nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1320\" height=\"528\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig1.jpg\" alt=\"The image has two circles side-by-side, each divided into three equal segments. An arrow between the two circles points from left to right to show the evolution from Microsoft\u2019s previous Software Analytics research to today\u2019s Cloud Intelligence\/AIOps. 
\" class=\"wp-image-896439\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig1.jpg 1320w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig1-300x120.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig1-1024x410.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig1-768x307.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig1-240x96.jpg 240w\" sizes=\"auto, (max-width: 1320px) 100vw, 1320px\" \/><figcaption class=\"wp-element-caption\">Figure 1: From Software Analytics to Cloud Intelligence\/AIOps<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-the-aiops-problem-space\">What is the AIOps problem space?&nbsp;<\/h2>\n\n\n\n<p>There are many scenarios around each of the three pillars of AIOps. Example scenarios include predictive capacity forecasting for efficient and sustainable services, monitoring service health status, and detecting health issues in a timely manner in AI for Systems; ensuring code quality and preventing defective builds from being deployed into production in AI for DevOps; and providing effective customer support in AI for Customers. Across all these scenarios, there are four major problem categories that, taken together, constitute the AIOps problem space: detection, diagnosis, prediction, and optimization (Figure 2). Specifically, detection aims to identify unexpected system behaviors (or anomalies) in a timely manner. Given a symptom and its associated artifacts, diagnosis aims to localize service issues and find their root cause. Prediction attempts to forecast system behaviors, customer workload patterns, DevOps activities, and so on. 
Lastly, optimization tries to identify the optimal strategies or decisions required to achieve certain performance targets related to system quality, customer experience and DevOps productivity.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"864\" height=\"394\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig2_v2.jpg\" alt=\"The image has three columns, each with a stack of four items, which show the problems and challenges of AIOps and the techniques used to address them. \" class=\"wp-image-896445\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig2_v2.jpg 864w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig2_v2-300x137.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig2_v2-768x350.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/AIOps_Fig2_v2-240x109.jpg 240w\" sizes=\"auto, (max-width: 864px) 100vw, 864px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Problems and challenges of AIOps<\/figcaption><\/figure>\n\n\n\n<p>Each problem has its own challenges. Take detection as an example. To ensure service health at runtime, it is important for engineers to continuously monitor various metrics and detect anomalies in a timely manner. In the development process, to ensure the quality of the continuous integration\/continuous delivery (CI\/CD) practice, engineers need to create mechanisms to catch defective builds and prevent them from being deployed to other production sites.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Both scenarios require timely detection, and in both there are common challenges for conducting effective detection. For example, time series data and log data are the most common data forms. 
Yet they are often multi-dimensional, there may be noise in the data, and they often have different detection requirements\u2014all of which can pose significant challenges to reliable detection.&nbsp;&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"microsoft-research-our-aiops-vision\">Microsoft Research: Our AIOps vision<\/h2>\n\n\n\n<p>Microsoft is conducting continuous research in each of the AIOps problem categories. Our goal for this research is to empower cloud systems to be more autonomous, more proactive, more manageable, and more comprehensive across the entire cloud stack.&nbsp;&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"making-cloud-systems-more-autonomous\">Making cloud systems more autonomous<\/h3>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--right\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/an-intelligent-end-to-end-analytics-service-for-safe-deployment-in-large-scale-cloud-infrastructure\/\" data-bi-cN=\"An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>AIOps strives to make cloud systems more autonomous, to minimize human operations and rule-based decisions, which significantly helps reduce user impact caused by system issues, make better 
operation decisions, and reduce maintenance cost. This is achieved by automating DevOps as much as possible, including build, deployment, monitoring, and diagnosis. For example, the purpose of <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/an-intelligent-end-to-end-analytics-service-for-safe-deployment-in-large-scale-cloud-infrastructure\/\" target=\"_blank\" rel=\"noreferrer noopener\">safe deployment<\/a> is to catch a defective build early to prevent it from rolling out to production and resulting in significant customer impact. Catching such builds manually can be extremely labor intensive and time consuming for engineers, because anomalous behaviors have a variety of patterns that may change over time, and not all anomalous behaviors are caused by a new build, so attributing every anomaly to the build produces false positives. &nbsp;<\/p>\n\n\n\n<p>At Microsoft Research, we used transfer learning and active learning techniques to develop a safe deployment solution that overcomes these challenges. We\u2019ve been running the solution in Microsoft Azure, and it has been highly effective at helping to catch defective builds, achieving more than 90% precision and nearly 100% recall in production over a period of 18 months. &nbsp;<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/fast-outage-analysis-of-large-scale-production-clouds-with-service-correlation-mining\/\" target=\"_blank\" rel=\"noreferrer noopener\">Root cause analysis<\/a> is another way that AIOps is reducing human operations in cloud systems. To shorten the mitigation time, engineers in cloud systems must quickly identify the root causes of emerging incidents. Owing to the complex structure of cloud systems, however, incidents often contain only partial information and can be triggered by many services and components simultaneously, which forces engineers to spend extra time diagnosing the root causes before any effective actions can be taken. 
By leveraging advanced contrast-mining algorithms, we have implemented autonomous incident-diagnosis systems, including <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/halo-hierarchy-aware-fault-localization-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">HALO<\/a> and Outage Scope, to reduce response time and increase accuracy in <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/onion-identifying-incident-indicating-logs-for-cloud-systems\/\" target=\"_blank\" rel=\"noreferrer noopener\">incident diagnosis tasks<\/a>. These systems have been integrated into both Azure and Microsoft 365 (M365), which has considerably improved engineers\u2019 ability to handle incidents in cloud systems.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"making-cloud-systems-more-proactive\">Making cloud systems more proactive&nbsp;<\/h3>\n\n\n\n<p>AIOps makes cloud systems more proactive by introducing the concept of proactive design. In the design of a proactive system, an ML-based prediction component is added to the traditional system. This prediction component takes the input signals, does the necessary processing, and outputs the future status of the system: for example, what the capacity status of cluster A will look like next week, whether a disk will fail in a few days, or how many virtual machines (VMs) of a particular type will be needed in the next hour.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Knowing the future status makes it possible for the system to proactively avoid negative system impacts. 
For example, engineers can live migrate the services on an unhealthy computing node to a healthy one to reduce VM downtime, or <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/AAAI21_Provisioning.pdf\">pre-provision<\/a> a certain number of VMs of a particular type for the next hour to reduce the latency of <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/UAHS_IJCAI_2020_updated.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">VM provisioning<\/a>.\u202fIn addition, AI\/ML techniques can enable systems to learn over time which decision to make. &nbsp;<\/p>\n\n\n\n<p>As an example of proactive design, we built a system called Narya, which <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/predictive-and-adaptive-failure-mitigation-to-avert-production-cloud-vm-interruptions-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">proactively mitigated potential hardware failures<\/a> to reduce service interruption and minimize customer impact. Narya, which is in production in Microsoft Azure, performs prediction on hardware failures and uses a bandit algorithm to decide which mitigation action to take.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"making-cloud-systems-more-manageable\">Making cloud systems more manageable&nbsp;<\/h3>\n\n\n\n<p>AIOps makes cloud systems more manageable by introducing the notion of tiered autonomy. Each tier represents a set of operations that require a certain level of human expertise and intervention. These tiers range from the top tier of autonomous routine operations to the bottom tier, which requires deep human expertise to respond to rare and complex problems.&nbsp;&nbsp;<\/p>\n\n\n\n<p>AI-driven automation often cannot handle such problems. By building AIOps solutions targeted at each tier, we can make cloud platforms easier to manage across the long tail of rare problems that inevitably arise in complex systems. 
Furthermore, the tiered design ensures that autonomous systems are developed from the start to evaluate certainty and risk, and that they have safe fallbacks when automation fails or the platform faces a previously unseen set of circumstances, such as the unforeseen increase in demand in 2020 due to the COVID-19 pandemic.&nbsp;<\/p>\n\n\n\n<p>As an example of tiered autonomy, we built <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/sol-safe-on-node-learning-in-cloud-platforms\/\" target=\"_blank\" rel=\"noreferrer noopener\">Safe On-Node Learning (SOL)<\/a>, a framework for safe learning and actuation on server nodes for the top tier. As another example, we are exploring how to predict the commands that operators should perform to mitigate incidents, while considering the associated certainty and risks of those commands when the top-tier automation fails to prevent the incidents.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"making-aiops-more-comprehensive-across-the-cloud-stack\">Making AIOps more comprehensive across the cloud stack<\/h3>\n\n\n\n<p>AIOps can also be made more comprehensive by spanning the cloud stack\u2014from the lowest infrastructure layers (such as network and storage) through the service layer (such as the scheduler and database) and on to the application layer. The benefit of applying AIOps more broadly would be a significant increase in the capability for holistic diagnosis, optimization, and management.&nbsp;<\/p>\n\n\n\n<p>Microsoft services built on top of Azure are called first-party (1P) services. A 1P setting, which is often used to optimize system resources, is particularly suited to a more comprehensive approach to AIOps. This is because with the 1P setting a single entity has visibility into, and control over, the layers of the cloud stack, which enables engineers to amplify the AIOps impact. 
Examples of 1P services at Microsoft include large and established services such as Office 365, relatively new but sizeable services such as Teams, and up-and-coming services such as Windows 365 Cloud PC. These 1P services typically account for a significant share of resource usage, such as wide-area network (WAN) traffic and compute cores.&nbsp;<\/p>\n\n\n\n<p>As an example of applying a more comprehensive AIOps approach to the 1P setting, the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/onecogs\/\" target=\"_blank\" rel=\"noreferrer noopener\">OneCOGS<\/a> project, which is a joint effort of Azure, M365, and Microsoft Research (MSR), considers three broad opportunities for optimization: &nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Modeling users and their workload using signals that cut across the layers (such as using a user\u2019s messaging activity, rather than fixed working hours, to predict when a Cloud PC user will be active), thereby increasing accuracy to enable appropriate allocation of system resources.&nbsp;<\/li>\n\n\n\n<li>Jointly <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/optimizing-network-provisioning-through-cooperation\/\" target=\"_blank\" rel=\"noreferrer noopener\">optimizing<\/a> the application and the infrastructure to achieve cost savings and more. &nbsp;<\/li>\n\n\n\n<li>Taming the complexity of data and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/SelfTune\" target=\"_blank\" rel=\"noopener noreferrer\">configuration<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, thereby democratizing AIOps. &nbsp;<\/li>\n<\/ol>\n\n\n\n<p>The AIOps methodologies, technologies and practices used for cloud computing platforms and 1P services are also applicable to third-party (3P) services on the cloud stack. 
To achieve this, further research and development are needed to make AIOps methods and techniques more general and\/or easily adaptable. For example, when operating cloud services, detecting anomalies in multi-dimensional space and the subsequent fault localization are common monitoring and diagnosis problems.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Motivated by the real-world needs of Azure and M365, we proposed the technique AiDice, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/efficient-incident-identification-from-multi-dimensional-issue-reports-via-meta-heuristic-search\/\" target=\"_blank\" rel=\"noreferrer noopener\">which automatically detects anomalies in multi-dimensional space<\/a>, and HALO, a hierarchy-aware approach to locating fault-indicating combinations that uses telemetry data collected from cloud systems. In addition to deploying AiDice and HALO in Azure and M365, we\u2019re also collaborating with product team partners to make AiDice and HALO AIOps services that can be leveraged by third-party services.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\">Conclusion&nbsp;<\/h2>\n\n\n\n<p>AIOps is a rapidly emerging technology trend and an interdisciplinary research direction across system, software engineering, and AI\/ML communities. With years of research on Cloud Intelligence, Microsoft Research has built up rich technology assets in detection, diagnosis, prediction, and optimization. And through close collaboration with Azure and M365, we have deployed some of our technologies in production, which has created significant improvements in the reliability, performance, and efficiency of Azure and M365 while increasing the productivity of developers working on these products. In addition, we are collaborating with colleagues in academia and industry to promote the AIOps research and practices. 
For example, through these joint efforts, we have organized three editions of the AIOps Workshop at premier academic conferences: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/cloudintelligenceworkshop.org\/2020\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">AAAI 2020<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/cloudintelligenceworkshop.org\/2021\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">ICSE 2021<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/cloudintelligenceworkshop.org\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">MLSys 2022<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.&nbsp;<\/p>\n\n\n\n<p>Moving forward, we believe that as a new dimension of innovation, Cloud Intelligence\/AIOps will play an increasingly important role in making cloud systems more autonomous, more proactive, more manageable, and more comprehensive across the entire cloud stack. 
Ultimately, Cloud Intelligence\/AIOps will help us make our vision for the future of the cloud a reality.\u202f<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/building-toward-more-autonomous-and-proactive-cloud-technologies-with-ai\/\">Read part 2<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/automatic-post-deployment-management-of-cloud-applications\/\">Read part 3<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-ai-for-tiered-cloud-platform-operation\/\">Read part 4<\/a><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>When legendary computer scientist Jim Gray accepted the Turing Award in 1999, he laid out a dozen long-range information technology research goals. 
One of those goals called for the creation of trouble-free server systems or, in Gray\u2019s words, to \u201cbuild a system used by millions of people each day and yet administered and managed by [&hellip;]<\/p>\n","protected":false},"author":42183,"featured_media":897300,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Dongmei Zhang","user_id":"31665"},{"type":"user_nicename","value":"Venkat Padmanabhan","user_id":"33180"},{"type":"user_nicename","value":"Ricardo Bianchini","user_id":"33393"},{"type":"user_nicename","value":"Qingwei Lin","user_id":"33318"},{"type":"user_nicename","value":"Ranjita Bhagwan","user_id":"31217"},{"type":"user_nicename","value":"Dan Crankshaw","user_id":"38428"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-896259","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144847],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Dongmei Zhang","user_id":31665,"display_name":"Dongmei Zhang","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/dongmeiz\/\" aria-label=\"Visit the profile 
page for Dongmei Zhang\">Dongmei Zhang<\/a>","is_active":false,"last_first":"Zhang, Dongmei","people_section":0,"alias":"dongmeiz"},{"type":"user_nicename","value":"Venkat Padmanabhan","user_id":33180,"display_name":"Venkat Padmanabhan","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/padmanab\/\" aria-label=\"Visit the profile page for Venkat Padmanabhan\">Venkat Padmanabhan<\/a>","is_active":false,"last_first":"Padmanabhan, Venkat","people_section":0,"alias":"padmanab"},{"type":"user_nicename","value":"Ricardo Bianchini","user_id":33393,"display_name":"Ricardo Bianchini","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ricardob\/\" aria-label=\"Visit the profile page for Ricardo Bianchini\">Ricardo Bianchini<\/a>","is_active":false,"last_first":"Bianchini, Ricardo","people_section":0,"alias":"ricardob"},{"type":"user_nicename","value":"Qingwei Lin","user_id":33318,"display_name":"Qingwei Lin \u6797\u5e86\u7ef4","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/qlin\/\" aria-label=\"Visit the profile page for Qingwei Lin \u6797\u5e86\u7ef4\">Qingwei Lin \u6797\u5e86\u7ef4<\/a>","is_active":false,"last_first":"\u6797\u5e86\u7ef4, Qingwei Lin","people_section":0,"alias":"qlin"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-960x540.jpg\" class=\"img-object-cover\" alt=\"The image has two circles side-by-side, each divided into three equal segments. 
An arrow between the two circles points from left to right to show the evolution from Microsoft\u2019s previous Software Analytics research to today\u2019s Cloud Intelligence\/AIOps.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/11\/1400x788_AIOps_hero_image-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"November 10, 2022","formattedExcerpt":"When legendary computer scientist Jim Gray accepted the 
Turing Award in 1999, he laid out a dozen long-range information technology research goals. One of those goals called for the creation of trouble-free server systems or, in Gray\u2019s words, to \u201cbuild a system used by millions&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/896259","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42183"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=896259"}],"version-history":[{"count":19,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/896259\/revisions"}],"predecessor-version":[{"id":964518,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/896259\/revisions\/964518"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/897300"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=896259"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=896259"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=896259"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=896259"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=896259"},{"taxonomy":"msr-event-type","emb
eddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=896259"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=896259"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=896259"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=896259"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=896259"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=896259"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}