{"id":939198,"date":"2023-05-16T09:00:00","date_gmt":"2023-05-16T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=939198"},"modified":"2023-05-11T10:59:06","modified_gmt":"2023-05-11T17:59:06","slug":"large-language-models-for-automatic-cloud-incident-management","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/large-language-models-for-automatic-cloud-incident-management\/","title":{"rendered":"Large-language models for automatic cloud incident management"},"content":{"rendered":"\n<p class=\"has-text-align-center h6\"><em>This research was accepted by the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/conf.researchr.org\/home\/icse-2023\" target=\"_blank\" rel=\"noopener noreferrer\">IEEE\/ACM International Conference on Software Engineering (ICSE)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which is a forum for researchers, practitioners, and educators to gather, present, and discuss the most recent innovations, trends, experiences, and issues in the field of software engineering.<\/em><\/p>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>The Microsoft 365 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/systems-innovation\/\" target=\"_blank\" rel=\"noreferrer noopener\">Systems Innovation<\/a> research group has a paper accepted at the 45<sup>th<\/sup> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/conf.researchr.org\/home\/icse-2023\" target=\"_blank\" rel=\"noopener noreferrer\">International Conference on Software Engineering (ICSE)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, widely recognized as one of the most prestigious research conferences on software engineering. 
This paper, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/recommending-root-cause-and-mitigation-steps-for-cloud-incidents-using-large-language-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models<\/a>, focuses on using state-of-the-art large language models (LLMs) to help generate recommendations for cloud incident root cause analysis and mitigation plans. With a rigorous study on real production incidents and analysis of several LLMs in different settings using semantic and lexical metrics as well as human evaluation, the research shows the efficacy and future potential of using AI for resolving cloud incidents.<\/p>\n\n\n\n<h2 id=\"challenges-of-building-reliable-cloud-services\" class=\"wp-block-heading\">Challenges of building reliable cloud services<\/h2>\n\n\n\n<p>Building highly reliable hyperscale cloud services such as Microsoft 365 (M365), which supports the productivity of hundreds of thousands of organizations, is very challenging. The difficulties include quickly <em>detecting incidents<\/em>, then performing <em>root cause analysis<\/em> and <em>mitigation<\/em>.<\/p>\n\n\n\n<p>Our recent research starts with understanding the fundamentals of production incidents: we analyze the life cycle of incidents, then determine the common root causes, mitigations, and engineering efforts for resolution. In a previous paper, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/how-to-fight-production-incidents-an-empirical-study-on-a-large-scale-cloud-service\/\" target=\"_blank\" rel=\"noreferrer noopener\">How to Fight Production Incidents? 
An Empirical Study on a Large-scale Cloud Service<\/a>, which won a Best Paper award at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/acmsocc.org\/2022\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">SoCC\u201922<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we provide a comprehensive, multi-dimensional empirical study of production incidents from Microsoft Teams. From this study, we envision that automation should support incident diagnosis and help identify the root cause and mitigation steps to quickly resolve an incident and minimize customer impact. We should also leverage past lessons to build resilience for future incidents. <em>We posit that adopting AIOps and using state-of-the-art AI\/ML technologies can help achieve both goals, as we show in the ICSE paper.<\/em><\/p>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 id=\"adapting-large-language-models-for-automated-incident-management\" class=\"wp-block-heading\">Adapting large-language models for automated incident management<\/h2>\n\n\n\n<p>Recent breakthroughs in AI have enabled LLMs to develop a rich understanding of natural language. They can understand and reason over large volumes of data and complete a diverse set of tasks, such as code completion, translation, and Q&A. 
Given the complexities of incident management, we sought to evaluate the effectiveness of LLMs in analyzing the root cause of production incidents and generating mitigation steps.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788.jpg\" alt=\"A block diagram that shows using title and summary of the incidents as input through GPT-3.x models that generate root cause and mitigation recommendations. \" class=\"wp-image-939207\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-960x540.jpg 960w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Leveraging GPT-3.x for root cause analysis and mitigation<\/figcaption><\/figure>\n\n\n\n<p>In our recently published <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/recommending-root-cause-and-mitigation-steps-for-cloud-incidents-using-large-language-models\/\" target=\"_blank\" rel=\"noreferrer noopener\">ICSE paper<\/a>, <strong>we demonstrated the usefulness of LLMs for production incident diagnosis <\/strong>for the first time. When an incident ticket is created, the author gives it a title and a summary describing relevant details, such as error messages, anomalous behavior, and other information that might help with resolution. We used the title and the summary of a given incident as the input for LLMs and generated root cause and mitigation steps, as shown in Figure 1.<\/p>\n\n\n\n<p>We conducted a rigorous study of more than 40,000 incidents generated from more than 1,000 services and compared several LLMs in zero-shot, fine-tuned, and multi-task settings. 
We find that fine-tuning the GPT-3 and GPT-3.5 models significantly improves the effectiveness of LLMs for incident data.<\/p>\n\n\n\n<h3 id=\"effectiveness-of-gpt-3-x-models-at-finding-root-causes\" class=\"wp-block-heading\">Effectiveness of GPT-3.x models at finding root causes<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td class=\"has-text-align-left\" data-align=\"left\" rowspan=\"2\">Model<\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\">BLEU-4<\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\">ROUGE-L<\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\">METEOR<\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\">BERTScore<\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\">BLEURT<\/td><td class=\"has-text-align-center\" data-align=\"center\" colspan=\"2\">NUBIA<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Top1<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top5<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top1<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top5<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top1<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top5<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top1<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top5<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top1<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top5<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top1<\/td><td class=\"has-text-align-center\" data-align=\"center\">Top5<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">RoBERTa<\/td><td class=\"has-text-align-center\" data-align=\"center\">4.21<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td 
class=\"has-text-align-center\" data-align=\"center\">12.83<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">9.89<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">85.38<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">35.66<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">33.94<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">CodeBERT<\/td><td class=\"has-text-align-center\" data-align=\"center\">3.38<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">10.17<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">6.58<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">84.88<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\">33.19<\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>39.05<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">NA<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Curie<\/td><td class=\"has-text-align-center\" data-align=\"center\">3.40<\/td><td class=\"has-text-align-center\" data-align=\"center\">6.29<\/td><td class=\"has-text-align-center\" data-align=\"center\">9.04<\/td><td class=\"has-text-align-center\" data-align=\"center\">15.44<\/td><td class=\"has-text-align-center\" data-align=\"center\">7.21<\/td><td 
class=\"has-text-align-center\" data-align=\"center\">13.65<\/td><td class=\"has-text-align-center\" data-align=\"center\">84.90<\/td><td class=\"has-text-align-center\" data-align=\"center\">86.36<\/td><td class=\"has-text-align-center\" data-align=\"center\">32.62<\/td><td class=\"has-text-align-center\" data-align=\"center\">40.08<\/td><td class=\"has-text-align-center\" data-align=\"center\">33.52<\/td><td class=\"has-text-align-center\" data-align=\"center\">49.76<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Codex<\/td><td class=\"has-text-align-center\" data-align=\"center\">3.44<\/td><td class=\"has-text-align-center\" data-align=\"center\">6.25<\/td><td class=\"has-text-align-center\" data-align=\"center\">8.98<\/td><td class=\"has-text-align-center\" data-align=\"center\">15.51<\/td><td class=\"has-text-align-center\" data-align=\"center\">7.33<\/td><td class=\"has-text-align-center\" data-align=\"center\">13.82<\/td><td class=\"has-text-align-center\" data-align=\"center\">84.85<\/td><td class=\"has-text-align-center\" data-align=\"center\">86.33<\/td><td class=\"has-text-align-center\" data-align=\"center\">32.50<\/td><td class=\"has-text-align-center\" data-align=\"center\">40.11<\/td><td class=\"has-text-align-center\" data-align=\"center\">33.64<\/td><td class=\"has-text-align-center\" data-align=\"center\">49.77<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Davinci<\/td><td class=\"has-text-align-center\" data-align=\"center\">3.34<\/td><td class=\"has-text-align-center\" data-align=\"center\">5.94<\/td><td class=\"has-text-align-center\" data-align=\"center\">8.53<\/td><td class=\"has-text-align-center\" data-align=\"center\">15.10<\/td><td class=\"has-text-align-center\" data-align=\"center\">6.67<\/td><td class=\"has-text-align-center\" data-align=\"center\">12.95<\/td><td class=\"has-text-align-center\" data-align=\"center\">83.13<\/td><td class=\"has-text-align-center\" 
data-align=\"center\">84.41<\/td><td class=\"has-text-align-center\" data-align=\"center\">31.06<\/td><td class=\"has-text-align-center\" data-align=\"center\">38.61<\/td><td class=\"has-text-align-center\" data-align=\"center\">35.28<\/td><td class=\"has-text-align-center\" data-align=\"center\">50.79<\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">Davinci-002<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>4.24<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>7.15<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>11.43<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>17.2<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>10.42<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>16.8<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>85.42<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>86.78<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>36.77<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>42.87<\/strong><\/td><td class=\"has-text-align-center\" data-align=\"center\">32.3<\/td><td class=\"has-text-align-center\" data-align=\"center\"><strong>51.34<\/strong><\/td><\/tr><tr><td class=\"has-text-align-left\" data-align=\"left\">%gain for Davinci-002<\/td><td class=\"has-text-align-center\" data-align=\"center\">23.26<\/td><td class=\"has-text-align-center\" data-align=\"center\">13.67<\/td><td class=\"has-text-align-center\" data-align=\"center\">26.44<\/td><td class=\"has-text-align-center\" data-align=\"center\">10.90<\/td><td class=\"has-text-align-center\" data-align=\"center\">42.16<\/td><td class=\"has-text-align-center\" data-align=\"center\">21.56<\/td><td class=\"has-text-align-center\" data-align=\"center\">0.61<\/td><td 
class=\"has-text-align-center\" data-align=\"center\">0.49<\/td><td class=\"has-text-align-center\" data-align=\"center\">12.72<\/td><td class=\"has-text-align-center\" data-align=\"center\">6.88<\/td><td class=\"has-text-align-center\" data-align=\"center\">-8.45<\/td><td class=\"has-text-align-center\" data-align=\"center\">1.08<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><center>Table 1: Lexical and semantic performance of different LLMs<\/center><\/figcaption><\/figure>\n\n\n\n<p>In our offline evaluation, we compared the performance of GPT-3.5 against three GPT-3 models by computing several semantic and lexical metrics (which measure text similarity) between the generated recommendations and the ground-truth root cause or mitigation steps recorded in the incident management (IcM) portal. The main findings, including the <em>average <\/em>gains for GPT-3.5 across tasks, were as follows:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li>For root cause and mitigation recommendation tasks<strong>, Davinci-002 (GPT-3.5) provided at least 15.38% and 11.9% gains over all the GPT-3 models<\/strong>, respectively, as shown in Table 1.<\/li>\n\n\n\n<li>When we generated mitigation plans by adding the root cause as input to the model, the GPT-3.5 model provided at least an 11.16% gain over the GPT-3 models.<\/li>\n\n\n\n<li>LLMs performed better on machine-reported incidents (MRIs) than on customer-reported incidents (CRIs), due to the repetitive nature of the MRIs.<\/li>\n\n\n\n<li>Fine-tuning LLMs with incident data improved performance significantly. 
A fine-tuned GPT-3.5 model improved the average lexical similarity score by 45.5% for root cause generation and 131.3% for mitigation generation tasks over the zero-shot setting (i.e., running inference directly on a pretrained GPT-3 or GPT-3.5 model).<\/li>\n<\/ol>\n\n\n\n<h3 id=\"looking-through-the-incident-owners-eyes\" class=\"wp-block-heading\">Looking through the incident owners\u2019 eyes<\/h3>\n\n\n\n<p>In addition to analysis with semantic and lexical metrics, we also interviewed the incident owners to evaluate the effectiveness of the generated recommendations. Overall, GPT-3.5 outperformed GPT-3 on a majority of the metrics. <strong>More than 70% of on-call engineers gave a rating of 3 out of 5 or better for the usefulness of recommendations in a real-time production setting.<\/strong><\/p>\n\n\n\n<h2 id=\"looking-forward\" class=\"wp-block-heading\">Looking forward<\/h2>\n\n\n\n<p>As future versions of LLMs arrive, we expect performance on automatic incident resolution to improve further, and the need for fine-tuning may decrease. Yet we are still at an early stage, with many open research questions in this field. For instance, how can we incorporate additional context about the incident, such as discussion entries, logs, service metrics, and even dependency graphs of the impacted services, to improve the diagnosis? Another challenge is staleness, since the models would need to be retrained frequently with the latest incident data. 
To solve these challenges, we are working on leveraging the latest LLMs combined with retrieval-augmented approaches to improve incident diagnosis via a conversational interface, as shown in Figure 2.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"664\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE-AIOps-figure-2n.jpg\" alt=\"A workflow diagram that shows how to use a retrieval augmentation approach to recommend root causes. This approach includes a retriever and corpus to retrieve relevant information from historical incidents, troubleshooting guides, and the engineering hub, to add context to LLMs.\" class=\"wp-image-939219\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE-AIOps-figure-2n.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE-AIOps-figure-2n-300x125.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE-AIOps-figure-2n-1024x425.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE-AIOps-figure-2n-768x319.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE-AIOps-figure-2n-1536x637.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE-AIOps-figure-2n-240x100.jpg 240w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Workflow of retrieval-augmented root cause analysis<\/figcaption><\/figure>\n\n\n\n<p>Moreover, ChatGPT can be actively integrated into the &#8220;discussion&#8221; of the incident diagnosis. 
By collecting evidence from available documents and logs, the model can generate coherent, contextual, natural-sounding responses to inquiries and offer corresponding suggestions, thereby facilitating the discussion and accelerating the incident resolution process. We believe this could deliver a step-function improvement in the overall incident management process with contextual and meaningful root cause analysis and mitigation, thereby significantly reducing the human effort required and bolstering reliability and customer satisfaction.<\/p>\n\n\n\n<h3 id=\"acknowledgement\" class=\"wp-block-heading\">Acknowledgement<\/h3>\n\n\n\n<p><em>This post includes contributions from <\/em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/sites.google.com\/site\/toufiqueparag\/home\" target=\"_blank\" rel=\"noopener noreferrer\"><em>Toufique Ahmed<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em> during his internship at Microsoft.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This research was accepted by the IEEE\/ACM International Conference on Software Engineering (ICSE) (opens in new tab), which is a forum for researchers, practitioners, and educators to gather, present, and discuss the most recent innovations, trends, experiences, and issues in the field of software engineering. 
The Microsoft 365 Systems Innovation research group has a paper [&hellip;]<\/p>\n","protected":false},"author":42183,"featured_media":939207,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Rujia Wang","user_id":"42549"},{"type":"user_nicename","value":"Chetan Bansal","user_id":"31394"},{"type":"user_nicename","value":"Supriyo GHOSH","user_id":"41179"},{"type":"user_nicename","value":"Tom Zimmermann","user_id":"34453"},{"type":"user_nicename","value":"Xuchao Zhang","user_id":"42045"},{"type":"user_nicename","value":"Saravan Rajmohan","user_id":"41039"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13545,13560],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-939198","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[793670,811276],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Rujia Wang","user_id":42549,"display_name":"Rujia Wang","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/rujiawang\/\" aria-label=\"Visit the profile page for 
Rujia Wang\">Rujia Wang<\/a>","is_active":false,"last_first":"Wang, Rujia","people_section":0,"alias":"rujiawang"},{"type":"user_nicename","value":"Chetan Bansal","user_id":31394,"display_name":"Chetan Bansal","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/chetanb\/\" aria-label=\"Visit the profile page for Chetan Bansal\">Chetan Bansal<\/a>","is_active":false,"last_first":"Bansal, Chetan","people_section":0,"alias":"chetanb"},{"type":"user_nicename","value":"Xuchao Zhang","user_id":42045,"display_name":"Xuchao Zhang","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xuchaozhang\/\" aria-label=\"Visit the profile page for Xuchao Zhang\">Xuchao Zhang<\/a>","is_active":false,"last_first":"Zhang, Xuchao","people_section":0,"alias":"xuchaozhang"},{"type":"user_nicename","value":"Saravan Rajmohan","user_id":41039,"display_name":"Saravan Rajmohan","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/saravar\/\" aria-label=\"Visit the profile page for Saravan Rajmohan\">Saravan Rajmohan<\/a>","is_active":false,"last_first":"Rajmohan, Saravan","people_section":0,"alias":"saravar"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-960x540.jpg\" class=\"img-object-cover\" alt=\"asdf\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-768x432.jpg 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-240x135.jpg 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/05\/ICSE_AIOps_StagnantHero_1400x788.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"May 16, 2023","formattedExcerpt":"This research was accepted by the IEEE\/ACM International Conference on Software Engineering (ICSE) (opens in new tab), which is a forum for researchers, practitioners, and educators to gather, present, and discuss the most recent innovations, trends, experiences, and issues in the field of software 
engineering.&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/939198","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42183"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=939198"}],"version-history":[{"count":25,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/939198\/revisions"}],"predecessor-version":[{"id":940383,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/939198\/revisions\/940383"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/939207"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=939198"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=939198"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=939198"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=939198"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=939198"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=939198"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/
wp\/v2\/msr-locale?post=939198"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=939198"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=939198"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=939198"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=939198"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}