<h1>The Power of Prompting</h1>

<p><em>By Eric Horvitz, Microsoft Chief Scientific Officer | Microsoft Research Blog | November 28, 2023</em></p>

<figure><img src="https://www.microsoft.com/en-us/research/wp-content/uploads/2023/11/MedPrompt-BlogHeroFeature-1400x788-1.jpg" alt="Illustrated icons of a medical bag, hexagon with circles at its points, and a chat bubble on a blue and purple gradient background." /></figure>

<p>Today, <a href="https://www.microsoft.com/en-us/research/publication/can-generalist-foundation-models-outcompete-special-purpose-tuning-case-study-in-medicine/">we published an exploration</a> of the power of prompting strategies that demonstrates how the generalist GPT-4 model can perform as a specialist on medical challenge problem benchmarks. The study shows that GPT-4 can outperform a leading model fine-tuned specifically for medical applications on the same benchmarks, and by a significant margin. These results join other recent studies showing that prompting strategies alone can be effective in evoking this kind of domain-specific expertise from generalist foundation models.</p>

<figure><img src="https://www.microsoft.com/en-us/research/wp-content/uploads/2023/11/Medqa-comp.png" alt="A visual illustration of Medprompt performance on the MedQA benchmark. Moving from left to right on a horizontal line, the illustration shows how different Medprompt components and additive contributions improve accuracy: from zero-shot at 81.7, to random few-shot at 83.9, to random few-shot with chain-of-thought at 87.3, to kNN few-shot with chain-of-thought at 88.4, to ensemble with choice shuffle at 90.2." /><figcaption>Figure 1: Visual illustration of Medprompt components and their additive contributions to performance on the MedQA benchmark. The prompting strategy combines kNN-based few-shot example selection, GPT-4–generated chain-of-thought prompting, and answer-choice shuffled ensembling.</figcaption></figure>

<p>During <a href="https://www.microsoft.com/en-us/research/publication/sparks-of-artificial-general-intelligence-early-experiments-with-gpt-4/">early evaluations</a> of the capabilities of GPT-4, we were excited to see glimmers of general problem-solving skills, with surprising <em>polymathic</em> capabilities of abstraction, generalization, and composition, including the ability to weave together concepts across disciplines. Beyond these general reasoning powers, we discovered that GPT-4 could be steered via prompting to serve as a domain-specific specialist in numerous areas. Previously, eliciting these capabilities required fine-tuning language models with specially curated data to achieve top performance in specific domains. This raises the question of whether more extensive training of generalist foundation models might reduce the need for fine-tuning.</p>

<p>In <a href="https://www.microsoft.com/en-us/research/publication/capabilities-of-gpt-4-on-medical-challenge-problems/">a study shared in March</a>, we demonstrated how very simple prompting strategies revealed GPT-4&rsquo;s strengths in medical knowledge without special fine-tuning. The results showed how the &ldquo;out-of-the-box&rdquo; model could ace a battery of medical challenge problems with basic prompts. In our more recent study, we show how composing several prompting strategies into a method we refer to as &ldquo;Medprompt&rdquo; can efficiently steer GPT-4 to top performance. In particular, we find that GPT-4 with Medprompt:</p>

<ul>
<li>Surpasses 90% on the MedQA dataset for the first time</li>
<li>Achieves top reported results on all nine benchmark datasets in the MultiMedQA suite</li>
<li>Reduces the error rate on MedQA by 27% relative to the rate reported by MedPaLM 2</li>
</ul>

<figure><img src="https://www.microsoft.com/en-us/research/wp-content/uploads/2023/11/joint_medprompt_v1.png" alt="Two charts. The chart on the left compares the performance of models using no fine-tuning or intensive fine-tuning on the MedQA benchmark: GPT-4 (Medprompt) achieves the highest result at 90.2 with no fine-tuning; Med PaLM 2 achieves 86.5 with intensive fine-tuning; these are followed by GPT-4 base at 86.1 (no fine-tuning), GPT-4 (simple prompt) at 81.7 (no fine-tuning), Med PaLM at 67.2 (intensive fine-tuning), GPT-3.5 base at 60.2 (no fine-tuning), BioMedLM at 50.3, DRAGON at 47.5, BioLinkBERT at 45.1, and PubMedBERT at 38.1 (all intensive fine-tuning). The chart on the right compares GPT-4 (Medprompt), Med PaLM 2, and GPT-4 (simple prompt) on medical challenge problems: GPT-4 with Medprompt achieves state-of-the-art results on MedQA US (4-option), MedMCQA Dev, PubMedQA Reasoning Required, MMLU Clinical Knowledge, MMLU Medical Genetics, MMLU Anatomy, MMLU Professional Medicine, MMLU College Biology, and MMLU College Medicine, outperforming Med PaLM 2 and GPT-4 (simple prompt)." /><figcaption>Figure 2: (Left) Comparison of performance on MedQA. (Right) GPT-4 with Medprompt achieves state-of-the-art performance on a wide range of medical challenge problems.</figcaption></figure>

<p>Many AI practitioners assume that specialty-centric fine-tuning is required to extend generalist foundation models to perform well in specific domains. While fine-tuning can boost performance, the process can be expensive. Fine-tuning often requires experts or professionally labeled datasets (e.g., via top clinicians in the MedPaLM project) and then computing model parameter updates. The process can be resource-intensive and cost-prohibitive, putting the approach out of reach for many small and medium-sized organizations. The Medprompt study shows the value of more deeply exploring prompting possibilities for transforming generalist models into specialists and extending the benefits of these models to new domains and applications. In an intriguing finding, the prompting methods we present appear to be valuable, without any domain-specific updates to the prompting strategy, across professional competency exams in a diversity of domains, including electrical engineering, machine learning, philosophy, accounting, law, and psychology.</p>

<p>At Microsoft, we&rsquo;ve been working on the best ways to harness the latest advances in large language models across our products and services while keeping a careful focus on understanding and addressing potential issues with the reliability, safety, and usability of applications. It has been inspiring to see the creativity and the careful integration and testing of prototypes as we continue the journey to share new AI developments with our partners and customers.</p>

<figure><img src="https://www.microsoft.com/en-us/research/wp-content/uploads/2023/11/new_domain_performance_radar.png" alt="A chart shows GPT-4 performance using three different prompting strategies on out-of-domain datasets. GPT-4 with Medprompt outperforms zero-shot and five-shot approaches across MMLU Machine Learning, MMLU Professional Psychology, MMLU Electrical Engineering, MMLU Philosophy, MMLU Professional Law, MMLU Accounting, NCLEX RegisteredNursing.com, and NCLEX Nurselabs." /><figcaption>Figure 3: GPT-4 performance with three different prompting strategies on out-of-domain datasets. Zero-shot and five-shot approaches represent baselines.</figcaption></figure>
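<p>To make the choice-shuffle ensembling component of Medprompt concrete, here is a minimal Python sketch, not taken from the study's codebase: the answer options are reshuffled for each ensemble member to counteract positional bias, the model is queried on each ordering, and the votes are tallied over the option <em>text</em> so answers map back correctly. The <code>ask_model</code> callable is a hypothetical stand-in for a GPT-4 call that returns the index of the chosen option.</p>

```python
import random
from collections import Counter

def choice_shuffle_ensemble(question, options, ask_model, k=5, seed=0):
    """Sketch of Medprompt-style choice-shuffle ensembling.

    Shuffles the answer options k times, asks the model about each
    shuffled ordering, maps every answer back to the original option
    text, and returns the majority-vote answer. `ask_model` is a
    hypothetical stand-in: ask_model(question, options) -> chosen index.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    votes = Counter()
    for _ in range(k):
        shuffled = options[:]
        rng.shuffle(shuffled)                 # fresh ordering per member
        idx = ask_model(question, shuffled)   # model sees shuffled choices
        votes[shuffled[idx]] += 1             # vote for the option text
    answer, _count = votes.most_common(1)[0]
    return answer
```

<p>A model that merely favors a fixed answer position gains no systematic advantage under this scheme, since each ensemble member sees the options in a different order; only consistent recognition of the correct option text accumulates votes.</p>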