<h1>Steering at the Frontier: Extending the Power of Prompting</h1>

<figure><img src="https://www.microsoft.com/en-us/research/wp-content/uploads/2023/12/Steeering-BlogHeroFeature-1400x788-1.jpg" alt="three conversation bubbles on a blue, purple, and pink gradient background" width="1400" height="788" /></figure>

<p>We’re seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise. Even seasoned AI researchers have been impressed by the ability to steer these models with straightforward, zero-shot prompts. Going beyond basic, out-of-the-box prompting, we’ve been exploring new prompting strategies, showcased in our <a href="https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/">Medprompt</a> work, to evoke the powers of specialists.</p>

<p>Today, we’re sharing information on Medprompt and other approaches to steering frontier models in <a href="https://github.com/microsoft/promptbase"><em>promptbase</em></a>, a collection of resources on GitHub. Our goal is to provide engineers and customers with information and tools for eliciting the best performance from foundation models. We’ll start by including scripts that enable replication of our results using the prompting strategies presented here.
We’ll be adding more sophisticated general-purpose tools and information over the coming weeks.</p>

<p>As an illustration of the capabilities of frontier models, and of the opportunity to reach state-of-the-art (SoTA) results by steering GPT-4, we’ll review SoTA results on the benchmarks that Google chose for evaluating Gemini Ultra. Our end-to-end exploration, prompt design, and computation of performance took just a couple of days.</p>

<p>Let’s focus on the well-known <a href="https://arxiv.org/abs/2009.03300">MMLU</a> (Measuring Massive Multitask Language Understanding) benchmark, established as a test of the general knowledge and reasoning powers of large language models. The complete MMLU benchmark contains tens of thousands of challenge problems of different forms across 57 areas, from basic mathematics to United States history, law, computer science, engineering, medicine, and more.</p>

<p>In our <a href="https://www.microsoft.com/en-us/research/publication/can-generalist-foundation-models-outcompete-special-purpose-tuning-case-study-in-medicine/">Medprompt study</a>, we focused on medical challenge problems but found that the prompting strategy could have more general-purpose application, and we examined its performance on several out-of-domain benchmarks despite the roots of the work in medical challenges.
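One core ingredient of Medprompt, described in the study above, is dynamic few-shot selection: instead of using fixed exemplars, the training examples nearest to each test question in embedding space are retrieved at run time. Below is a minimal sketch of that idea; the `embed` function here is a toy hashing vectorizer standing in for a real embedding model, and all names are illustrative.

```python
import hashlib
import math

def embed(text, dim=64):
    # Toy stand-in for a real text-embedding model (illustrative only):
    # hash each word into a fixed-size bag-of-words vector, then normalize.
    v = [0.0] * dim
    for w in text.lower().split():
        v[int(hashlib.md5(w.encode()).hexdigest(), 16) % dim] += 1.0
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def select_few_shot(question, train_set, k=5):
    # Rank training (question, answer) pairs by cosine similarity to the
    # test question and keep the top k as dynamic few-shot exemplars.
    q = embed(question)
    def score(pair):
        e = embed(pair[0])
        return sum(a * b for a, b in zip(q, e))
    return sorted(train_set, key=score, reverse=True)[:k]
```

In a real pipeline, a learned embedding model replaces the hashing trick, and the retrieved exemplars are formatted into the few-shot portion of the prompt.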
Today, we report that steering GPT-4 with a modified version of Medprompt achieves <em>the highest score ever reported on the complete MMLU</em>.</p>

<figure><img src="https://www.microsoft.com/en-us/research/wp-content/uploads/2023/12/gemini_medprompt.jpg" alt="A graph showing the reported performance of multiple models and methods on the MMLU benchmark. Moving from left to right: Palm 2-L (5-shot) 78.4%, Claude 2 (5-shot CoT) 78.5%, Inflection-2 (5-shot) 79.6%, Gemini Pro (CoT@8) 79.13%, Gemini Ultra (CoT@32) 90.04%, GPT-4-1106 (5-shot) 86.4%, GPT-4-1106 (Medprompt @ 5) 89.1%, GPT-4-1106 (Medprompt @ 20) 89.56%, and GPT-4-1106 (Medprompt @ 31) 90.10%." width="1215" height="576" /><figcaption>Figure 1. Reported performance of multiple models and methods on the MMLU benchmark.</figcaption></figure>

<p>In our explorations, we initially found that applying the original Medprompt to GPT-4 on the comprehensive MMLU achieved a score of 89.1%.
By increasing the number of ensembled calls in Medprompt from five to 20, we raised GPT-4’s performance on the MMLU to 89.56%. To achieve a new SoTA on MMLU, we extended Medprompt to Medprompt+ by adding a simpler prompting method and formulating a policy for deriving a final answer that integrates outputs from both the base Medprompt strategy and the simple prompts. The synthesis of a final answer is guided by a control strategy governed by GPT-4 and inferred confidences of candidate answers. More details on Medprompt+ are provided in the promptbase repo. A related method for coupling complex and simple queries was harnessed by the Google Gemini team. GPT-4 steered with Medprompt+ reaches a record score of 90.10%. We note that Medprompt+ relies on accessing confidence scores (logprobs) from GPT-4. These are not currently available via the public API but will be enabled for all users in the near future.</p>

<p>While systematic prompt engineering can yield maximal performance, we continue to explore the out-of-the-box performance of frontier models with simple prompts. It’s important to keep an eye on the native power of GPT-4 and how far we can steer the model with zero- or few-shot prompting strategies.
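The ensembling step described above, issuing several calls and combining candidate answers by vote, can be sketched as follows. Here `ask_model` is a hypothetical stand-in for a GPT-4 call, and the choice shuffling mirrors one Medprompt component that reduces position bias; the full Medprompt+ policy additionally weighs candidates by logprob-derived confidence, which this sketch omits.

```python
import random
from collections import Counter

def shuffled_ensemble(ask_model, question, choices, n_calls=5, seed=0):
    # Query the model n_calls times, shuffling the multiple-choice options
    # on each call to reduce position bias, then majority-vote the answers.
    # ask_model(question, choices) -> chosen answer text (hypothetical API).
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_calls):
        shuffled = list(choices)
        rng.shuffle(shuffled)
        votes[ask_model(question, shuffled)] += 1
    answer, _count = votes.most_common(1)[0]
    return answer
```

Raising `n_calls` (as in moving from 5 to 20 ensembled calls) trades additional inference cost for more stable voting.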
As demonstrated in Table 1, starting with simple prompting is useful for establishing baseline performance before layering in more sophisticated and expensive methods.</p>

<table>
<thead><tr><th>Benchmark</th><th>GPT-4 Prompt</th><th>GPT-4 Results</th><th>Gemini Ultra Results</th></tr></thead>
<tbody>
<tr><td>MMLU</td><td>Medprompt+</td><td><strong>90.10%</strong></td><td>90.04%</td></tr>
<tr><td>GSM8K</td><td>Zero-shot</td><td><strong>95.27%</strong></td><td>94.4%</td></tr>
<tr><td>MATH</td><td>Zero-shot</td><td><strong>68.42%</strong></td><td>53.2%</td></tr>
<tr><td>HumanEval</td><td>Zero-shot</td><td><strong>87.8%</strong></td><td>74.4%</td></tr>
<tr><td>BIG-Bench-Hard</td><td>Few-shot + CoT*</td><td><strong>89.0%</strong></td><td>83.6%</td></tr>
<tr><td>DROP</td><td>Zero-shot + CoT</td><td><strong>83.7%</strong></td><td>82.4%</td></tr>
<tr><td>HellaSwag</td><td>10-shot**</td><td><strong>95.3%**</strong></td><td>87.8%</td></tr>
</tbody>
</table>
<p><sup>* Followed the norm of evaluations and used standard few-shot examples from the dataset creators.<br>** Source: Google.</sup><br>Table 1. Models, strategies, and results.</p>

<p>We encourage you to check out the <a href="https://github.com/microsoft/promptbase">promptbase repo</a> on GitHub for more details about prompting techniques and tools. This area of work is evolving, with much to learn and share.
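Several of the GPT-4 baselines in Table 1 rely on zero-shot chain-of-thought prompting. A minimal sketch of what such a prompt and its answer extraction can look like is shown below; the wording and the "Answer:" convention are illustrative, not the exact prompts used in our runs.

```python
import re

def zero_shot_cot_prompt(question):
    # Ask the model to reason step by step and finish with a clearly
    # marked final answer (the wording here is illustrative).
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, then give your final "
        "answer on a new line in the form 'Answer: <value>'."
    )

def extract_answer(completion):
    # Pull the final answer from a completion that follows the
    # 'Answer: <value>' convention; returns None if the marker is absent.
    m = re.search(r"Answer:\s*(.+)", completion)
    return m.group(1).strip() if m else None
```

Keeping the final answer in a fixed, machine-readable format is what makes large-scale automated scoring of benchmark runs practical.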
We’re excited about the directions and possibilities ahead.</p>

<p><em>Eric Horvitz, Harsha Nori, and Yin Tat Lee | December 12, 2023</em></p>