{"id":1140974,"date":"2025-07-08T14:33:29","date_gmt":"2025-07-08T21:33:29","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1140974"},"modified":"2025-07-08T15:55:41","modified_gmt":"2025-07-08T22:55:41","slug":"phi-reasoning-once-again-redefining-what-is-possible-with-small-and-efficient-ai","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/phi-reasoning-once-again-redefining-what-is-possible-with-small-and-efficient-ai\/","title":{"rendered":"Phi-Reasoning: Once again redefining what is possible with small and efficient AI\u00a0"},"content":{"rendered":"\n<p>Phi-4-reasoning is a 14-billion parameter model specialized in complex reasoning tasks. It is trained using supervised finetuning (SFT) on diverse prompts and reasoning demonstrations from o3-mini. The model generates detailed reasoning chains and leverages inference-time compute effectively. Phi-4-reasoning-plus, an enhanced version with reinforcement learning (RL), delivers even higher performance by generating longer reasoning traces.&nbsp;<\/p>\n\n\n\n<p>Despite their smaller size (14B parameters), Phi-4-reasoning and Phi-4-reasoning-plus are competitive with or exceeding much larger open weight (QwQ-32B, DeepSeek R1- Distill-Llama-70B, DeepSeek-R1) and closed (o1-mini, Claude Sonnet 3.7) reasoning models across several benchmarks as shown in Figures 1, 3 and Tables 1, 2. Our extensive benchmarks span math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. 
Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"357\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/Screenshot-2025-07-08-at-1.50.40\u202fPM-1024x357.png\" alt=\"Figure 1. Performance comparison on representative reasoning benchmarks spanning mathematics (HMMT, AIME 25, OmniMath), scientific (GPQA), and coding (LiveCodeBench 8\/24-1\/25) domains.\" class=\"wp-image-1144171\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/Screenshot-2025-07-08-at-1.50.40\u202fPM-1024x357.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/Screenshot-2025-07-08-at-1.50.40\u202fPM-300x105.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/Screenshot-2025-07-08-at-1.50.40\u202fPM-768x268.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/Screenshot-2025-07-08-at-1.50.40\u202fPM-1536x535.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/Screenshot-2025-07-08-at-1.50.40\u202fPM-2048x714.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/Screenshot-2025-07-08-at-1.50.40\u202fPM-240x84.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-gray-color has-text-color has-link-color wp-elements-edc4d5519c7b486d456ac2faa06df3d1\">Figure 1. 
Performance comparison on representative reasoning benchmarks spanning mathematics (HMMT, AIME 25, OmniMath), scientific (GPQA), and coding (LiveCodeBench 8\/24-1\/25) domains.&nbsp;<\/p>\n\n\n\n<p>Notably, Phi-4-reasoning and Phi-4-reasoning-plus achieve better performance than o1-mini and DeepSeek-R1-Distill-Llama-70B on most benchmarks, and achieve performance comparable to the full DeepSeek-R1 model (with 671B parameters) on AIME 2025<sup>1<\/sup> (the 2025 qualifier for the USA Math Olympiad). They also outperform Claude 3.7 Sonnet and Gemini 2 Flash Thinking on all tasks except GPQA (PhD-level STEM questions) and Calendar Planning.<\/p>\n\n\n\n<p><strong>More Potential with Parallel Test-time Scaling:<\/strong> As shown in Figure 2, our small-ish model nearly saturates performance on AIME 2025 with increasing parallel test-time compute (e.g., Majority@N), surpassing the pass@1 of the teacher (o3-mini).&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"584\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.07.12\u202fPM-1024x584.png\" alt=\"Figure 2: Effects of parallel test-time compute on AIME 2025\" class=\"wp-image-1144176\" style=\"width:663px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.07.12\u202fPM-1024x584.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.07.12\u202fPM-300x171.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.07.12\u202fPM-768x438.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.07.12\u202fPM-1536x876.png 1536w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.07.12\u202fPM-240x137.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.07.12\u202fPM.png 1975w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-gray-color has-text-color has-link-color wp-elements-af3bbb98fae36b13d16c257982ff3567\">Figure 2: Effects of parallel test-time compute on AIME 2025<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"354\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.51.51\u202fPM-1024x354.png\" alt=\"Average Pass@1 accuracy\" class=\"wp-image-1144172\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.51.51\u202fPM-1024x354.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.51.51\u202fPM-300x104.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.51.51\u202fPM-768x266.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.51.51\u202fPM-1536x531.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.51.51\u202fPM-2048x709.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.51.51\u202fPM-240x83.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center has-gray-color has-text-color has-link-color wp-elements-47c73677091a467a579a5fb8a4179e5e\">Table 1. Average Pass@1 accuracy on selected reasoning benchmarks. 
Bold denotes best model per benchmark and model class (open versus closed-weight), and underline denotes the second best. We report the standard deviation in parentheses where available.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"key-contributors-to-best-in-class-performance\">Key contributors to best-in-class performance&nbsp;<\/h2>\n\n\n\n<p>Below we summarize the core contributions that led to the superior performance of the Phi-4-reasoning models. We provide more comprehensive technical details and experiments surrounding each bullet point in our tech report [1].&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Careful Data Curation: <\/strong>Our reasoning prompts are specifically filtered to cover a range of difficulty levels and to lie at the boundary of the base model\u2019s capabilities. Our approach aligns closely with the data-centric methods of earlier Phi and Orca models [2,3,4,5,6,7,8], demonstrating that meticulous data curation and high-quality synthetic datasets allow smaller models to compete with larger counterparts. The datasets used in supervised finetuning include topics in STEM (science, technology, engineering, and mathematics), coding, and safety-focused tasks. Our reinforcement learning is conducted on a small set of high-quality math-focused problems with verifiable solutions.&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Benefits of Supervised Finetuning (SFT):<\/strong> Phi-4-reasoning after the SFT stage already performs strongly across diverse benchmarks. Interestingly, the improvement in performance generalizes to tasks not directly targeted in the training data, such as calendar planning and general-purpose benchmarks (Table 2). 
We highlight the critical role of data mixture and training recipe in unlocking reasoning capabilities during the SFT stage, which goes hand-in-hand with our data selection and filtering.&nbsp;&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Boost with Reinforcement Learning<\/strong>: we are encouraged by the gains achieved through a short round of outcome-based reinforcement learning (RL) and the potential of combining distillation\/SFT and reinforcement learning. We observe that the model after RL provides higher accuracy on math while using approximately 1.5x more tokens than the SFT model on average, offering a trade-off between accuracy and inference-time compute.&nbsp;<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"reasoning-is-a-meta-skill\">Reasoning is a meta skill&nbsp;<\/h2>\n\n\n\n<p>We think that reasoning is a transferable meta-skill that can be learned through supervised finetuning alone and further enhanced with reinforcement learning. To test the generalization of the models\u2019 reasoning capabilities, we evaluate them on multiple new reasoning benchmarks that require algorithmic problem solving and planning, including 3SAT (3-literal Satisfiability Problem), TSP (Traveling Salesman Problem), and BA-Calendar planning. 
These reasoning tasks are nominally out-of-domain for the models as the training process did not target these skills, but the models show strong generalization to these tasks as shown in Figure 3.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"3044\" height=\"1264\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM.png\" alt=\"Average pass@1 accuracy on general-purpose benchmarks\" class=\"wp-image-1144173\" style=\"width:900px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM.png 3044w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM-300x125.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM-1024x425.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM-768x319.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM-1536x638.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM-2048x850.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-1.53.12\u202fPM-240x100.png 240w\" sizes=\"auto, (max-width: 3044px) 100vw, 3044px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center has-gray-color has-text-color has-link-color wp-elements-e416f8e68ee415bfe813d0b91288eb32\">Table 2. Average pass@1 accuracy on general-purpose benchmarks, averaged across five runs.&nbsp;<\/p>\n\n\n\n<p>This generalized improvement in capabilities also goes beyond reasoning. 
Without explicit training on non-reasoning tasks, we saw significant improvements on IFEval, FlenQA, and internal PhiBench as shown in Table 2. And despite limited coding data during the SFT stage (and none during RL), the model performs well, scoring at o1-mini level on LiveCodeBench (LCB) and Codeforces as shown in Table 1. We plan to emphasize coding further in our future versions.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"735\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.04.33\u202fPM-1024x735.png\" alt=\"Figure 3. Average Pass@1 performance on reasoning benchmarks, averaged across five runs. Except for GPQA, other benchmarks are out-of-distribution with respect to Phi-4-reasoning\u2019s training data.\u00a0\" class=\"wp-image-1144175\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.04.33\u202fPM-1024x735.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.04.33\u202fPM-300x215.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.04.33\u202fPM-768x551.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.04.33\u202fPM-1536x1103.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.04.33\u202fPM-240x172.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.04.33\u202fPM.png 1687w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-gray-color has-text-color has-link-color wp-elements-669f8c3732192c071223b7229adea215\">Figure 3. Average Pass@1 performance on reasoning benchmarks, averaged across five runs. 
Except for GPQA, other benchmarks are out-of-distribution with respect to Phi-4-reasoning\u2019s training data.\u00a0<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"lessons-on-evaluating-reasoning-models\">Lessons on evaluating reasoning models&nbsp;<\/h2>\n\n\n\n<p>Language models exhibit a high degree of generation nondeterminism, i.e., they may produce substantially different answers given the same prompts and inference hyperparameters (e.g., temperature). To account for this stochastic nature, we study the accuracy distribution on AIME 2025, approximated by kernel density estimation over 50 independent runs with the same prompt and temperature. We highlight several interesting observations, illustrated in Figure 4:&nbsp;<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>All models show high accuracy variance. For example, the accuracy of answers generated by DeepSeek-R1-Distill-Llama-70B ranges from 30% to 70%, while o3-mini\u2019s accuracy ranges from 70% to 100%. This suggests that any comparison among models using a single run can easily produce misleading conclusions.&nbsp;&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li>Models on the two extremes of average accuracy demonstrate more consistent accuracy. 
For example, Phi-4-reasoning-plus and Phi-4 have relatively narrower accuracy ranges compared to DeepSeek-R1-Distill-Llama-70B and Phi-4-reasoning.&nbsp;&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol start=\"3\" class=\"wp-block-list\">\n<li>The accuracy distribution further indicates the competitive performance of Phi-4-reasoning-plus, largely intersecting with o3-mini\u2019s distribution and being almost disjoint from DeepSeek-R1-Distill-Llama-70B\u2019s distribution.&nbsp;&nbsp;<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2239\" height=\"822\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM.png\" alt=\"chart, line chart\" class=\"wp-image-1144174\" style=\"width:882px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM.png 2239w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM-300x110.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM-1024x376.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM-768x282.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM-1536x564.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM-2048x752.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.03.15\u202fPM-240x88.png 240w\" sizes=\"auto, (max-width: 2239px) 100vw, 2239px\" \/><\/figure>\n\n\n\n<p class=\"has-text-align-center has-gray-color has-text-color has-link-color wp-elements-8653775c284901f60b9b402ed86594ce\">Figure 
4. Distribution of pass@1 accuracy on AIME 2025, approximated by kernel density estimation over 50 runs with the same prompt and temperature. The accuracy distribution further shows the competitive performance of Phi-4-reasoning-plus, largely intersecting with o3-mini\u2019s distribution and being almost disjoint from DeepSeek-R1-Distill-Llama-70B\u2019s distribution.&nbsp;&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"phi-4-reasoning-in-action\">Phi-4-Reasoning in action<\/h2>\n\n\n\n<p>Below we provide some interesting example responses from Phi-4-reasoning that showcase its intelligent behavior.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"868\" height=\"1024\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/image-868x1024.jpeg\" alt=\"Example - calendar planning\" class=\"wp-image-1141150\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/image-868x1024.jpeg 868w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/image-254x300.jpeg 254w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/image-768x906.jpeg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/image-1302x1536.jpeg 1302w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/image-153x180.jpeg 153w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/image.jpeg 1338w\" sizes=\"auto, (max-width: 868px) 100vw, 868px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"890\" height=\"1024\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.14.03\u202fPM-890x1024.png\" alt=\"Example - riddle\" class=\"wp-image-1144178\" 
srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.14.03\u202fPM-890x1024.png 890w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.14.03\u202fPM-261x300.png 261w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.14.03\u202fPM-768x884.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.14.03\u202fPM-156x180.png 156w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/Screenshot-2025-07-08-at-2.14.03\u202fPM.png 1150w\" sizes=\"auto, (max-width: 890px) 100vw, 890px\" \/><\/figure>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Prompt: &#8220;Generate a website for steves pc repairs using a single html script&#8221;<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1080\" style=\"aspect-ratio: 1800 \/ 1080;\" width=\"1800\" controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/html_render.mp4\"><\/video><\/figure>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Prompt: &#8220;write a Python program that shows a ball bouncing inside a spinning triangle. 
The ball must bounce off the rotating walls realistically and should not leave the triangle&#8221;<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"762\" style=\"aspect-ratio: 958 \/ 762;\" width=\"958\" controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ball-and-spinning-triangle-phi-4-reasoning-plus.mp4\"><\/video><\/figure>\n\n\n\n<div style=\"height:0px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"references\">References<\/h2>\n\n\n\n<p>[1] \u201cPhi-4-reasoning Technical Report.\u201d\u202f<em>arXiv preprint arXiv:2504.21318<\/em>\u202f(2025). [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/2504.21318\" target=\"_blank\" rel=\"noopener noreferrer\">link<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>]&nbsp;<\/p>\n\n\n\n<p>[2] &#8220;Phi-4 technical report.\u201d\u202f<em>arXiv preprint arXiv:2412.08905<\/em>\u202f(2024).&nbsp;<\/p>\n\n\n\n<p>[3] \u201cPhi-3 technical report: A highly capable language model locally on your phone.\u201d arXiv preprint arXiv:2404.14219 (2024).&nbsp;&nbsp;<\/p>\n\n\n\n<p>[4] \u201cPhi-2: The surprising power of small language models.\u201d Microsoft Research Blog (2023).&nbsp;<\/p>\n\n\n\n<p>[5] \u201cTextbooks are all you need.\u201d arXiv preprint arXiv:2306.11644 (2023).&nbsp;<\/p>\n\n\n\n<p>[6] \u201cAgentinstruct: Toward generative teaching with agentic flows.\u201d arXiv preprint arXiv:2407.03502 (2024).&nbsp;&nbsp;<\/p>\n\n\n\n<p>[7] \u201cOrca 2: Teaching small language models how to reason.\u201d arXiv preprint arXiv:2311.11045 (2023).&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p>[8] \u201cOrca: Progressive learning from complex explanation traces of gpt-4.\u201d arXiv preprint arXiv:2306.02707 
(2023).&nbsp;<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Phi-4-reasoning is a 14-billion parameter model specialized in complex reasoning tasks. It is trained using supervised finetuning (SFT) on diverse prompts and reasoning demonstrations from o3-mini. The model generates detailed reasoning chains and leverages inference-time compute effectively. Phi-4-reasoning-plus, an enhanced version with reinforcement learning (RL), delivers even higher performance by generating longer reasoning traces.&nbsp; Despite [&hellip;]<\/p>\n","protected":false},"author":43341,"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":992148,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[269148,269142],"class_list":["post-1140974","msr-blog-post","type-msr-blog-post","status-publish","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_assoc_parent":{"id":992148,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1140974","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43341"}],"version-history":[{"count":11,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1140974\/revisions"}],"predecessor-version":[{"id":1144187,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1140974\/revisions\/1144187"}],"wp:attachment":[{"href"
:"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1140974"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1140974"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1140974"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1140974"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}