Pushing boundaries of complex reasoning in small language models
- Maya Murad, Microsoft; Mojan Javaheripi, Microsoft
Mojan Javaheripi, Member of Technical Staff at Microsoft Research AI Frontiers, presents Phi-4-Reasoning and Phi-4-Reasoning-Plus—two 14B models designed to advance complex reasoning in small-scale language models. By introducing a dedicated “thinking block” and applying supervised fine-tuning and reinforcement learning on carefully curated STEM datasets, these models achieve major improvements in problem-solving capabilities.
Explore more
Phi-4-reasoning | Phi-4-reasoning-plus
AI Model Catalog
Phi-4-reasoning | Phi-4-reasoning-plus | Phi-4-gguf
HuggingFace
Phi-4-reasoning Technical Report (PDF)
April 2025
Transcript
Pushing boundaries of complex reasoning in small language models
[MUSIC]
MAYA MURAD: Mojan is a senior researcher and fellow lab mate from AI Frontiers, based in Redmond. Mojan has been at the heart of developing the Phi suite of small language models, which have made a real splash in the open-source community for their balance of efficiency and capability.
Today, she’ll walk us through the latest milestone, the Phi-4-reasoning models, which are trained to reason step by step through complex math, science, and coding problems. They’re a great example of how efficient language models can punch above their weight. And because they’re lightweight, they can even run locally on your own laptop or phone. That makes advanced reasoning more efficient and accessible to a broader set of use cases. Over to you, Mojan.
[MUSIC]
MOJAN JAVAHERIPI: Hi, everyone. My name is Mojan, and I’m a researcher at Microsoft Research AI Frontiers lab, where we have been working on several generations of small language models and pushing the frontier of what is possible.
Today, I wanted to talk to you about how we are teaching reasoning to small language models and specifically our recently released model, Phi-4-reasoning. This has been a collaboration with an amazing team at MSR. Before going into the details, I wanted to give a quick overview of reasoning models and their core characteristics.
These models follow explicit logical steps in order to solve [a] problem, much like how humans think. And they work by iterative refinement, meaning that they backtrack and double-check their answers to make sure there was no mistake. And their biggest difference from generic models is that they create a thinking trace, much like a scratchpad, that clearly lays out all of the analysis and trial and error the model goes through while solving the problem. And this is especially helpful when we want to understand how the model arrived at a particular solution and also verify the outcome.
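For illustration, here is a minimal Python sketch of how such a thinking trace might be separated from the final answer in a model response. The `<think>...</think>` tag format is an assumption about how the scratchpad is delimited, so check the model card for the exact convention the model emits.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a model response into its thinking trace and final answer.

    Assumes the scratchpad is delimited by <think>...</think> tags;
    the exact delimiter may differ depending on the model and tokenizer.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        # No explicit thinking block found; treat the whole response as the answer.
        return "", response.strip()
    thinking = match.group(1).strip()
    answer = response[match.end():].strip()
    return thinking, answer


# Example: a hypothetical response with a visible scratchpad.
response = "<think>Try x = 3: 3*3 + 1 = 10. Check: correct.</think> The answer is x = 3."
trace, answer = split_reasoning(response)
print("Reasoning trace:", trace)
print("Final answer:", answer)
```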
Because of these properties, the reasoning models are a natural fit for solving complex problems. These include strategic planning and multi-step solutions in agentic applications. And they’re also very good at algorithmic thinking; for example, for coding or solving scientific problems as a tutor. And they’re suitable for logical analysis and evaluating complex arguments like in constraint satisfaction. Now, let’s take a look at the Phi-4-reasoning models.
Late last year, we released the Phi-4 non-reasoning [model], which is a 14-billion parameter transformer, and all of our reasoning models are built on top of this base through a specific post-training process. There are two models in the reasoning family. Phi-4-reasoning is fine-tuned on more than 1.4 million STEM and coding questions, and Phi-4-reasoning-plus is a further enhanced version that has a short reinforcement learning stage.
What is really special about the Phi-4-reasoning models is that they compete in terms of task performance with models that are five to 50 times their size, and at 14 billion parameters, they can efficiently run on commodity hardware like laptops. And we did a comprehensive evaluation of Phi-4-reasoning on various domains like math, scientific reasoning, coding, problem solving, planning, and spatial understanding, and we found it to be better across the board than the DeepSeek-R1 distilled Llama 70B and o1-mini, while outperforming the original DeepSeek-R1, Claude 3.7 Sonnet, and Gemini 2 Thinking on most of the tasks. And we find reasoning to be a transferable meta-skill, meaning that we are seeing improvements over the base non-reasoning model even on domains that we haven’t trained for. And in order to maximize the performance of our models, we have three fundamental pillars and steps in our design.
The first step is data curation, and it starts by collecting and filtering high volumes of data. And here we are really focused on four different strategies.
The first one is that the prompts should be “teachable,” so they’re selected to have optimal complexity and diversity to teach the most content to the model. And the prompts are selected to be at the edge of the base model capability in order to boost the performance to the max. And we also have a high bar for quality control, which is why we heavily filter all of our data. And lastly, we make sure that we cover a lot of the domains that we want the model to be good at, such as STEM, coding, and also safety-focused tasks.
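One plausible way to operationalize "at the edge of the base model capability" is sketched below; this is an illustration rather than the paper's exact filter. The idea is to sample the base model several times per prompt and keep only prompts it solves some, but not all, of the time; `sample_fn` is a hypothetical stand-in for base-model generation.

```python
import random

def solve_rate(prompt: str, reference: str, sample_fn, n_samples: int = 8) -> float:
    """Estimate how often the base model solves a prompt.

    `sample_fn(prompt)` is assumed to return a candidate final answer string
    from the base (non-reasoning) model; here it is a stand-in for real generation.
    """
    correct = sum(sample_fn(prompt).strip() == reference.strip() for _ in range(n_samples))
    return correct / n_samples

def is_teachable(prompt: str, reference: str, sample_fn) -> bool:
    """Keep prompts the base model solves sometimes, but not always."""
    rate = solve_rate(prompt, reference, sample_fn)
    return 0.1 <= rate <= 0.9

# Toy stand-in for base-model sampling: answers correctly about half the time.
def fake_sampler(prompt: str) -> str:
    return "42" if random.random() < 0.5 else "41"

print(is_teachable("What is 6 * 7?", "42", fake_sampler))
```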
Once we’ve gathered our prompts and questions, we generate detailed reasoning chain of thoughts using o3-mini as a teacher model, and these high-quality step-by-step demonstrations are shown to the model during training to achieve similar problem-solving capabilities. And we also focus on data mixture and do a strategic combination of different data domains and difficulty levels to make sure that the model knows how to solve different problem types.
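As a rough illustration of this kind of teacher-based data generation, and not the team's actual pipeline, the sketch below collects step-by-step demonstrations for a list of curated prompts using the standard OpenAI Python SDK. The prompts, the instruction wording, and the generation settings are assumptions.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

# Hypothetical curated STEM prompts that passed the quality and difficulty filters.
curated_prompts = [
    "A train travels 120 km in 1.5 hours. What is its average speed in m/s?",
    "Prove that the sum of two even integers is even.",
]

INSTRUCTION = (
    "Solve the following problem step by step, double-checking your work, "
    "and state the final answer clearly.\n\n"
)

demonstrations = []
for prompt in curated_prompts:
    completion = client.chat.completions.create(
        model="o3-mini",  # teacher model named in the talk; exact settings are assumptions
        messages=[{"role": "user", "content": INSTRUCTION + prompt}],
    )
    # Store the prompt/response pair as a supervised fine-tuning example.
    demonstrations.append(
        {"prompt": prompt, "response": completion.choices[0].message.content}
    )

print(f"Collected {len(demonstrations)} teacher demonstrations.")
```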
Once the data is ready, we perform supervised fine-tuning, where we have a multi-stage experimental process. The first stage is exploration, where we establish the training recipe, and we focus on doing a lot of ablations and experiments on single-domain data sources, and we tune our training hyperparameters and the general training algorithm. We had a couple of key insights during this exploration stage. The first one is that synthetic data improves the final answer quality a lot, and we find it to be very helpful, as we’ve outlined in the technical report. We also noticed that the reasoning-specific system message boosts the consistency and performance of the model, and that is why we encourage everyone to use the system prompt that the model was trained with.
We also noticed that as the model becomes better and the quality improves, the response length decreases as the thinking process becomes more efficient and coherent. And last but not least, we noticed that there is an additive property across different domains where we can individually optimize the data sources within one domain and then confidently combine all of the domains together in one final run.
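As a quick illustration of using that recommended system prompt at inference time, here is a minimal sketch with Hugging Face transformers. The model ID, the placeholder system prompt, and the sampling settings are assumptions; the exact system prompt and recommended generation parameters should be taken from the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID for the released reasoning model.
model_id = "microsoft/Phi-4-reasoning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Placeholder: replace with the exact reasoning system prompt from the model card,
# since the talk recommends using the system prompt the model was trained with.
system_prompt = (
    "You are a helpful assistant. Think through the problem step by step in a "
    "thinking block, then give the final answer."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What is the derivative of x^3 * sin(x)?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings here are illustrative; check the model card for recommended values.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```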
And that brings us to the second stage of our experiments, which is the scaling up of both training and inference. So here we combine all of the data from the different domains that we’ve gathered and optimized previously into one long training run, and we also use a better teacher model and allow inference-time scaling for the teacher to generate longer responses. This longer context allows our model to spend more time thinking to solve more complex problems.
The final stage of our training is a short reinforcement learning run using outcome-based rewards, which allow the model to learn from its own success and failure patterns, and we select a subset of math problems to conduct this reinforcement learning on. And what we observed is that reinforcement learning actually allows the model to generate, on average, 1.5 times longer responses, which contain more detailed step-by-step solutions. So this gives us a tradeoff between accuracy and efficiency, where the reinforcement learning checkpoint provides higher accuracy at the cost of increased tokens.
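As a loose illustration of an outcome-based reward, and not the exact reward described in the technical report, the sketch below scores a completion purely on whether its extracted final answer matches a reference solution; the "Final answer:" marker and the scoring values are assumptions.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Pull the final answer out of a completion.

    Hypothetical convention: the answer appears after a 'Final answer:' marker,
    otherwise fall back to the last number in the text.
    """
    marker = re.search(r"Final answer:\s*(.+)", completion)
    if marker:
        return marker.group(1).strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Outcome-based reward: +1 for a correct final answer, -1 otherwise.

    Simplified sketch; the reward used in practice may also account for
    response length and formatting.
    """
    return 1.0 if extract_final_answer(completion) == reference_answer.strip() else -1.0

# Example usage on a hypothetical rollout from the model.
rollout = "Let n = 7. Then 7 * 6 / 2 = 21. Final answer: 21"
print(outcome_reward(rollout, "21"))  # 1.0
```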
And we’ve been seeing good adoption by the community for our models due to their competitive performance across domains and their small footprint. The community also appreciated the clear chain of thought that the model generates and the visibility of the multi-step planning. We’ve also been seeing great contributions by the community for various quantized versions of our model that further push the efficiency, and we really emphasize the value of open weights, which not only allow for free and accessible inference but also allow others to build on top of our models and our technology to push the frontier even further.
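As an illustration of running one of those community quantized builds locally, here is a minimal sketch using the llama-cpp-python package; the GGUF file name, context size, and prompt are placeholders, assuming a quantized Phi-4-reasoning file has already been downloaded.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point this at a downloaded quantized GGUF of Phi-4-reasoning.
# Reasoning traces can be long, so a generous context window is assumed here.
llm = Llama(model_path="./phi-4-reasoning-Q4_K_M.gguf", n_ctx=16384)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Think step by step, then give the final answer."},
        {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
    ],
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
```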
I would like to thank all of you for tuning into this talk, and here are some more resources if you’re interested in learning more about our project and starting to use the models.
Maya Murad, Senior Technical PM, AI Frontiers
Mojan Javaheripi, Principal Researcher