Teaching small language models to think like optimization experts with OptiMind

OptiMind is a specialized language model that translates natural-language problem descriptions directly into solver-ready mathematical optimization formulations. This removes one of the most expertise-intensive bottlenecks in optimization workflows and makes advanced optimization more accessible.


Transcript

Teaching small language models to think like optimization experts with OptiMind

Optimization problems usually start in plain language: notes, constraints, and ideas written by humans. Turning those into clean, solver-ready math is often the hardest part.

I have the pleasure of working closely with our next speaker, Xinzhi, a research intern at MSR Redmond and a PhD student at the University of Washington. She’ll introduce OptiMind, a specialized language model designed to close that gap by translating natural-language problem descriptions directly into formal optimization models.

This work pushes the state of the art in AI reasoning, while also improving reliability in systems where correctness really matters. Let’s dive into the details and new capabilities of OptiMind.

Hi everyone! I’m Xinzhi, a research intern in the Machine Learning and Optimization group here at Microsoft Research, and a PhD student at the University of Washington. Optimization is the engine behind the world’s most critical systems, from global supply chains to logistics planning and vehicle routing.

Today, I want to introduce OptiMind, our new 20-billion-parameter reasoning model designed to translate natural-language problems into mixed-integer linear programming (MILP) formulations and the corresponding solver code. Essentially, it acts like an operations research expert to solve complex real-life tasks.

In this project, which was a collaboration with the amazing ML team here at Microsoft Research, we made two core contributions. First, we developed a way to clean up incredibly noisy optimization data using expert knowledge. And second, we trained a small model that matches frontier-level performance.

OptiMind’s task works as follows. A user describes, in natural language, a problem that involves decisions to be optimized, for example: “Please provide an optimal production plan that maximizes profit across six months, given specific capacity constraints.” The role of the language model is to generate the mathematical formulation corresponding to the problem, in the form of executable solver code, so that the user can run it and obtain an optimal solution.
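To make the task concrete, here is a minimal sketch (not OptiMind’s actual output) of the kind of solver-ready logic such a query maps to: a toy six-month production plan, maximize Σₜ profit[t]·x[t] subject to 0 ≤ x[t] ≤ capacity[t]. All numbers are hypothetical, and because the months are independent here, the optimum has a closed form (produce at capacity whenever unit profit is positive), so no solver library is needed for the sketch.

```python
# Toy instance: maximize sum_t profit[t] * x[t]  s.t.  0 <= x[t] <= capacity[t].
# Hypothetical per-month data (illustrative only):
profit = [4, 6, 5, 3, 7, 2]              # unit profit per month
capacity = [100, 80, 90, 120, 60, 110]   # production capacity per month

# With independent per-month constraints, the LP optimum is to produce at
# capacity in every month whose unit profit is positive.
plan = [c if p > 0 else 0 for p, c in zip(profit, capacity)]
total_profit = sum(p * x for p, x in zip(profit, plan))
print(plan, total_profit)  # -> [100, 80, 90, 120, 60, 110] 2330
```

In a real OptiMind output, coupled constraints (inventory carried between months, shared resources) would make this a genuine MILP requiring a solver; the sketch only illustrates the formulation shape.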

When we inspected the existing datasets and benchmarks in this domain, we found severe data quality issues. This applies both to the training datasets and to the benchmarks used for testing. In fact, we saw some training sets with up to 50% problematic instances, including missing constraints, ambiguous descriptions, or wrong solutions.

While we could manually clean the test benchmarks to ensure rigorous evaluation, fixing the massive training sets is much more difficult. Training on the dirty data would be like teaching a student with a textbook full of mistakes. So the first research question became: with such noisy data, how can we still train a competitive and reliable model?

To solve this, we started by analyzing how models fail when answering the training questions. We noticed that within specific optimization classes, models make the same structural mistakes repeatedly. Take the traveling salesman problem as an example. Models consistently mishandle the subtour elimination constraints. Specifically, they often incorrectly apply the constraints to the starting node, and the result is infeasible, disconnected loops instead of one continuous route.
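The structural mistake can be sketched with the standard Miller–Tucker–Zemlin (MTZ) subtour elimination formulation (used here as an illustrative stand-in; the talk does not name a specific formulation). The ordering variables u_i are defined only for non-start nodes; applying them to the start node over-constrains the tour, which is the failure mode described above.

```python
def mtz_constraints(n):
    """Return MTZ subtour-elimination constraints (as strings) for an
    n-city TSP with start node 0. Note node 0 is deliberately excluded:
    the ordering variables u_i exist only for nodes 1..n-1."""
    cons = []
    for i in range(1, n):          # start node 0 is NOT given a u variable
        for j in range(1, n):
            if i != j:
                # u_i - u_j + n * x_ij <= n - 1
                cons.append(f"u{i} - u{j} + {n}*x_{i}_{j} <= {n - 1}")
    return cons

cons = mtz_constraints(4)
print(len(cons))  # -> 6  (ordered pairs among nodes 1..3)
```

The buggy pattern the models produce is equivalent to ranging both loops from 0, which forbids the tour from closing back at the start.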

However, we found that if we add specific hints that explicitly instruct the model how to handle the starting node, that fixes the logic: the model follows the hint and generates a correct and feasible formulation. This gave us a key insight. We don’t need to manually fix every data point; instead, our experts can identify these failure modes and write a library of expert hints. Think of hints as guardrails.

These are simple instructions, like enforcing flow conservation, or setting the big-M value from the maximum variable bound instead of a fixed number. They warn the model about specific pitfalls before it even starts generating the solution. We utilize the hints in two workflows. First, for data cleaning: we classify the noisy training data and pair it with the specific hints.

We then feed this to a teacher model, which uses the hint to reason correctly and regenerate a clean solution, and finally apply majority voting to ensure robustness. This yields a rigorous, expert-quality dataset with minimal human intervention. We then use this clean data for supervised fine-tuning of our base model, which is GPT-OSS-20B.
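The majority-voting step can be sketched as follows. The function name and agreement threshold are illustrative assumptions, not the paper’s exact recipe: the teacher regenerates several candidates, and an instance is kept only when enough of them agree on the objective value.

```python
from collections import Counter

def majority_vote(candidate_objectives, min_agreement=0.5):
    """Keep a regenerated solution only if enough candidates agree on its
    objective value; otherwise discard the training instance as unreliable.
    (Illustrative sketch; threshold is an assumption.)"""
    if not candidate_objectives:
        return None
    value, count = Counter(candidate_objectives).most_common(1)[0]
    if count / len(candidate_objectives) >= min_agreement:
        return value
    return None  # no consensus: drop this instance

print(majority_vote([2330.0, 2330.0, 2310.0, 2330.0]))  # -> 2330.0
```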

And crucially, we repeat this pipeline at inference time. When a user query comes in, we first classify the problem type, identifying it, for example, as a traveling salesman problem. We then retrieve the specific expert hint and feed both the user’s question and the hint into our model to solve the problem.
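The classify-then-retrieve step can be sketched as below. The class names, hints, and the keyword classifier are all illustrative stand-ins; OptiMind’s actual classification is done by a model, not keywords.

```python
# Hypothetical hint library: problem class -> expert guardrail (illustrative text).
HINT_LIBRARY = {
    "tsp": "Apply subtour-elimination (MTZ) ordering variables only to non-start nodes.",
    "network_flow": "Enforce flow conservation at every intermediate node.",
    "facility_location": "Set big-M from the maximum variable bound, not a fixed constant.",
}

def classify(query):
    """Toy keyword classifier standing in for the learned classification step."""
    q = query.lower()
    if "tour" in q or "route" in q or "salesman" in q:
        return "tsp"
    if "flow" in q:
        return "network_flow"
    return "facility_location"

def build_prompt(query):
    """Prepend the retrieved expert hint to the user's question."""
    hint = HINT_LIBRARY[classify(query)]
    return f"Expert hint: {hint}\n\nProblem: {query}"

print(build_prompt("Find the shortest tour visiting all five warehouses."))
```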

And if the compute budget allows, we go a step further: we utilize the error messages from the solver to perform multi-turn self-correction, ensuring the final code is executable and valid. Our results are very compelling. On the cleaned benchmarks IndustryOR, Mamo-Complex, and OptMATH, OptiMind consistently outperforms the other open-source reasoning models under 32 billion parameters by at least 10%.
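The multi-turn self-correction loop mentioned above can be sketched as follows; `generate` and `run_solver` are hypothetical stand-ins for the model call and code execution, not a real OptiMind API.

```python
def self_correct(generate, run_solver, max_turns=5):
    """Run generated solver code; on failure, feed the error message back to
    the model for a revised attempt (sketch of multi-turn self-correction)."""
    feedback = None
    for _ in range(max_turns):
        code = generate(feedback)
        try:
            return run_solver(code)       # success: return the solution
        except Exception as err:
            feedback = str(err)           # next turn sees the solver error
    return None  # budget exhausted without a valid solution

# Demo with stubs: the first draft fails, the repaired draft succeeds.
def fake_generate(feedback):
    return "fixed code" if feedback else "broken code"

def fake_run_solver(code):
    if code == "broken code":
        raise ValueError("constraint index out of range")
    return "optimal solution"

print(self_correct(fake_generate, fake_run_solver))  # -> optimal solution
```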

And when we compare against frontier models like GPT-5 and o4-mini, we see that with just five turns of self-correction, we match their performance and sometimes even outperform them. And remember, we are achieving this frontier-level performance with only 20 billion parameters. The small scale of the model is also critical for local deployment: it allows organizations to keep sensitive supply-chain data private on their own GPUs, and it makes the research reproducible and accessible for the community.

So, to summarize, we showed that you can train a competitive, domain-specific model starting from noisy data by first distilling expert knowledge into reusable hints. We believe that applying optimization in industrial practice should be a community effort, so we have open-sourced our model, the cleaned benchmarks, and our example experiments so that others can build on top of OptiMind. Looking ahead, we’re also working on fully automated pipelines using frontier models, so this approach can easily adapt to new areas such as cloud efficiency, calendar scheduling, urban planning, and more. While we focused on standard optimization families, we hope the community will use our methods to explore new and less standard optimization domains.

I would like to thank all of you for tuning in to this talk. Here are some resources if you’re interested in learning more about this project and starting to use the models.