New fine-tuning of language models: Match meaning, not tokens

May 14, 2026
Yash Lara, Microsoft; Carles Domingo-Enrich, Microsoft
Microsoft Research Forum | Season 2, Episode 4

Language models are usually trained to predict the next word, but that does not always lead to the best overall answers. We introduce energy-based fine-tuning, a new method that trains models to produce better full responses, leading to stronger results without the need for complex reward models or verifiers.

Transcript

[MUSIC]

[MUSIC FADES INTO SWEEPING SOUND]

YASH LARA: Most language models are still optimized around predicting the next token, even though that doesn’t always lead to the best overall response.

Let’s hear from Carles in our New England lab about energy-based fine-tuning, a different approach that trains models to optimize meaning across an entire response without relying on complex reward models or external verifiers.

It’s a clean, principled idea with big implications for how we train and deploy models going forward.

Over to you, Carles.

[MUSIC]

[MUSIC FADES INTO SWEEPING SOUND]

CARLES DOMINGO-ENRICH: Hi, this is Carles. I’m a Senior Researcher at Microsoft Research New England, and I’ll be talking about energy-based fine-tuning.

This work focuses on training large language models, so I’ll start with an overview of pre-training and post-training.

In pre-training, the most commonly used approach is next-token prediction with cross-entropy loss. In post-training, there are several phases, starting with next-token prediction in the form of mid-training and supervised fine-tuning (SFT), followed by reinforcement learning (RL) fine-tuning—either from human preferences (RLHF) or with verifiable rewards (RLVR).

Let’s compare next-token prediction with RL using a translation example. The input might be: “Translate to French: ‘The cat is sleeping,’” and the output would be “le chat dort.”

With next-token prediction, the model is evaluated token by token, and each token's prediction contributes to the overall loss. With RL, we generate outputs (rollouts), score them with a reward model, and use that signal to update the model.
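As a rough illustration of the difference in signal, the sketch below contrasts the two. It is not taken from the talk: the probabilities and reward scores are made-up numbers, and the function names `token_cross_entropy` and `rollout_weights` are hypothetical. Next-token prediction yields one loss term per token, while RL yields a single scalar per sampled output.

```python
import math

def token_cross_entropy(gold_token_probs):
    """Return the per-token losses -log p(gold token | gold prefix) and their mean."""
    per_token = [-math.log(p) for p in gold_token_probs]
    return per_token, sum(per_token) / len(per_token)

def rollout_weights(rewards):
    """Return one scalar per rollout: its reward minus the mean reward (a simple baseline)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Hypothetical probabilities the model assigns to "le", "chat", "dort"
# when it is fed the ground-truth prefix at every step.
per_token, mean_loss = token_cross_entropy([0.9, 0.6, 0.3])

# Hypothetical reward-model scores for three sampled translations.
weights = rollout_weights([1.0, 0.0, 0.5])

print("dense signal, one loss per token:", per_token)
print("sparse signal, one weight per rollout:", weights)
```

In a policy-gradient update, each weight would multiply the gradient of the log-probability of its whole rollout, so every token in that rollout shares the same scalar.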

Both approaches have pros and cons. Next-token prediction offers stable training, dense signal, and strong parallelization, but suffers from imitation bias and distribution shift, since it trains only on ground-truth context.

RL reduces distribution shift by training on model-generated outputs and allows explicit alignment, but it suffers from sparse signal, is harder to parallelize, and requires a reward model.

Our goal is to find a middle ground—an approach that encourages diverse generations, is robust to distribution shifts, provides denser signal than RL, scales well, and does not require a reward model.

Our idea is to use feature maps defined over sequences of tokens. We copy the model we want to train, extract activation values at different layers as features, and define a feature-based moment-matching loss. We then compute rewards from this and optimize using policy gradients.

In this setup, the ground-truth sequence is compared with model-generated outputs using this feature-matching loss.

The feature-matching loss measures how well the model’s distribution matches the ground-truth distribution in an embedding space. We sample context from ground truth and compare the conditional distributions between ground truth and model outputs.

In practice, computing expectations over the full ground-truth distribution is intractable, so we approximate them using the available training pairs. Importantly, this approximation preserves the gradients we need.
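One way to see why such an approximation can preserve gradients is sketched below. The talk does not state the exact form of the loss, so this assumes a squared distance between mean features and a feature map that stays fixed during training. The symbols are notation introduced here: $c$ is a context, $p$ the ground-truth distribution, $p_\theta$ the model, and $\phi$ the feature map.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \mathbb{E}_{c}\,\bigl\| \mu(c) - m_\theta(c) \bigr\|^2,
\qquad \mu(c) = \mathbb{E}_{y \sim p(\cdot \mid c)}\bigl[\phi(c, y)\bigr],
\qquad m_\theta(c) = \mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid c)}\bigl[\phi(c, \hat{y})\bigr] \\[4pt]
\mathbb{E}_{y \sim p(\cdot \mid c)} \bigl\| \phi(c, y) - m_\theta(c) \bigr\|^2
&= \bigl\| \mu(c) - m_\theta(c) \bigr\|^2
 + \mathbb{E}_{y \sim p(\cdot \mid c)} \bigl\| \phi(c, y) - \mu(c) \bigr\|^2
\end{aligned}
```

Under these assumptions, the last term does not depend on $\theta$. Replacing the intractable mean $\mu(c)$ with the features of a single training completion therefore changes the loss only by a constant, and the gradient with respect to $\theta$ is the same in expectation.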

Now let’s look at the full algorithm.

Given a context like “The kids were excited because…” and a ground-truth completion such as “it was the last day of school,” we generate multiple candidate completions from the model—for example, “the summer break was starting,” “the circus was in town,” or “the weather was nice.”

We pass these through a feature network to obtain feature vectors, which are used to compute the feature-matching loss and derive rewards. These rewards are then used to update the model via policy gradients.
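The reward step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a squared-distance feature-matching loss, uses made-up two-dimensional vectors in place of real network activations, and the function name `feature_matching_rewards` is hypothetical. Each rollout is rewarded for aligning with the gap between the ground-truth features and the mean features of the other rollouts.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def feature_matching_rewards(truth_feat, rollout_feats):
    """Reward for rollout j: 2 * phi_j . (phi_truth - mean of the other rollouts' features)."""
    n = len(rollout_feats)
    dim = len(truth_feat)
    totals = [sum(f[k] for f in rollout_feats) for k in range(dim)]
    rewards = []
    for f in rollout_feats:
        # Leave-one-out mean, so a rollout is not compared against itself.
        others_mean = [(totals[k] - f[k]) / (n - 1) for k in range(dim)]
        residual = [truth_feat[k] - others_mean[k] for k in range(dim)]
        rewards.append(2.0 * dot(f, residual))
    return rewards

# Made-up features standing in for activations of the feature network.
truth = [1.0, 0.0]      # ground-truth completion
rollouts = [
    [1.0, 0.0],         # a rollout whose features match the ground truth
    [0.0, 1.0],         # a rollout that lands elsewhere in feature space
    [0.0, 1.0],         # another rollout in that same region
]

rewards = feature_matching_rewards(truth, rollouts)

# Subtract the mean reward to get the weights used in a policy-gradient step.
baseline = sum(rewards) / len(rewards)
weights = [r - baseline for r in rewards]

print(rewards)  # [2.0, -1.0, -1.0]
print(weights)  # [2.0, -1.0, -1.0]
```

In this toy case the rollout that matches the ground-truth features receives the highest reward, and the two rollouts that cluster together away from it are penalized. No external reward model or verifier is involved: the only inputs are feature vectors.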

Let’s look at results.

Energy-based fine-tuning (EBFT) achieves lower cross-entropy loss than SFT and RLVR, even though it does not directly optimize that objective. It also achieves better downstream performance than SFT and is comparable to RLVR, without needing correctness-based rewards.

The feature-matching loss also correlates with cross-entropy but captures long-range calibration across full sequences rather than focusing on individual tokens.

These results hold across multiple domains, including question answering, coding, and translation. In unstructured coding scenarios, EBFT outperforms both SFT and RLVR in cross-entropy and feature matching, and often matches or exceeds RLVR on downstream tasks.

I’d like to thank my collaborators and the Microsoft Research environment, which enables high-risk, high-reward research. In this case, that effort has paid off—EBFT is already being used internally at Microsoft to fine-tune models.

We’d love to hear your feedback. Please check out the project repository and website for more details.

Thank you for listening.