{"id":1171602,"date":"2026-05-14T10:05:55","date_gmt":"2026-05-14T17:05:55","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-video&#038;p=1171602"},"modified":"2026-05-14T10:05:57","modified_gmt":"2026-05-14T17:05:57","slug":"new-fine-tuning-of-language-models-match-meaning-not-tokens","status":"publish","type":"msr-video","link":"https:\/\/www.microsoft.com\/en-us\/research\/video\/new-fine-tuning-of-language-models-match-meaning-not-tokens\/","title":{"rendered":"New fine-tuning of language models: Match meaning, not tokens"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Language models are usually trained to predict the next word, but that does not always lead to the best overall answers. We introduce energy-based fine-tuning, a new method that trains models to produce better full responses, leading to stronger results without the need for complex reward models or verifiers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading h4\" id=\"explore-more\">Explore more<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/matching-features-not-tokens-energy-based-fine-tuning-of-language-models\/\">Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models<\/a><\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/sjelassi\/ebft_openrlhf\" target=\"_blank\" rel=\"noopener noreferrer\">Access code on GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/event\/microsoft-research-forum\/past-episodes\/\">All Research Forum sessions<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-cta\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"http:\/\/aka.ms\/researchforum-register\">Register for the series<\/a><\/div>\n<\/div>\n\n\n<div class=\"wp-block-msr-show-more\">\n\t<div class=\"bg-neutral-100 p-5\">\n\t\t<div class=\"show-more-show-less\">\n\t\t\t<div>\n\t\t\t\t<span>\n\t\t\t\t\t\n\n<h2 class=\"wp-block-heading\" id=\"transcript\">Transcript<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>New fine-tuning of language models: Match meaning, not tokens<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[MUSIC]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[MUSIC FADES INTO\u202fSWEEPING SOUND]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>YASH LARA:<\/strong> Most language models are still optimized around predicting the next token, even though that doesn\u2019t always lead to the best overall response.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s hear from Carles in our New England lab about energy-based fine-tuning, a different approach that trains models to optimize meaning across an entire response without relying on complex reward models or external verifiers.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It\u2019s a clean, principled idea with big implications for how we train and deploy models going forward.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Over to you, Carles.<\/p>\n\n\n\n\t\t\t\t<\/span>\n\t\t\t\t<span id=\"show-more-show-less-toggle-1\" class=\"show-more-show-less-toggleable-content\">\n\t\t\t\t\t\n\n\n\n<p class=\"wp-block-paragraph\">[MUSIC]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[MUSIC FADES INTO\u202fSWEEPING SOUND]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>CARLES DOMINGO-ENRICH:<\/strong> Hi, this is Carles. I\u2019m a Senior Researcher at Microsoft Research New England, and I\u2019ll be talking about energy-based fine-tuning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This work focuses on training large language models, so I\u2019ll start with an overview of pre-training and post-training.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In pre-training, the most commonly used approach is next-token prediction with cross-entropy loss. In post-training, there are several phases, starting with next-token prediction in the form of mid-training and supervised fine-tuning (SFT), followed by reinforcement learning (RL) fine-tuning\u2014either from human preferences (RLHF) or with verifiable rewards (RLVR).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s compare next-token prediction with RL using a translation example. The input might be: \u201cTranslate to French: \u2018The cat is sleeping,\u2019\u201d and the output would be \u201cle chat dort.\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With next-token prediction, the model is evaluated token by token, and each contributes to the overall loss. With RL, we generate outputs (rollouts), score them with a reward model, and use that signal to update the model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Both approaches have pros and cons. Next-token prediction offers stable training, dense signal, and strong parallelization, but suffers from imitation bias and distribution shift, since it trains only on ground-truth context.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">RL reduces distribution shift by training on model-generated outputs and allows explicit alignment, but it suffers from sparse signal, reduced parallelizability, and requires a reward model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our goal is to find a middle ground\u2014an approach that encourages diverse generations, is robust to distribution shifts, provides denser signal than RL, scales well, and does not require a reward model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our idea is to use feature maps defined over sequences of tokens. We copy the model we want to train, extract activation values at different layers as features, and define a feature-based moment-matching loss. We then compute rewards from this and optimize using policy gradients.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this setup, the ground-truth sequence is compared with model-generated outputs using this feature-matching loss.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The feature-matching loss measures how well the model\u2019s distribution matches the ground-truth distribution in an embedding space. We sample context from ground truth and compare the conditional distributions between ground truth and model outputs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, computing expectations over the full ground-truth distribution is intractable, so we approximate it using available training pairs. Importantly, this approximation preserves the gradients we need.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now let\u2019s look at the full algorithm.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Given a context like \u201cThe kids were excited because\u2026\u201d and a ground-truth completion such as \u201cit was the last day of school,\u201d we generate multiple candidate completions from the model\u2014for example, \u201cthe summer break was starting,\u201d \u201cthe circus was in town,\u201d or \u201cthe weather was nice.\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We pass these through a feature network to obtain feature vectors, which are used to compute the feature-matching loss and derive rewards. These rewards are then used to update the model via policy gradients.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s look at results.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Energy-based fine-tuning (EBFT) achieves better cross-entropy loss than SFT and RLVR\u2014even though it does not directly optimize for that objective. It also achieves better downstream performance than SFT and is comparable to RLVR, without needing correctness-based rewards.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The feature-matching loss also correlates with cross-entropy but captures long-range calibration across full sequences rather than focusing on individual tokens.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These results hold across multiple domains, including question answering, coding, and translation. In unstructured coding scenarios, EBFT outperforms both SFT and RLVR in cross-entropy and feature matching, and often matches or exceeds RLVR on downstream tasks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I\u2019d like to thank my collaborators and the Microsoft Research environment, which enables high-risk, high-reward research. In this case, that effort has paid off\u2014EBFT is already being used internally at Microsoft to fine-tune models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019d love to hear your feedback. Please check out the project repository and website for more details.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Thank you for listening.<\/p>\n\n\t\t\t\t<\/span>\n\t\t\t<\/div>\n\t\t\t<button\n\t\t\t\tclass=\"action-trigger glyph-prepend mt-2 mb-0 show-more-show-less-toggle\"\n\t\t\t\taria-expanded=\"false\"\n\t\t\t\tdata-show-less-text=\"Show less\"\n\t\t\t\ttype=\"button\"\n\t\t\t\taria-controls=\"show-more-show-less-toggle-1\"\n\t\t\t\taria-label=\"Show more content\"\n\t\t\t\tdata-alternate-aria-label=\"Show less content\">\n\t\t\t\tShow more\t\t\t<\/button>\n\t\t<\/div>\n\t<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Language models are usually trained to predict the next word, but that does not always lead to the best overall answers. We introduce energy-based fine-tuning, a new method that trains models to produce better full responses, leading to stronger results without the need for complex reward models or verifiers.<\/p>\n","protected":false},"featured_media":1171930,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr_hide_image_in_river":0,"footnotes":""},"research-area":[13556],"msr-video-type":[268311],"msr-locale":[268875],"msr-post-option":[],"msr-session-type":[256174],"msr-impact-theme":[],"msr-pillar":[],"msr-episode":[270329],"msr-research-theme":[270109],"class_list":["post-1171602","msr-video","type-msr-video","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-video-type-microsoft-research-forum","msr-locale-en_us"],"msr_download_urls":"","msr_external_url":"https:\/\/youtu.be\/8efKuAWVCMs","msr_secondary_video_url":"","msr_video_file":"http:\/\/0","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video\/1171602","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-video"}],"version-history":[{"count":4,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video\/1171602\/revisions"}],"predecessor-version":[{"id":1172014,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video\/1171602\/revisions\/1172014"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1171930"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1171602"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1171602"},{"taxonomy":"msr-video-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video-type?post=1171602"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1171602"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1171602"},{"taxonomy":"msr-session-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-session-type?post=1171602"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1171602"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1171602"},{"taxonomy":"msr-episode","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-episode?post=1171602"},{"taxonomy":"msr-research-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-theme?post=1171602"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}