{"id":1153391,"date":"2025-10-26T20:54:46","date_gmt":"2025-10-27T03:54:46","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1153391"},"modified":"2025-10-26T20:55:59","modified_gmt":"2025-10-27T03:55:59","slug":"opa-dpo-efficiently-minimizing-hallucinations-in-large-vision-language-models","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/opa-dpo-efficiently-minimizing-hallucinations-in-large-vision-language-models\/","title":{"rendered":"OPA-DPO: Efficiently minimizing hallucinations in large vision-language models"},"content":{"rendered":"\n<p>Large vision-language models are improving at describing images, yet hallucinations still erode trust by introducing contradictions and fabricated details that limit practical applications.<\/p>\n\n\n\n<p>In response, Microsoft Research Asia has developed <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mitigating-hallucinations-in-large-vision-language-models-via-dpo-on-policy-data-hold-the-key\/\">On-Policy Alignment DPO (OPA-DPO)<\/a>, a new algorithm that aligns expert feedback with the model\u2019s own output distribution before training begins. This \u201con-policy\u201d alignment slightly alters the model so that expert corrections are close to what the model would naturally produce. As a result, the model is more likely to learn from these expert demonstrations, rather than treating them as outliers to be ignored.<\/p>\n\n\n\n<p>Until now, most attempts to curb hallucinations have involved retraining models with extra data or applying filters to clean up their answers afterwards. While these approaches can help, they\u2019re computationally expensive and don\u2019t address the root issue: how models learn to distinguish accurate from misleading responses.<\/p>\n\n\n\n<p>Direct Preference Optimization (DPO) has recently emerged as a solution. 
It trains models to favor accurate responses by learning from pairs of good and bad examples. However, when DPO is applied to vision-language tasks, it\u2019s often inadequate because the expert-corrected examples differ too much from what the model would naturally generate, preventing effective learning.<\/p>\n\n\n\n<p>OPA-DPO addresses this, providing a simpler and more data-efficient way to reduce hallucinations than previous methods. This work has been recognized with an oral presentation at CVPR 2025.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"limitations-of-current-dpo-methods\">Limitations of current DPO methods<\/h2>\n\n\n\n<p>Previous approaches fall into three categories:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Hallucination injection<\/strong>, which injects hallucinated fragments into standard responses. Preference pairs are then constructed by pairing standard responses with their corresponding hallucinated versions.<\/li>\n\n\n\n<li><strong>Hallucination recognition<\/strong>, where models generate responses and humans or GPT-4\/4V identify and correct hallucinations. Preference pairs are then constructed by pairing corrected responses with their original versions.<\/li>\n\n\n\n<li><strong>Self-evolution<\/strong>, where models generate multiple responses and a hallucination-recognition model ranks them by severity. 
Preference pairs are constructed based on these ranking results.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1430\" height=\"707\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-1.png\" alt=\"graphical user interface, application\" class=\"wp-image-1141041\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-1.png 1430w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-1-300x148.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-1-1024x506.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-1-768x380.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-1-240x119.png 240w\" sizes=\"auto, (max-width: 1430px) 100vw, 1430px\" \/><figcaption class=\"wp-element-caption\">Figure 1. Three categories of previous approaches<\/figcaption><\/figure>\n\n\n\n<p>Among these, self-evolution tends to perform best, followed by recognition and then injection. However, all three approaches face limitations. Hallucination injection is weak because the fabricated content does not reflect the model\u2019s own tendencies. Self-evolution is more effective but computationally costly. Recognition, while seemingly the most intuitive, underperforms in practice because expert-edited responses are often too different from the model\u2019s natural outputs. 
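<\/p>

<p>For context, the standard DPO objective behind all three recipes is a pairwise loss over a preferred and a rejected response. The sketch below is the textbook formulation in plain Python; the variable names and toy numbers are illustrative, not from the paper:<\/p>

```python
import math

def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    # logp_w / logp_l: policy log-probabilities of the preferred (w)
    # and rejected (l) responses; ref_logp_* are the same quantities
    # under the frozen reference model.
    # Implicit reward margin between the preferred and rejected responses:
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): the loss shrinks as the policy widens the margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: widening the implicit reward margin lowers the loss.
narrow = dpo_loss(0.1, logp_w=-20.0, logp_l=-25.0, ref_logp_w=-22.0, ref_logp_l=-23.0)
wide = dpo_loss(0.1, logp_w=-18.0, logp_l=-30.0, ref_logp_w=-22.0, ref_logp_l=-23.0)
```

<p>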
Standard DPO struggles to learn from this \u201coff-policy\u201d data, leading to vanishing gradients and little improvement.<\/p>\n\n\n\n<p>These challenges highlight the need for a method that can incorporate expert corrections while staying aligned with the model\u2019s own output distribution.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"opa-dpo-breaking-convention-reshaping-alignment-strategy\">OPA-DPO: Breaking convention, reshaping alignment strategy<\/h2>\n\n\n\n<p>To address these challenges, OPA-DPO introduces an on-policy alignment step before DPO training. With only 4.8k training samples, it achieves state-of-the-art performance, compared with the 16k samples required by previous state-of-the-art methods.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1431\" height=\"410\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-6.png\" alt=\"chart, diagram\" class=\"wp-image-1141046\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-6.png 1431w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-6-300x86.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-6-1024x293.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-6-768x220.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-6-240x69.png 240w\" sizes=\"auto, (max-width: 1431px) 100vw, 1431px\" \/><figcaption class=\"wp-element-caption\">Figure 2. OPA-DPO implementation method<\/figcaption><\/figure>\n\n\n\n<p>OPA-DPO aligns a model\u2019s outputs with expert-preferred responses through a four-step process. First, it generates responses from the model using both the image and prompt. 
Next, expert feedback\u2014such as that from GPT-4V\u2014is used to make fine-grained edits to these responses, correcting hallucinations while preserving accurate content.<\/p>\n\n\n\n<p>The edited and ground-truth responses are then used to fine-tune the original data-producing model via LoRA-SFT, resulting in what is referred to as the OPA model. Finally, DPO training is performed on the OPA model, incorporating language, image, and anchor preference pairs. Among these stages, the OPA step has the greatest impact on performance. This process is shown in Figure 3.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"865\" height=\"819\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-7.png\" alt=\"diagram\" class=\"wp-image-1141047\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-7.png 865w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-7-300x284.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-7-768x727.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-7-190x180.png 190w\" sizes=\"auto, (max-width: 865px) 100vw, 865px\" \/><figcaption class=\"wp-element-caption\">Figure 3. OPA-DPO achieves alignment in four steps<\/figcaption><\/figure>\n\n\n\n<p>Researchers compared various DPO-based algorithms fine-tuned on LLaVA-1.5-7B and 13B. With only 4.8k training samples, OPA-DPO achieves state-of-the-art performance on 50% of hallucination metrics for LLaVA-Instruct-1.5-7B, improving to 70% for LLaVA-Instruct-1.5-13B. It demonstrates particularly strong results on metrics that directly measure hallucination occurrence, such as CHAIR and HalRate. 
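<\/p>

<p>For readers unfamiliar with CHAIR: at its core, it measures the fraction of objects mentioned in a caption that are not actually present in the image. The sketch below is a simplified illustration; the real benchmark also handles synonym and plural matching, which is omitted here:<\/p>

```python
def chair_i(mentioned_objects, image_objects):
    # Simplified instance-level CHAIR: the share of mentioned objects
    # that do not appear in the image's ground-truth object set.
    ground_truth = set(image_objects)
    hallucinated = [obj for obj in mentioned_objects if obj not in ground_truth]
    return len(hallucinated) / len(mentioned_objects) if mentioned_objects else 0.0

# A caption mentioning a 'car' that is not in the image scores 1/3.
score = chair_i(['dog', 'frisbee', 'car'], ['dog', 'frisbee', 'person'])
```

<p>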
The results are shown in Table 1.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1431\" height=\"627\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-8.png\" alt=\"table\" class=\"wp-image-1141048\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-8.png 1431w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-8-300x131.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-8-1024x449.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-8-768x337.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-8-240x105.png 240w\" sizes=\"auto, (max-width: 1431px) 100vw, 1431px\" \/><figcaption class=\"wp-element-caption\">Table 1. To fairly compare various RLAIF\/RLHF-enhanced LVLM algorithms, researchers used greedy decoding to evaluate across multiple benchmarks, annotated sources to distinguish official reproductions from paper results, and bolded the best scores in each metric group.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"evaluating-opa-dpo\">Evaluating OPA-DPO<\/h2>\n\n\n\n<p>To validate the importance of the OPA step and of training data volume, researchers conducted ablation studies. Even with 600 training samples, OPA-DPO performs better than most baseline algorithms on hallucination-related metrics. As the data volume increases, its performance steadily improves. 
Incorporating the OPA operation leads to a nearly 50% reduction in AMBER HalRate and Object-Hal CHAIRs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1431\" height=\"521\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-9.png\" alt=\"chart, line chart\" class=\"wp-image-1141049\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-9.png 1431w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-9-300x109.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-9-1024x373.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-9-768x280.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-9-240x87.png 240w\" sizes=\"auto, (max-width: 1431px) 100vw, 1431px\" \/><figcaption class=\"wp-element-caption\">Figure 4. Impact of training data volume and OPA operation on OPA-DPO (ablation study)<\/figcaption><\/figure>\n\n\n\n<p>They also experimented with LLaVA-OneVision as the base model. 
Although LLaVA-OneVision produces detailed but redundant outputs with numerous hallucinations, OPA-DPO significantly improved its hallucination metrics with only 2.4k training samples, achieving a 43.2% reduction in HalRate and a 38.7% improvement in CHAIR scores compared with baseline models.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1469\" height=\"448\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-10.png\" alt=\"table\" class=\"wp-image-1141050\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-10.png 1469w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-10-300x91.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-10-1024x312.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-10-768x234.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-10-240x73.png 240w\" sizes=\"auto, (max-width: 1469px) 100vw, 1469px\" \/><figcaption class=\"wp-element-caption\">Table 2. Experimental results of OPA-DPO on LLaVA-OneVision<\/figcaption><\/figure>\n\n\n\n<p>OPA-DPO-trained models tend to adopt a conservative strategy, emphasizing salient and verifiable observations while minimizing attention to ambiguous or less relevant details. As illustrated in Figure 5, this approach focuses the description on the actions of the three individuals at the center of the image, while deliberately ignoring peripheral elements such as trees and minor details, like backpacks, that base models tend to speculate about. 
By avoiding speculative or overly detailed content that could introduce hallucinations, the models prioritize clarity and reliability\u2014contributing to their improved performance on hallucination metrics.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1425\" height=\"751\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-11.png\" alt=\"Impact of OPA operation on model output in image description tasks\" class=\"wp-image-1141051\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-11.png 1425w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-11-300x158.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-11-1024x540.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-11-768x405.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-11-240x126.png 240w\" sizes=\"auto, (max-width: 1425px) 100vw, 1425px\" \/><figcaption class=\"wp-element-caption\">Figure 5. Impact of OPA operation on model output in image description tasks<\/figcaption><\/figure>\n\n\n\n<p>Interestingly, base models often assume the query language is accurate, even when it contains hallucinations, leading to responses that reinforce false premises. In contrast, OPA-DPO-trained models demonstrate the ability to detect and reject hallucinated content embedded in the query itself. 
As shown in Figure 6, these models can identify fabricated elements\u2014such as the mention of \u201chands\u201d in the input prompt\u2014and respond with clarifications or corrections rather than perpetuating the hallucination.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1429\" height=\"504\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-12.png\" alt=\"graphical user interface, text, application\" class=\"wp-image-1141052\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-12.png 1429w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-12-300x106.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-12-1024x361.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-12-768x271.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/06\/opa-dpo-12-240x85.png 240w\" sizes=\"auto, (max-width: 1429px) 100vw, 1429px\" \/><figcaption class=\"wp-element-caption\">Figure 6. In erroneous premise inquiry tasks, models trained with OPA-DPO show the ability to identify hallucinations in the query.<\/figcaption><\/figure>\n\n\n\n<p>OPA-DPO not only improves performance on hallucination benchmarks; its strategy of generating on-policy data from expert feedback also marks a step forward in multimodal alignment training.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Large vision-language models are improving at describing images, yet hallucinations still erode trust by introducing contradictions and fabricated details that limit practical applications. 
In response, Microsoft Research Asia has developed On-Policy Alignment DPO (OPA-DPO), a new algorithm that aligns expert feedback with the model\u2019s own output distribution before training begins. This \u201con-policy\u201d alignment slightly alters [&hellip;]<\/p>\n","protected":false},"author":34512,"featured_media":1141056,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":0,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1153391","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_assoc_parent":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1153391","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-history":[{"count":6,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1153391\/revisions"}],"predecessor-version":[{"id":1153433,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1153391\/revisions\/1153433"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1141056"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1153391"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=115339
1"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1153391"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1153391"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}