Large vision-language models are improving at describing images, yet hallucinations still erode trust by introducing contradictions and fabricated details that limit practical applications.
In response, Microsoft Research Asia has developed On-Policy Alignment DPO (OPA-DPO), a new algorithm that aligns expert feedback with the model’s own output distribution before training begins. This “on-policy” alignment slightly alters the model so that expert corrections are close to what the model would naturally produce. As a result, the model is more likely to learn from these expert demonstrations, rather than treating them as outliers to be ignored.
Until now, most attempts to curb hallucinations have involved retraining models with extra data or applying filters to clean up their answers afterwards. While these approaches can help, they’re computationally expensive and don’t address the root issue: how models learn to distinguish accurate from misleading responses.
Direct Preference Optimization (DPO) has recently emerged as a solution. It trains models to favor accurate responses by learning from pairs of good and bad examples. However, when DPO is applied to vision-language tasks, it’s often inadequate because the expert-corrected examples differ too much from what the model would naturally generate, preventing effective learning.
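For reference, the snippet below is a minimal PyTorch sketch of the standard DPO objective described above. The tensor names and the `beta` default are illustrative placeholders, not taken from OPA-DPO's code; each `*_logps` tensor is assumed to hold the summed log-probability of a full response.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of (chosen, rejected) preference pairs."""
    # Log-ratio of policy to the frozen reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Train the model to assign a higher implicit reward to the preferred
    # response than to the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```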
OPA-DPO addresses this with a simpler, more data-efficient approach to reducing hallucinations, requiring far less training data than previous methods. This work has been recognized with an oral presentation at CVPR 2025.
Limitations of current DPO methods
Previous approaches fall into three categories:
- Hallucination injection, which injects hallucinated fragments into standard responses. Preference pairs are then constructed by pairing standard responses with their corresponding hallucinated versions.
- Hallucination recognition, where models generate responses and humans or GPT-4/4v identify and correct hallucinations. Preference pairs are then constructed by pairing corrected responses with their original versions, as sketched in the example after this list.
- Self-evolution, where models generate multiple responses and a hallucination-recognition model ranks them by hallucination severity. Preference pairs are then constructed based on these rankings.
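As a concrete illustration of the recognition-style recipe, here is a hedged sketch of how such preference pairs might be assembled. `model_generate` and `expert_correct` are hypothetical stand-ins for the model's decoding step and the human/GPT-4v correction step; neither is a real API.

```python
def build_recognition_pairs(dataset, model_generate, expert_correct):
    """Hypothetical sketch of recognition-style preference-pair construction."""
    pairs = []
    for image, prompt in dataset:
        # 1. Let the current model describe the image.
        original = model_generate(image, prompt)
        # 2. An expert (human or GPT-4v) edits out hallucinated spans,
        #    keeping the accurate content untouched.
        corrected = expert_correct(image, prompt, original)
        # 3. The corrected response is preferred over the original one.
        pairs.append({
            "image": image,
            "prompt": prompt,
            "chosen": corrected,
            "rejected": original,
        })
    return pairs
```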

Among these, self-evolution tends to perform best, followed by recognition and then injection. However, all three approaches face limitations. Hallucination injection is weak because the fabricated content does not reflect the model’s own tendencies. Self-evolution is more effective but computationally costly. Recognition, while seemingly the most intuitive, underperforms in practice because expert-edited responses are often too different from the model’s natural outputs. Standard DPO struggles to learn from this “off-policy” data, leading to vanishing gradients and little improvement.
These challenges highlight the need for a method that can incorporate expert corrections while staying aligned with the model’s own output distribution.
OPA-DPO: Breaking convention, reshaping alignment strategy
To address these challenges, OPA-DPO introduces an on-policy alignment step before DPO training. With only 4.8k training samples, it achieves state-of-the-art performance, compared with the 16k samples required by previous state-of-the-art methods.

OPA-DPO aligns a model’s outputs with expert-preferred responses through a four-step process. First, the model generates responses from both the image and the prompt. Next, expert feedback, such as that from GPT-4v, is used to make fine-grained edits to these responses, correcting hallucinations while preserving accurate content.
The edited and ground-truth responses are then used to fine-tune the data-producing model via LoRA-SFT, resulting in what is referred to as the OPA model. Finally, DPO training is performed on the OPA model, incorporating language, image, and anchor preference pairs. Among these stages, the OPA step has the greatest impact on performance. This process is shown in Figure 3.
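A rough sketch of this four-step flow is given below. Every stage function passed in (`generate`, `expert_edit`, `lora_sft`, `build_pairs`, `dpo_train`) is a placeholder for the corresponding step described above, not the released implementation.

```python
def opa_dpo(base_model, dataset, generate, expert_edit,
            lora_sft, build_pairs, dpo_train):
    """Rough outline of the OPA-DPO training flow (stage functions are placeholders)."""
    # Step 1: sample on-policy responses from the current model.
    raw = [(img, p, generate(base_model, img, p)) for img, p in dataset]

    # Step 2: expert feedback (e.g., GPT-4v) makes fine-grained edits,
    # correcting hallucinated spans while keeping accurate content.
    edited = [(img, p, resp, expert_edit(img, p, resp))
              for img, p, resp in raw]

    # Step 3: on-policy alignment (OPA) -- LoRA-SFT on the edited and
    # ground-truth responses, so the expert corrections sit close to the
    # model's own output distribution.
    opa_model = lora_sft(base_model, edited)

    # Step 4: run DPO on the OPA model with language, image, and anchor
    # preference pairs built from the same data.
    return dpo_train(opa_model, build_pairs(edited))
```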

Researchers compared various DPO-based algorithms fine-tuned on LLaVA-1.5-7B and 13B. With only 4.8k training samples, OPA-DPO achieves state-of-the-art performance on 50% of hallucination metrics for LLaVA-Instruct-1.5-7B. This improves to 70% for LLaVA-Instruct-1.5-13B. OPA-DPO demonstrates particularly strong results on metrics that directly measure hallucination occurrence, such as CHAIR and HalRate. The results are shown in Table 1.
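For readers unfamiliar with these metrics, the sketch below shows how object-level CHAIR and a response-level hallucination rate are commonly computed. The `extract_objects` helper, which matches object mentions against ground-truth annotations, is an assumption for illustration and not the benchmarks' actual tooling.

```python
def chair_and_halrate(responses, gt_objects_per_image, extract_objects):
    """Sketch of common object-hallucination metrics (illustrative only)."""
    hallucinated_mentions = total_mentions = hallucinated_responses = 0
    for response, gt_objects in zip(responses, gt_objects_per_image):
        mentioned = extract_objects(response)          # objects named in the output
        fabricated = [o for o in mentioned if o not in gt_objects]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(fabricated)
        hallucinated_responses += bool(fabricated)     # at least one fabricated object

    chair_i = hallucinated_mentions / max(total_mentions, 1)    # per-object CHAIR
    hal_rate = hallucinated_responses / max(len(responses), 1)  # per-response rate
    return chair_i, hal_rate
```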

Evaluating OPA-DPO
To validate the importance of OPA and data volume, researchers conducted ablation studies. Even with 600 training samples, OPA-DPO performs better than most baseline algorithms on hallucination-related metrics. As the data volume increases, the performance of OPA-DPO steadily improves. Incorporating the OPA operation leads to a nearly 50% reduction in AMBER HalRate and Object-halCHAIRs.

They also experimented with LLaVA-OneVision as the base model. Although that model's outputs are detailed yet often redundant and contain numerous hallucinations, OPA-DPO substantially improved its hallucination metrics with only 2.4k training samples, achieving a 43.2% reduction in HalRate and a 38.7% improvement in CHAIR scores compared to baseline models.

OPA-DPO-trained models tend to adopt a conservative strategy, emphasizing salient, verifiable observations while paying less attention to ambiguous or less relevant details. As illustrated in Figure 5, the model focuses its description on the actions of the three individuals at the center of the image while deliberately ignoring peripheral elements such as trees and minor details such as backpacks, which the base models merely speculate about. By avoiding speculative or overly detailed content that could introduce hallucinations, the models prioritize clarity and reliability, which contributes to their improved performance on hallucination metrics.

Interestingly, base models often assume the query language is accurate, even when it contains hallucinations, leading to responses that reinforce false premises. In contrast, OPA-DPO-trained models demonstrate the ability to detect and reject hallucinated content embedded in the query itself. As shown in Figure 6, the model can identify fabricated elements, such as the mention of "hands" in the input prompt, and respond with clarifications or corrections rather than perpetuating the hallucination.

Beyond improving benchmark performance, OPA-DPO advances multimodal alignment more broadly: its strategy of turning expert feedback into on-policy training data marks a step forward for alignment training in vision-language models.