NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data

NeurIPS 2025

Group Relative Policy Optimization (GRPO) fine-tuning has been empirically shown to significantly enhance the reasoning abilities of language models. However, it often relies on large-scale, high-quality labeled data, which is typically difficult to obtain. To address this challenge, we introduce Noise-Aware Dual-Reward Optimization (NaDRO), which effectively enhances LLM training in environments where data is noisy or imperfect. NaDRO operates through two key components: **(1) Preference-based Outcome Reward (POR)**, which extracts reliable preference signals from noisy data, guiding LLMs toward more effective decisions instead of relying on specific noisy scores; and **(2) a Context Perception Reward (CPR) mechanism**, which ensures that LLMs perform the necessary qualitative assessment of the current problem state, rewarding accurate judgments to foster better cognitive understanding before decision-making. We designed experiments in the context of combinatorial optimization problems, where dynamically selecting heuristic algorithms is challenging due to large problem scales and the difficulty of obtaining accurate decision data. Our results indicate that the fine-tuned Qwen 7B and Llama 3-8B models outperform mainstream large language models on this task. Code is released at https://github.com/microsoft/HeurAgenix/tree/NaDRO.
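To make the dual-reward idea concrete, the minimal sketch below illustrates one plausible way such rewards could be combined for a GRPO rollout group: a rank-based preference signal stands in for POR (rewarding outcomes for being better than their siblings rather than for matching exact noisy scores), and a match-based bonus stands in for CPR (rewarding a correct qualitative read of the problem state). All function names, the rank-based formulation, and the `cpr_weight` parameter are illustrative assumptions, not the exact formulation used in the paper or the released code.

```python
import numpy as np

def preference_based_outcome_reward(noisy_scores):
    """Hypothetical POR: turn noisy absolute scores within a rollout group
    into a group-relative preference signal based on ranks, so the policy is
    rewarded for outperforming its siblings rather than for matching the
    exact (noisy) score values."""
    scores = np.asarray(noisy_scores, dtype=float)
    ranks = scores.argsort().argsort().astype(float)  # 0 = worst, n-1 = best
    if len(scores) > 1:
        # Normalize ranks to [-1, 1] so the signal is centered within the group.
        return 2.0 * ranks / (len(scores) - 1) - 1.0
    return np.zeros_like(scores)

def context_perception_reward(predicted_labels, reference_labels, bonus=0.5):
    """Hypothetical CPR: grant a fixed bonus when the model's qualitative
    assessment of the current problem state matches a reference judgment."""
    return np.array([
        bonus if pred == ref else 0.0
        for pred, ref in zip(predicted_labels, reference_labels)
    ])

def nadro_rewards(noisy_scores, predicted_labels, reference_labels, cpr_weight=1.0):
    """Combine the two signals into one per-rollout reward for GRPO."""
    por = preference_based_outcome_reward(noisy_scores)
    cpr = context_perception_reward(predicted_labels, reference_labels)
    return por + cpr_weight * cpr

if __name__ == "__main__":
    # Four rollouts for the same prompt: noisy objective scores plus the
    # model's qualitative assessment of the problem state.
    scores = [12.3, 11.9, 15.1, 10.2]
    preds = ["congested", "sparse", "congested", "congested"]
    refs = ["congested"] * 4
    print(nadro_rewards(scores, preds, refs))
```

The key design point this sketch captures is that the outcome reward depends only on relative orderings within a group, which is more robust to score noise than regressing onto the noisy values themselves, while the perception bonus encourages the model to assess the state before committing to a decision.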