Efficient and Near-Optimal Algorithm for General Contextual Dueling Bandits with Offline Regression Oracles

We study the contextual dueling bandit problem, in which a learner observes contextual information, selects two actions, and receives only relative (comparison) feedback. This problem is central to reinforcement learning from human feedback (RLHF), which is widely used in AI alignment to incorporate human preferences into AI models. Unlike prior work, we consider general preference relations and propose the first efficient, near-optimal regret algorithm that uses an offline regression oracle. Existing approaches rely on online oracles, which are often impractical for complex function classes and lead to poor performance in challenging settings. Our key contribution is an analysis of the contextual best-response regret and an algorithm achieving \(\tilde{O}(K\sqrt{T})\) regret over \(T\) rounds with \(K\) actions, with a bound that explicitly accounts for the performance of the offline oracle. We further extend our results to continuous decision spaces, obtaining a regret bound of \(\tilde{O}(\sqrt{dT})\), where \(d\) is the dimension of the decision space; this is the first such result for contextual dueling bandits with offline oracles over infinite action spaces. Our work resolves an open problem in efficient contextual dueling bandits, with potential applications in AI alignment and LLM fine-tuning. Our theoretical findings are supported by empirical results.
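To make the interaction protocol concrete, the following is a minimal, illustrative sketch of a contextual dueling bandit loop driven by an offline regression oracle. The linear preference model, the \(\epsilon\)-greedy pair selection, the periodic refit schedule, and the specific choices of \(K\), \(T\), and \(d\) are hypothetical simplifications for exposition only; this is not the algorithm analyzed in the paper and carries no regret guarantee.

```python
# Illustrative contextual dueling bandit loop with an offline regression oracle.
# All modeling choices below are hypothetical simplifications, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
K, T, d = 5, 2000, 3                        # actions, horizon, context dimension
theta_true = 0.3 * rng.normal(size=(K, d))  # hidden per-action utility weights

def pref_prob(theta, x, a, b):
    """P(action a preferred over action b | context x) under a linear model."""
    return np.clip(0.5 + (theta[a] - theta[b]) @ x, 0.0, 1.0)

def pair_features(x, a, b):
    """Flattened features so that z @ theta.ravel() = (theta_a - theta_b) @ x."""
    z = np.zeros(K * d)
    z[a * d:(a + 1) * d] = x
    z[b * d:(b + 1) * d] = -x
    return z

def offline_oracle(Z, y):
    """Offline regression oracle: least squares over all logged comparisons."""
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w.reshape(K, d)

Z_log, y_log, theta_hat = [], [], np.zeros((K, d))
for t in range(T):
    x = rng.normal(size=d)                              # observe context
    if t < 50 or rng.random() < 0.1:                    # explore: random pair
        a, b = rng.choice(K, size=2, replace=False)
    else:                                               # exploit: top-2 estimated utilities
        b, a = np.argsort(theta_hat @ x)[-2:]
    o = float(rng.random() < pref_prob(theta_true, x, a, b))  # relative feedback only
    Z_log.append(pair_features(x, a, b))
    y_log.append(o - 0.5)                               # centered comparison outcome
    if (t + 1) % 200 == 0:                              # periodic offline refit on logged data
        theta_hat = offline_oracle(np.array(Z_log), np.array(y_log))
```

The key structural point the sketch conveys is that the learner never observes absolute rewards, only the outcome of each pairwise comparison, and that the regression oracle is invoked offline on the logged comparison data rather than updated online at every round.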