RoboSeer: Video Generators Can Be Generalizable Robot Manipulators

NeurIPS 2025

Generalization in robot manipulation is essential for deploying robots in open-world environments and for advancing toward artificial general intelligence. While recent vision-language-action models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present RoboSeer, a new approach that shifts from understanding to generation. Rather than predicting only the next action, RoboSeer also imagines and generates the future visual outcome of that action. Built on a multi-modal Diffusion Transformer, RoboSeer jointly models the video, language, and action modalities, leveraging pre-trained video generative models to forecast visual futures and actions together. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. RoboSeer demonstrates strong generalization, including imitating skills demonstrated by other embodiments and handling novel objects. This dual-prediction strategy, forecasting both actions and their visual consequences, marks a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.
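
To make the dual-prediction idea concrete, the sketch below shows one way a multi-modal Diffusion Transformer could jointly denoise future video latents and an action chunk conditioned on language features. This is a minimal illustrative sketch, not the authors' implementation: all module names, dimensions, the token layout, and the conditioning scheme are assumptions, and the abstract does not specify these details.

```python
# Illustrative sketch (assumed design, not RoboSeer's actual code): a single
# Transformer backbone denoises video latents and action tokens jointly.
import torch
import torch.nn as nn

class JointVideoActionDiT(nn.Module):
    def __init__(self, latent_dim=64, action_dim=7, hidden=512,
                 layers=8, heads=8, text_dim=512):
        super().__init__()
        # Separate projections map video latents, actions, and language
        # features into a shared token space.
        self.video_in = nn.Linear(latent_dim, hidden)
        self.action_in = nn.Linear(action_dim, hidden)
        self.text_in = nn.Linear(text_dim, hidden)
        self.time_emb = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        block = nn.TransformerEncoderLayer(
            hidden, heads, 4 * hidden, batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        # Two heads: predicted noise for future video latents and actions.
        self.video_out = nn.Linear(hidden, latent_dim)
        self.action_out = nn.Linear(hidden, action_dim)

    def forward(self, noisy_video, noisy_actions, text_tokens, t):
        # noisy_video:   (B, Tv, latent_dim)  noised future video latents
        # noisy_actions: (B, Ta, action_dim)  noised future action chunk
        # text_tokens:   (B, Tl, text_dim)    frozen language-encoder features
        # t:             (B,)                 diffusion timestep in [0, 1]
        cond = self.time_emb(t[:, None])[:, None, :]  # (B, 1, hidden)
        tokens = torch.cat([
            self.text_in(text_tokens),
            self.video_in(noisy_video) + cond,
            self.action_in(noisy_actions) + cond,
        ], dim=1)
        # Full attention lets action tokens attend to the imagined frames,
        # coupling action prediction with visual forecasting.
        h = self.backbone(tokens)
        Tl, Tv = text_tokens.shape[1], noisy_video.shape[1]
        return self.video_out(h[:, Tl:Tl + Tv]), self.action_out(h[:, Tl + Tv:])

# Toy forward pass with random tensors:
model = JointVideoActionDiT()
eps_video, eps_action = model(
    torch.randn(2, 16, 64),   # 16 future video latent tokens
    torch.randn(2, 8, 7),     # 8-step action chunk (e.g., 7-DoF poses)
    torch.randn(2, 12, 512),  # 12 language tokens
    torch.rand(2))            # diffusion timesteps
```

Under this sketch, training would add noise to both the future video latents and the action chunk and regress the two noise predictions jointly, so that, as the abstract suggests, the quality of the imagined future and the reliability of the predicted actions are optimized together.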