About the workshop
Full workshop title: The 5th Workshop on Computer Vision in the Wild (CVinW): Towards Unified Multimodal Agents for Reasoning in the Wild
- Note: The date of this workshop is tentative, so please check the official workshop page for the final agenda.
Host conference: The Conference on Computer Vision and Pattern Recognition (CVPR) | June 3-4, 2026
Workshop organizers: Reuben Tan, Zhengyuan Yang
Workshop scientific advisor: Jianfeng Gao
Speakers:
- Kate Saenko, Meta AGI Foundations
- Chelsea Finn, Stanford & PI
- Manling Li, Northwestern University
- Xiaolong Wang, UCSD & Nvidia
- Mohit Bansal, University of North Carolina at Chapel Hill
The 5th CVinW workshop brings together researchers building multimodal AI agents that can perceive, reason, and act in digital and physical environments. The workshop focuses on capabilities where today’s agentic models still struggle, including but not limited to fine-grained spatiotemporal reasoning, causal inference, long-horizon planning and memory, and robust tool use. It convenes both academia and industry to discuss approaches, datasets, and benchmarks for robust agents that complete complex tasks “in the wild.”
This year’s edition emphasizes the intersection of large multimodal models (LMMs) and vision-language-action (VLA) models, and the full loop from representation to inference to decision-making. Topics include structured reasoning strategies (e.g., chain-/tree-of-thought, program-aided reasoning), long-horizon planning and memory, and evaluation protocols that diagnose reasoning (not just recognition).
Challenges
To measure progress with fine-grained evaluations and public leaderboards, the workshop proposes two challenges:
- MindCube (Spatial Mental Models under Partial Observability)
  - Evaluates whether VLMs can form robust spatial mental models from limited viewpoints, capturing positions, orientations, and counterfactual “what-if” dynamics.
- SITE (Standardized, Cross-modal Spatial Intelligence Thorough Evaluation)
  - Evaluates spatial intelligence across single-image, multi-image, and video modalities and across spatial factors (scale, visualization vs. orientation, intrinsic vs. extrinsic frames, static vs. dynamic).
Call for contributions
- Submissions of published and unpublished works are welcome.
- Accepted works will be presented as posters and spotlights at the workshop.
Important dates
- Workshop papers
  - Paper submission deadline: April 21, 2026
  - Notification: May 19, 2026
  - Camera-ready deadline: June 2, 2026
- Challenges
  - Start date: January 21, 2026
  - End date: June 2, 2026