Microsoft Research Blog

VITRA Redefines VLA Pre-training Paradigms via Human Video Reconstruction 

May 29, 2026
When you see robots participating in running races or performing folk dances on stage, you might envision a future where a simple natural language command is all it takes for a robot to tidy up a desk, clean a room, or even serve tea. For…

Recent Posts

  1. Fara1.5 – A family of frontier computer use agent models

    May 21, 2026

    By Ahmed Awadallah, Sahil Gupta, Yash Lara, Yadong Lu, Hussein Mozannar, Akshay Nambi, Zach Nussbaum, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Luiz do Valle, Vibhav Vineet, Spencer Whitehead, Andrew Zhao

    We are excited to introduce the Fara1.5 family of computer use agent (CUA)…

  2. Whimsical Strategies Break AI Agents: Generating Out-of-Distribution Adversarial Strategies at Scale

    May 6, 2026

    By Zachary Huang, Tyler Payne, Gagan Bansal, Will Epperson, Wenyue Hua, Adam Fourney, Amanda Swearngin, Maya Murad, Ece Kamar, Saleema Amershi

    As AI agents are increasingly deployed to handle real transactions and negotiations, they can exhibit vulnerabilities that traditional safety testing struggles to fully capture. Our prior work on Magentic Marketplace found significant vulnerability for smaller…

  3. Webwright: A Terminal Is All You Need For Web Agents

    May 4, 2026

    By Yadong Lu (Microsoft Research), Lingrui Xu (The University of Hong Kong), Chao Huang (The University of Hong Kong), Ahmed Awadallah (Microsoft Research)

    Instead of solving web tasks by predicting where to click one at a time, we only give the model a terminal where it has the full freedom to spawn browser…

  4. The Art of Building Verifiers for Computer Use Agents

    April 21, 2026

    By Corby Rosset, Pratyusha Sharma, Andrew Zhao, Miguel Gonzalez-Fernandez, Ahmed Awadallah

    We share lessons learned from building a best-in-class verifier for computer use agent trajectories on the web, called the Universal Verifier. False positive rates drop to near zero (vs. ≥45% for WebVoyager, ≥22% for…

  5. Memento: Teaching LLMs to Manage Their Own Context

    April 8, 2026

    By Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos

    We taught models to compress their own chain-of-thought mid-generation. Peak KV cache drops 2–3x, throughput nearly doubles, and the erased reasoning blocks leave traces…

  6. Actions Speak Louder Than Prompts: Rethinking How LLMs Reason Over Graph Data

    March 3, 2026

    By Ben Finkelshtein (University of Oxford), Silviu Cucerzan, Sujay Kumar Jauhar, and Ryen W. White (Microsoft)

    Think about the last time you opened a shared document at work. Behind that simple action lies a complex network of relationships: the colleagues who edited the file before you, the team site on…

  7. Experiential Reinforcement Learning

    February 20, 2026

    By Taiwei Shi, Sihao Chen, Longqi Yang, Jaime Teevan

    Reinforcement Learning is at the core of building and improving frontier AI models and products. Yet most state-of-the-art RL methods learn primarily from outcomes: a scalar reward signal that says whether an attempt worked, not why…

  8. From One to Many

    February 9, 2026

    By Jaime Teevan, Chief Scientist & Technical Fellow

    In recent years we’ve all lived through the transition to cloud computing, a sudden shift to remote work, and now the rapid rise of AI. Each individually has felt like a seismic event, but in reality they…

  9. Phi-Ground: Improving how AI agents navigate screen interfaces

    January 18, 2026

    Imagine an AI assistant that can navigate a computer the same way humans do—clicking buttons, filling out forms, and moving between applications—all by simply interpreting what's on the screen. This vision is becoming a reality through computer use agents—AI systems designed to operate software interfaces…

  10. Deep Video Discovery: Using agentic search to analyze long-form video

    December 19, 2025

    Extracting useful information from long videos, whether meeting recordings, experimental data, or lecture content, requires painstaking manual review. AI tools offer some help: language-vision models can summarize short clips or answer questions when videos are divided into clear scenes or chapters. But for hours‑long recordings…

Explore More

  • Events & conferences

    Meet our community of researchers, learn about exciting research topics, and grow your network

  • Podcasts

    Ongoing conversations at the cutting edge of research

  • Microsoft Research Forum

    Join us for a continuous exchange of ideas about research in the era of general AI