Microsoft Research Blog

Artificial intelligence

  1. Evaluating LLM Reasoning Beyond Correctness and CoT 

    February 12, 2026 | Soheil Abbasloo

    What does it truly mean for a language model to “reason”? Current evaluations reward models’ correct standalone answers—but correctness alone reveals little about the process that produced them. We argue that reasoning should be understood not as a static chain of steps but as a…

  2. AgentRx: Diagnosing AI Agent Failures from Execution Trajectories 

    February 2, 2026

    AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and releasing a novel benchmark of 115 failed trajectories spanning structured API…

  3. RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents 

    February 1, 2026

    Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters, enabling dynamic execution, debugging, and interactive programming. While these advancements have streamlined complex workflows, they have also introduced critical safety and security risks. Current static safety…

  4. When does predictive inverse dynamics outperform behavior cloning? 

    January 29, 2026

    Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent work has introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model (IDM).…

  5. Scaling medical imaging report generation with multimodal reinforcement learning 

    January 23, 2026

    Frontier models have demonstrated remarkable capabilities in understanding and reasoning with natural-language text, but they still exhibit major competency gaps in multimodal understanding and reasoning, especially in high-value verticals such as biomedicine. Medical imaging report generation is a prominent example. Supervised fine-tuning can substantially improve…

  6. SALAD-VAE: Semantic Audio Compression with Language-Audio Distillation 

    Modern generative and multimodal models increasingly rely on compact latent representations that balance semantic richness with high-fidelity reconstruction. We introduce SALAD-VAE, a continuous and highly compact semantic Audio Variational Autoencoder, which operates in the frequency domain and achieves state-of-the-art compression with very low…

  7. Towards Real-Time Generative Speech Restoration with Flow-Matching 

    January 1, 2026 | Tsun-An Hsieh and Sebastian Braun

    Generative models have shown robust performance on speech enhancement and restoration tasks, but most prior approaches operate offline with high latency, making them unsuitable for streaming applications. In this work, we investigate the feasibility of a low-latency, real-time generative speech restoration system based on flow-matching…

  8. Sci-Phi: A Large Language Model Spatial Audio Descriptor 

    January 1, 2026 | Xilin Jiang, Sebastian Braun, and Hannes Gamper

    Acoustic scene perception involves describing the types of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language…

  9. Adapting Language Models for Low-Resource Programming Languages 

    December 20, 2025

    Large Language Models (LLMs) have achieved remarkable success in code generation, yet their capabilities remain predominantly concentrated in well-resourced programming languages such as Python and Java. In contrast, low-resource programming languages present a significant challenge due to limited available data and unique syntax features. In…

  10. Multimodal AI generates virtual population for tumor microenvironment modeling 

    December 9, 2025

    The tumor immune microenvironment (TIME) critically impacts cancer progression and immunotherapy response. Multiplex immunofluorescence (mIF) is a powerful imaging modality for deciphering TIME, but its applicability is limited by high cost and low throughput. We propose GigaTIME, a multimodal AI framework for population-scale TIME modeling by…