Microsoft Research Blog

  1. Kernel‑level innovation and hardware‑aware modeling

    October 22, 2025

    We design and optimize GPU kernels and model‑execution strategies to maximize throughput and minimize latency for real‑world LLM workloads. Interactive enterprise scenarios often run at low batch sizes, interleave very long contexts, and have strict latency targets—exposing different bottlenecks than training. Our work includes attention‑kernel…

  2. System‑level innovation for inference at scale

    October 22, 2025

    We reimagine the AI inference stack to be workload-aware, cost-aware, and resilient at global scale. Our research explores innovative resource allocation, request scheduling, batching, routing, and KV caching techniques that directly benefit Microsoft's inference infrastructure. Our goal is to bridge the gap between deployed…

  3. Efficient AI applications: context engineering and agents

    October 22, 2025

    Modern AI systems face a dual challenge: delivering high‑quality outputs while staying cost- and latency‑efficient. Every token processed and every millisecond of compute impacts scalability, user experience, and sustainability. Efficiency isn’t just an optimization; it’s a design principle that makes AI applications feasible and scalable.…

  4. Efficient AI

    October 22, 2025

    Reimagining AI efficiency from GPU kernels to context engineering to power Copilot-scale intelligence.

  5. Language Ranker: A Lightweight Ranking Framework for LLM Decoding

    October 22, 2025

    Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distributions into final responses. Recent advances, such as scaling inference-time computation with reward models, have underscored the…
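
    For orientation, the reward-model approach the teaser alludes to is often realized as best-of-N reranking: sample several candidate responses and keep the one a reward model scores highest. A minimal sketch, assuming placeholder `generate` and `reward` callables (not the Language Ranker API):

    ```python
    # Best-of-N decoding sketch. `generate` samples one candidate response;
    # `reward` scores a (prompt, response) pair. Both names are placeholders.
    def best_of_n(prompt, generate, reward, n=8):
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda response: reward(prompt, response))
    ```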

  6. Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples 

    October 22, 2025 | Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, and Daniela Rus

    Recently, Sharma et al. suggested a method called LAyer-SElective Rank reduction (LASER), which demonstrated that pruning the high-order components of carefully chosen LLM weight matrices can boost downstream accuracy without any gradient-based fine-tuning. Yet LASER's exhaustive per-matrix search, in which each candidate requires full-dataset forward passes, makes it impractical…
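
    The core operation behind LASER is replacing a chosen weight matrix with a truncated-SVD approximation. A minimal PyTorch sketch of that single step (not the authors' code; the layer and retained rank below are hypothetical, and LASER searches over both):

    ```python
    import torch

    def rank_reduce(weight: torch.Tensor, keep_frac: float = 0.1) -> torch.Tensor:
        # Keep only the top singular components; discard the high-order rest.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        k = max(1, int(keep_frac * S.numel()))  # number of singular values kept
        return (U[:, :k] * S[:k]) @ Vh[:k, :]   # rank-k reconstruction

    # Hypothetical usage on one transformer MLP projection:
    # layer = model.model.layers[20].mlp.down_proj
    # layer.weight.data = rank_reduce(layer.weight.data, keep_frac=0.05)
    ```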

  7. Tell me when: Building agents that can wait, monitor, and act

    October 21, 2025

    SentinelStep enables AI agents to handle monitoring tasks that run for hours or days, such as watching for emails or tracking prices. It works by managing when agents should check and how their context is maintained, avoiding wasted resources and missed updates.
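
    As a rough illustration of the polling problem such agents face (not SentinelStep's implementation), here is a monitoring loop with exponential backoff, assuming placeholder `check` and `act` callables:

    ```python
    import time

    def monitor(check, act, interval=60.0, max_interval=3600.0):
        # Poll until `check` returns an observation, backing off between
        # checks to avoid wasting resources on long-running tasks.
        while True:
            observation = check()          # e.g., poll a mailbox or a price feed
            if observation is not None:
                act(observation)           # condition met: let the agent act
                return
            time.sleep(interval)
            interval = min(interval * 2, max_interval)
    ```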

  8. Kirby-Judge: Think Once, Judge Anywhere 

    October 20, 2025 | Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, and SeYoung Yun

    Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators (i.e., LLM-as-a-Judge) significantly reduce the costs associated with manual annotation, they typically require extensive modality-specific training data and…
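
    For readers new to the pattern, LLM-as-a-Judge means prompting one model to grade another model's output. A minimal sketch, assuming a placeholder `complete` callable; the rubric and 1-5 scale are illustrative, not this paper's protocol:

    ```python
    JUDGE_TEMPLATE = (
        "You are an impartial judge. Rate the response to the instruction on a "
        "1-5 scale for helpfulness and accuracy. Reply with the number only.\n\n"
        "Instruction: {instruction}\nResponse: {response}\nScore:"
    )

    def judge_score(complete, instruction, response):
        # `complete` is a placeholder for any text-completion call.
        reply = complete(JUDGE_TEMPLATE.format(instruction=instruction,
                                               response=response))
        return int(reply.strip().split()[0])  # parse the leading numeric score
    ```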