Efficient AI
We are building more efficient AI systems by automating quantization through a novel compiler-based approach grounded in programming language (PL) techniques. This automation enables scalable, energy-efficient deployment of models for on-device inference. In parallel, we are advancing the state of the art in quantization itself, with methods such as QuaRot and ongoing research into new compression strategies.
Our work is already powering real-world applications: our quantization pipeline is used in Copilot+ PCs via Phi Silica, and multiple models are now integrated into the AI Toolkit in Visual Studio Code. By combining automation with cutting-edge quantization, we aim to make high-performance AI more accessible and sustainable.
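To illustrate the core idea behind rotation-based methods such as QuaRot, the sketch below shows how multiplying weights and activations by an orthogonal matrix leaves a layer's output unchanged in exact arithmetic while spreading outlier energy across channels, shrinking the dynamic range that 4-bit quantization must cover. This is a minimal NumPy sketch, not the QuaRot implementation: QuaRot uses structured Hadamard transforms and per-channel details omitted here, and the helper names (random_orthogonal, quantize_int4) are hypothetical.

```python
import numpy as np

def random_orthogonal(n, seed=0):
    """Random orthogonal matrix via QR decomposition (a stand-in for the
    Hadamard transforms used in QuaRot; hypothetical helper)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantization: scale to [-7, 7], then round."""
    scale = np.abs(x).max() / 7.0
    return np.round(x / scale).clip(-8, 7), scale

def dequantize(q, scale):
    return q * scale

# Toy linear layer with an activation outlier in one channel.
d = 64
rng = np.random.default_rng(1)
W = rng.standard_normal((d, d)) * 0.02
x = rng.standard_normal((1, d))
x[0, 3] *= 50.0                      # outlier channel dominates the scale

Q = random_orthogonal(d)
y_ref = x @ W                        # exact: (x Q)(Q^T W) == x W for orthogonal Q

# Quantize without the rotation.
xq, sx = quantize_int4(x)
wq, sw = quantize_int4(W)
y_plain = dequantize(xq, sx) @ dequantize(wq, sw)

# Quantize after rotating activations and weights.
xr, sxr = quantize_int4(x @ Q)
wr, swr = quantize_int4(Q.T @ W)
y_rot = dequantize(xr, sxr) @ dequantize(wr, swr)

print("error without rotation:", np.linalg.norm(y_plain - y_ref))
print("error with rotation:   ", np.linalg.norm(y_rot - y_ref))
```

On this toy example the rotated path typically reports a much smaller reconstruction error, because the single outlier channel no longer dictates the quantization scale.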
We are also designing next-generation optimizers through rigorous, first-principles investigation of model training and its intrinsic properties: we analyze learning dynamics to characterize model behavior and move beyond ad hoc, empirical trial-and-error approaches.
By integrating theoretical frameworks from optimization theory, dynamical systems, and statistical learning theory, we seek to derive principled enhancements in the training of large-scale models.
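The sketch below illustrates the flavor of a stateless update in the spirit of SWAN: the raw gradient is first normalized and then approximately whitened via a Newton-Schulz iteration, so no momentum or second-moment buffers are stored between steps. This is a hedged sketch under simplifying assumptions, not the published SWAN algorithm; the exact normalization and whitening operators and all hyperparameters are defined in the paper, and the helper names here are hypothetical.

```python
import numpy as np

def newton_schulz_whiten(g, steps=5, eps=1e-7):
    """Approximately whiten a gradient matrix with the Newton-Schulz
    iteration, pushing its singular values toward 1 (hypothetical helper,
    standing in for the whitening step of SWAN-like optimizers)."""
    x = g / (np.linalg.norm(g) + eps)   # scale so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def stateless_step(w, grad, lr=5e-2, eps=1e-7):
    """One stateless update: normalize the gradient row-wise, then whiten.
    Nothing is carried over between steps, unlike Adam's moment buffers."""
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)
    g = newton_schulz_whiten(g)
    return w - lr * g

# Toy usage on a quadratic: minimize 0.5 * ||W - target||^2.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
target = rng.standard_normal((8, 8))
for _ in range(200):
    grad = W - target               # gradient of the quadratic loss
    W = stateless_step(W, grad)
print("final loss:", 0.5 * np.sum((W - target) ** 2))
```

Because the whitened update has roughly unit-scale singular values, the step behaves like a sign-style method: it makes steady progress without per-parameter state, at the cost of oscillating near the optimum unless the learning rate is decayed.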
Learn more:
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
Publication | February 2025
SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Publication | February 2025
Gradient Multi-Normalization for Stateless and Scalable LLM Training
Publication | February 2025
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Publication | December 2024
Low-Rank Correction for Quantized LLMs
Publication | December 2024
TurboAttention: Efficient Attention Approximation For High Throughputs LLMs
Publication | December 2024
Pyramid Vector Quantization for LLMs
Publication | October 2024