Efficient AI
We are building more efficient AI systems by automating quantization through a novel compiler-based approach grounded in programming language (PL) techniques. This automation enables scalable, energy-efficient deployment of models for on-device inference. In parallel, we are advancing the state of the art in quantization itself, with methods such as QuaRot and ongoing research into new compression strategies.
Our work is already powering real-world applications: our quantization pipeline is used in Copilot+ PCs via Phi Silica, and multiple models are now integrated into the AI Toolkit in Visual Studio Code. By combining automation with cutting-edge quantization, we aim to make high-performance AI more accessible and sustainable.
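To illustrate the core idea behind rotation-based methods such as QuaRot, the sketch below shows how multiplying weights and activations by an orthogonal matrix leaves a layer's output unchanged in exact arithmetic while spreading outlier energy across channels, shrinking the dynamic range that 4-bit quantization must cover. This is a minimal NumPy sketch, not the QuaRot implementation: QuaRot uses structured Hadamard transforms and per-channel details omitted here, and the helper names (random_orthogonal, quantize_int4) are hypothetical.

```python
import numpy as np

def random_orthogonal(n, seed=0):
    """Random orthogonal matrix via QR decomposition (a stand-in for the
    Hadamard transforms used in QuaRot; hypothetical helper)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantization: scale to [-7, 7], then round."""
    scale = np.abs(x).max() / 7.0
    return np.round(x / scale).clip(-8, 7), scale

def dequantize(q, scale):
    return q * scale

# Toy linear layer with an activation outlier in one channel.
d = 64
rng = np.random.default_rng(1)
W = rng.standard_normal((d, d)) * 0.02
x = rng.standard_normal((1, d))
x[0, 3] *= 50.0                      # outlier channel dominates the scale

Q = random_orthogonal(d)
y_ref = x @ W                        # exact: (x Q)(Q^T W) == x W for orthogonal Q

# Quantize without the rotation.
xq, sx = quantize_int4(x)
wq, sw = quantize_int4(W)
y_plain = dequantize(xq, sx) @ dequantize(wq, sw)

# Quantize after rotating activations and weights.
xr, sxr = quantize_int4(x @ Q)
wr, swr = quantize_int4(Q.T @ W)
y_rot = dequantize(xr, sxr) @ dequantize(wr, swr)

print("error without rotation:", np.linalg.norm(y_plain - y_ref))
print("error with rotation:   ", np.linalg.norm(y_rot - y_ref))
```

On this toy example the rotated path typically reports a much smaller reconstruction error, because the single outlier channel no longer dictates the quantization scale.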
We are also designing next-generation optimizers through rigorous, first-principles investigation of model training and its intrinsic properties: we analyze learning dynamics to characterize model behavior and move beyond ad hoc, empirical trial-and-error approaches.
By integrating theoretical frameworks from optimization theory, dynamical systems, and statistical learning theory, we seek to derive principled enhancements in the training of large-scale models.
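The sketch below illustrates the flavor of a stateless update in the spirit of SWAN: the raw gradient is first normalized and then approximately whitened via a Newton-Schulz iteration, so no momentum or second-moment buffers are stored between steps. This is a hedged sketch under simplifying assumptions, not the published SWAN algorithm; the exact normalization and whitening operators and all hyperparameters are defined in the paper, and the helper names here are hypothetical.

```python
import numpy as np

def newton_schulz_whiten(g, steps=5, eps=1e-7):
    """Approximately whiten a gradient matrix with the Newton-Schulz
    iteration, pushing its singular values toward 1 (hypothetical helper,
    standing in for the whitening step of SWAN-like optimizers)."""
    x = g / (np.linalg.norm(g) + eps)   # scale so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def stateless_step(w, grad, lr=5e-2, eps=1e-7):
    """One stateless update: normalize the gradient row-wise, then whiten.
    Nothing is carried over between steps, unlike Adam's moment buffers."""
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)
    g = newton_schulz_whiten(g)
    return w - lr * g

# Toy usage on a quadratic: minimize 0.5 * ||W - target||^2.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
target = rng.standard_normal((8, 8))
for _ in range(200):
    grad = W - target               # gradient of the quadratic loss
    W = stateless_step(W, grad)
print("final loss:", 0.5 * np.sum((W - target) ** 2))
```

Because the whitened update has roughly unit-scale singular values, the step behaves like a sign-style method: it makes steady progress without per-parameter state, at the cost of oscillating near the optimum unless the learning rate is decayed.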
Learn more:
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
Publication | February 2025
SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Publication | February 2025
Gradient Multi-Normalization for Stateless and Scalable LLM Training
Publication | February 2025
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Publication | December 2024
Low-Rank Correction for Quantized LLMs
Publication | December 2024
TurboAttention: Efficient Attention Approximation For High Throughputs LLMs
Publication | December 2024
Pyramid Vector Quantization for LLMs
Publication | October 2024