About
Leading a cross-team and cross-org initiative on [Efficient AI at Scale]. Our focus is on efficient learning of massive neural networks for both model efficiency (e.g., neural architecture search, model compression, sparse and modular learning) and data efficiency (e.g., zero-shot and few-shot learning, semi-supervised learning). We develop state-of-the-art computationally efficient models and techniques that enable AI practitioners, researchers, and engineers to use large-scale models in practice. Our technologies have been deployed in several enterprise scenarios including Turing, Bing, and Microsoft 365.
Honors: 2022 MIT Technology Review Innovators Under 35 semi-finalist (listed among 100 innovators under 35 worldwide) for work on Efficient AI.
Prior to joining MSR, I led the information extraction efforts to build the Amazon Product Knowledge Graph. I graduated summa cum laude from the Max Planck Institute for Informatics, Germany, with a PhD in 2017, and was awarded the 2018 SIGKDD Doctoral Dissertation Runner-up Award for my thesis on credibility analysis and misinformation. Before that, I worked at IBM Research on domain adaptation of question-answering systems, sentiment analysis, and opinion mining.
Comprehensive list of publications in [Google Scholar] [Semantic Scholar] [DBLP].
Refer to [recent news] for updates!
I have been fortunate to collaborate with several talented PhD interns and researchers. I try to maintain a list [here].
Featured content
AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models
Releasing the AdaMix code for parameter-efficient tuning of large language models. By tuning only 0.1–0.2% of PLM parameters, AdaMix outperforms SOTA methods (e.g., LoRA, Adapters) and is the first method to outperform full model fine-tuning for both NLU and NLG tasks.
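For illustration, here is a minimal PyTorch sketch of the mixture-of-adapters idea: several bottleneck adapters sit alongside a frozen backbone layer, one is chosen at random per training step (stochastic routing), and at inference they are collapsed into a single cheap module. The module names, sizes, and the output averaging at inference are simplifications for exposition, not the released AdaMix API.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden_size: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class MixtureOfAdapters(nn.Module):
    """Holds several adapters. During training a random one is used per forward
    pass (stochastic routing); at inference their predictions are averaged here
    as a stand-in for the weight merging described in the paper, so serving cost
    stays close to a single adapter."""
    def __init__(self, hidden_size: int, num_adapters: int = 4, bottleneck: int = 16):
        super().__init__()
        self.adapters = nn.ModuleList(
            Adapter(hidden_size, bottleneck) for _ in range(num_adapters)
        )

    def forward(self, x):
        if self.training:
            idx = torch.randint(len(self.adapters), (1,)).item()
            return self.adapters[idx](x)
        return torch.stack([a(x) for a in self.adapters]).mean(dim=0)
```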
AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
AutoDistil leverages Neural Architecture Search and knowledge distillation to generate a gallery of compressed models in a single run, spanning a range of computational cost (FLOPs) and performance trade-offs, with as much as a 41x reduction in computation.
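To make "variable computational cost" concrete, the back-of-the-envelope sketch below estimates per-token FLOPs for a few transformer encoder shapes. The candidate sizes and the counting conventions are illustrative and are not taken from the AutoDistil search space.

```python
def transformer_flops_per_token(num_layers: int, hidden: int, ffn: int, seq_len: int = 128) -> float:
    """Rough per-token FLOPs estimate for a transformer encoder
    (one multiply-accumulate counted as 2 FLOPs; embeddings/softmax ignored)."""
    attn_proj = 4 * hidden * hidden      # Q, K, V, and output projections
    attn_scores = 2 * seq_len * hidden   # QK^T scores and attention-weighted sum
    ffn_cost = 2 * hidden * ffn          # two feed-forward matmuls
    return 2.0 * num_layers * (attn_proj + attn_scores + ffn_cost)

# Hypothetical sub-networks at different points of the cost/accuracy trade-off:
for name, (layers, hidden, ffn) in {
    "teacher-like":  (12, 768, 3072),
    "mid-student":   (6, 512, 2048),
    "small-student": (6, 256, 1024),
}.items():
    gflops = transformer_flops_per_token(layers, hidden, ffn) / 1e9
    print(f"{name}: ~{gflops:.3f} GFLOPs/token")
```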
LiST: Lite Self-training Makes Efficient Few-shot Learners
Releasing code for LiST, a new SOTA fine-tuning method for few-shot learning with large pre-trained language models. LiST improves over traditional fine-tuning by 35% and over prompt-tuning by 6%, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from the target domain.
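As a toy illustration of the self-training loop that LiST builds on, the sketch below trains a scikit-learn classifier on 30 labeled examples and iteratively pseudo-labels confident unlabeled examples. It is a generic confidence-thresholded self-training recipe on synthetic data, not the LiST implementation (which instead tunes lightweight modules of a frozen PLM).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

labeled = np.zeros(len(y), dtype=bool)
labeled[:30] = True                      # pretend only 30 examples are labeled
pseudo_y = np.where(labeled, y, -1)      # -1 marks "no label yet"

clf = LogisticRegression()
for _ in range(3):                                       # a few self-training rounds
    clf.fit(X[labeled], pseudo_y[labeled])
    probs = clf.predict_proba(X[~labeled])
    confident = probs.max(axis=1) >= 0.9                 # keep confident predictions
    idx = np.where(~labeled)[0][confident]
    pseudo_y[idx] = probs.argmax(axis=1)[confident]      # pseudo-label them
    labeled[idx] = True                                  # add to the training pool

print("accuracy on all data:", (clf.predict(X) == y).mean())
```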
Constrained Language Understanding Evaluation Standard (CLUES) - Benchmark for Few-shot Learning
Releasing CLUES (NeurIPS 2021 Benchmark), a few-shot learning benchmark for natural language understanding. Moving beyond traditional GLUE and SuperGLUE benchmarks with thousands of training labels, CLUES evaluates the true few-shot learning performance of large language models, akin to how humans learn from only a few demonstrative examples. The benchmark provides a gold standard for comparing large language model performance to that of humans.
XtremeDistilTransformers: Massive Distillation/Compression of Massive Multilingual Neural Networks
Releasing XtremeDistilTransformers code and checkpoints (ACL 2020): extremely small, distilled, task-agnostic transformer models that leverage task transfer to learn small universal models applicable to arbitrary tasks and languages. We release three distilled task-agnostic checkpoints on HuggingFace, with 13M, 22M, and 33M parameters, obtaining SOTA performance on several GLUE tasks and SQuAD.
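The distilled checkpoints can be used as drop-in encoders via HuggingFace Transformers. The snippet below loads the smallest one; the model identifier is as listed on the HuggingFace hub, and the larger sizes follow the same naming pattern.

```python
from transformers import AutoTokenizer, AutoModel

# Load the smallest distilled checkpoint (6 layers, hidden size 256).
name = "microsoft/xtremedistil-l6-h256-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Distilled models are small and fast.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, sequence_length, 256)
```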
Uncertainty-aware Self-training for Few-shot Text Classification
Releasing UST code (NeurIPS 2020, Spotlight) for few-shot training of pre-trained language models (e.g., BERT, GPT) with only a few labeled examples (e.g., 20-30) and large amounts of unlabeled data. Integrated with HuggingFace.
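To give a flavor of the uncertainty estimate involved, here is a minimal Monte-Carlo-dropout sketch in PyTorch: dropout stays active at inference, several stochastic forward passes are averaged, and the prediction variance indicates which unlabeled examples are certain enough to pseudo-label. This sketches the general technique only, not the released UST code.

```python
import torch

def mc_dropout_predict(model, x, T: int = 10):
    """Run T stochastic forward passes with dropout enabled and return the
    mean class probabilities plus a per-example variance-based uncertainty."""
    model.train()                        # keep dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    mean = probs.mean(dim=0)             # averaged class probabilities
    var = probs.var(dim=0).sum(dim=-1)   # higher value = more uncertain example
    return mean, var

# Toy classifier with dropout, just to exercise the function.
toy = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                          torch.nn.Dropout(0.3), torch.nn.Linear(32, 2))
mean, var = mc_dropout_predict(toy, torch.randn(8, 16))
print(mean.shape, var.shape)             # torch.Size([8, 2]) torch.Size([8])
```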