About
I received my B.S. degree in computer science from ShanghaiTech University and my Ph.D. degree in electronic and computer engineering from the Hong Kong University of Science and Technology.
During my Ph.D. studies, I worked at the intersection of signal processing and machine learning, aiming to demystify deep learning with signal processing tools such as sparse coding. I was also selected as one of the world's top 2% of scientists.
After graduation, I joined Microsoft. With large language models (LLMs) becoming a key focus for productivity, I shifted my research toward LLMs and large multimodal models (LMMs) to align with industry interests.
I am currently leading a project on vision-language model (VLM) training and contributing to image generation and editing work at MSRA.
My current research interests span unified models, the Muon optimizer, SFT versus RL, and algorithmic aspects of neural architectures. I conducted some of the earliest explorations in several emerging areas, including:
1. Understanding when generation benefits understanding in unified models (ongoing);
2. Investigating the emergence of reasoning and planning abilities in LLMs and analyzing the theoretical gap between SFT and RL:
- NeurIPS’24 | ALPINE: Unveiling The Planning Capability of Autoregressive Learning in Language Models
- Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
3. Unrolling classical algorithms into neural architectures—a modern revival now appearing under names such as “looped models” or “test‑time regression” (a minimal code sketch follows this list):
- ICLR’23 (Oral) | Sparse Mixture-of-Experts are Domain Generalizable Learners
- Some earlier works on GNNs (WCGCN, GF-CF) and CNNs (FPN-OAMP).
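To make the unrolling idea concrete, here is a minimal sketch (my own illustration, not code from any of the papers above) of unrolling ISTA for sparse coding into a small learned network, in the spirit of LISTA: each network layer corresponds to one iteration of the classical algorithm, with the step matrices and thresholds made learnable. All names (`UnrolledISTA`, `n_layers`, etc.) are hypothetical.

```python
# Illustrative sketch only: ISTA for sparse coding, unrolled into a small
# learnable network (LISTA-style). Each "layer" is one ISTA iteration.
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    def __init__(self, signal_dim: int, code_dim: int, n_layers: int = 5):
        super().__init__()
        self.n_layers = n_layers
        # W_e plays the role of (1/L) * D^T; S plays the role of I - (1/L) * D^T D.
        self.W_e = nn.Linear(signal_dim, code_dim, bias=False)
        self.S = nn.Linear(code_dim, code_dim, bias=False)
        # One learnable soft-threshold per unrolled iteration.
        self.theta = nn.Parameter(torch.full((n_layers,), 0.1))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        b = self.W_e(y)                  # encoder term, computed once
        x = torch.zeros_like(b)          # initial sparse-code estimate
        for k in range(self.n_layers):   # unrolled iterations = network depth
            z = b + self.S(x)
            x = torch.sign(z) * torch.relu(torch.abs(z) - self.theta[k])  # soft-threshold
        return x

# Example: encode a batch of 32 signals of dimension 64 into 128-dim sparse codes.
model = UnrolledISTA(signal_dim=64, code_dim=128)
codes = model(torch.randn(32, 64))
```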
Together with my amazing colleagues, I have applied these techniques to fields such as embodied AI (Habi, Diffusion Veteran) and AI for Science (Omni-DNA, MIMSID, MuDM, GraphormerV2).
In my spare time, I contribute to community projects on efficient LLM training on low-resource GPUs:
- BlockOptimizers | Full-parameter finetuning of 8B models on an RTX 3090 and 70B models on 4 A100s (a block-wise update sketch follows this list)
- LMMs-Engine | High-performance any-to-any modality model training framework.
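To give a sense of how full-parameter finetuning can fit into a small GPU memory budget, below is a rough sketch of the block-wise update idea (a simplification of mine in the spirit of block coordinate descent, not the actual BlockOptimizers implementation): only one block of layers carries optimizer state and receives updates at a time, so Adam memory scales with a block rather than with the whole model.

```python
# Rough sketch only, not the BlockOptimizers code: cycle through parameter
# blocks, keeping Adam state for just the currently active block.
import torch

def train_blockwise(model_blocks, compute_loss, data_loader,
                    steps_per_block=50, lr=1e-5):
    """model_blocks: list of nn.Module blocks (e.g., transformer layers).
    compute_loss: callable mapping a batch to a scalar loss (full forward pass)."""
    for active in model_blocks:                  # cycle through parameter blocks
        for block in model_blocks:               # freeze all blocks except the active one
            for p in block.parameters():
                p.requires_grad_(block is active)
        # Adam moments are allocated only for the active block's parameters.
        opt = torch.optim.AdamW(
            (p for p in active.parameters() if p.requires_grad), lr=lr)
        for _, batch in zip(range(steps_per_block), data_loader):
            loss = compute_loss(batch)           # forward through the whole model
            loss.backward()                      # gradients only for the active block
            opt.step()
            opt.zero_grad(set_to_none=True)
        del opt                                  # release this block's optimizer state
```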
I also write educational blog posts to share technical insights; these have accumulated more than 10k followers and 10k favorites:
- Lectures on Triton Programming
- Notes on Statistical Machine Learning
- Notes on Graph Neural Networks
- Notes on Navier-Stokes Equations
- Notes on Non-convex Optimization