Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
- Chuanyang Zheng
- Jiankai Sun
- Yihang Gao
- Enze Xie
- Yuehao Wang
- Peihao Wang
- Ting Xu
- Matthew Chang
- Liliang Ren
- Jingyao Li
- Jing Xiong
- Kashif Rasul
- Mac Schwager
- Anderson Schneider
- Zhangyang Wang
- Yuriy Nevmyvaka
Mixture-of-Experts (MoE) has become a cornerstone of recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on \(\mathrm{Softmax}\) as the router score function to aggregate expert outputs, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using \(\mathrm{Softmax}\) to project router weights onto a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \(\textbf{zero-additional-cost}\) Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to \(\mathrm{Softmax}\). We demonstrate that this router generalizes both \(\mathrm{Sigmoid}\)- and \(\mathrm{Softmax}\)-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of \(\mathrm{ReLU}\) activation and \(\ell_2\)-normalization in the \(\mathrm{KERN}\) router function. Comprehensive experiments on MoE models and LLMs validate the effectiveness of the proposed FFN-style router function \(\mathrm{KERN}\).
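To make the contrast concrete, below is a minimal PyTorch sketch of the two router score functions described in the abstract: the conventional \(\mathrm{Softmax}\) router and an FFN-style router that applies \(\mathrm{ReLU}\) followed by \(\ell_2\)-normalization. The function names and the exact normalization (including the `eps` stabilizer) are illustrative assumptions, not the paper's definitive implementation; the precise \(\mathrm{KERN}\) formulation is given in the paper itself.

```python
import torch
import torch.nn.functional as F

def softmax_router(logits: torch.Tensor) -> torch.Tensor:
    """Conventional MoE router: project logits onto the probability simplex."""
    return F.softmax(logits, dim=-1)

def kern_style_router(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """FFN-style router sketch: ReLU activation followed by l2-normalization.

    Illustrative reading of the abstract's recommendation (ReLU + l2-norm);
    the exact KERN formulation may differ.
    """
    scores = F.relu(logits)  # non-negative expert scores, like FFN hidden activations
    return scores / (scores.norm(p=2, dim=-1, keepdim=True) + eps)  # l2-normalize per token

# Toy comparison: 4 tokens routed over 8 experts.
logits = torch.randn(4, 8)
print(softmax_router(logits).sum(dim=-1))      # each row sums to 1 (probability simplex)
print(kern_style_router(logits).norm(dim=-1))  # each row has (near-)unit l2 norm
```

Note that the ReLU step can zero out experts entirely, while the \(\ell_2\)-normalization rescales the surviving scores without forcing them to sum to one, which is the sense in which this family departs from the \(\mathrm{Softmax}\) simplex constraint.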