Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
- Chuanyang Zheng
- Jiankai Sun
- Yihang Gao
- Enze Xie
- Yuehao Wang
- Peihao Wang
- Ting Xu
- Matthew Chang
- Liliang Ren
- Jingyao Li
- Jing Xiong
- Kashif Rasul
- Mac Schwager
- Anderson Schneider
- Zhangyang Wang
- Yuriy Nevmyvaka
Mixture-of-Experts (MoE) has become a cornerstone of recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on \(\mathrm{Softmax}\) as the router score function to aggregate expert outputs, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using \(\mathrm{Softmax}\) to project router weights onto a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \(\textbf{zero-additional-cost}\) Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to \(\mathrm{Softmax}\). We demonstrate that this router generalizes both \(\mathrm{Sigmoid}\)- and \(\mathrm{Softmax}\)-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of \(\mathrm{ReLU}\) activation and \(\ell_2\)-normalization in the \(\mathrm{KERN}\) router function. Comprehensive experiments on MoE models and LLMs validate the effectiveness of the proposed FFN-style router function \(\mathrm{KERN}\).
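To make the contrast concrete, below is a minimal PyTorch sketch of the two router score functions described in the abstract: the conventional \(\mathrm{Softmax}\) router and an FFN-style router that applies \(\mathrm{ReLU}\) followed by \(\ell_2\)-normalization. The function names and the exact normalization (including the `eps` stabilizer) are illustrative assumptions, not the paper's definitive implementation; the precise \(\mathrm{KERN}\) formulation is given in the paper itself.

```python
import torch
import torch.nn.functional as F

def softmax_router(logits: torch.Tensor) -> torch.Tensor:
    """Conventional MoE router: project logits onto the probability simplex."""
    return F.softmax(logits, dim=-1)

def kern_style_router(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """FFN-style router sketch: ReLU activation followed by l2-normalization.

    Illustrative reading of the abstract's recommendation (ReLU + l2-norm);
    the exact KERN formulation may differ.
    """
    scores = F.relu(logits)  # non-negative expert scores, like FFN hidden activations
    return scores / (scores.norm(p=2, dim=-1, keepdim=True) + eps)  # l2-normalize per token

# Toy comparison: 4 tokens routed over 8 experts.
logits = torch.randn(4, 8)
print(softmax_router(logits).sum(dim=-1))      # each row sums to 1 (probability simplex)
print(kern_style_router(logits).norm(dim=-1))  # each row has (near-)unit l2 norm
```

Note that the ReLU step can zero out experts entirely, while the \(\ell_2\)-normalization rescales the surviving scores without forcing them to sum to one, which is the sense in which this family departs from the \(\mathrm{Softmax}\) simplex constraint.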