Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

  • Chuanyang Zheng,
  • Jiankai Sun,
  • Yihang Gao,
  • Enze Xie,
  • Yuehao Wang,
  • Peihao Wang,
  • Ting Xu,
  • Matthew Chang,
  • Jingyao Li,
  • Jing Xiong,
  • Kashif Rasul,
  • Mac Schwager,
  • Anderson Schneider,
  • Zhangyang Wang,
  • Yuriy Nevmyvaka

ICLR 2026

Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on \(\mathrm{Softmax}\) as the router score function to aggregate expert outputs, a design choice that has persisted from the earliest MoE models to modern LLMs and is now widely regarded as standard practice. However, the necessity of using \(\mathrm{Softmax}\) to project router weights onto a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \(\textbf{zero-additional-cost}\) Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to \(\mathrm{Softmax}\). We demonstrate that this router generalizes both \(\mathrm{Sigmoid}\)- and \(\mathrm{Softmax}\)-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of \(\mathrm{ReLU}\) activation and \(\ell_2\)-normalization in the \(\mathrm{KERN}\) router function. Comprehensive experiments on MoE and LLM benchmarks validate the effectiveness of the proposed FFN-style router function \(\mathrm{KERN}\).
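For reference, the Nadaraya-Watson estimator aggregates targets \(y_i\) with normalized kernel weights, \(\hat{f}(x) = \sum_i \frac{K(x, x_i)}{\sum_j K(x, x_j)}\, y_i\), which mirrors how an MoE layer aggregates expert outputs with normalized router scores. The sketch below is only an illustration of that parallel, assuming the KERN-style scores are computed as \(\mathrm{ReLU}\) over the router logits followed by \(\ell_2\)-normalization across experts, as the abstract suggests; the exact KERN formulation (e.g., top-k expert selection and any normalization constants) is given in the paper, and the function names here (`kern_router_scores`, `softmax_router_scores`, `aggregate`) are hypothetical.

```python
import torch
import torch.nn.functional as F


def kern_router_scores(router_logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Illustrative KERN-style router (assumption, not the paper's exact code):
    apply ReLU to the router logits, then l2-normalize over the expert dimension
    instead of taking a Softmax. Shape: (batch, num_experts)."""
    activated = F.relu(router_logits)                     # FFN-style activation
    return F.normalize(activated, p=2, dim=-1, eps=eps)   # l2-normalization over experts


def softmax_router_scores(router_logits: torch.Tensor) -> torch.Tensor:
    """Conventional Softmax router used in standard MoE layers, for comparison."""
    return F.softmax(router_logits, dim=-1)


def aggregate(expert_outputs: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Weighted sum of expert outputs, analogous to the Nadaraya-Watson estimator.
    expert_outputs: (batch, num_experts, hidden), scores: (batch, num_experts)."""
    return torch.einsum("be,beh->bh", scores, expert_outputs)


# Usage sketch with random tensors:
logits = torch.randn(4, 8)              # 4 tokens, 8 experts
outputs = torch.randn(4, 8, 16)         # per-expert outputs of width 16
y = aggregate(outputs, kern_router_scores(logits))
```

Because the router logits are already computed in a standard MoE layer, swapping \(\mathrm{Softmax}\) for a ReLU-plus-\(\ell_2\)-normalization score function of this kind adds no extra parameters or FLOPs of consequence, which is consistent with the zero-additional-cost claim above.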