Perfect Recall, Parallel Efficiency: Multi-Head Latent Attention for Million-Token-Context Decoding
- Yifan Guo
- Wei Cui
- Peng Cheng
SPIGM @ ICML 2026
Token-level dynamic sparse attention, exemplified by DeepSeek Sparse Attention (DSA), selects the globally most relevant key-value tokens via an exact Top-$K$ operator, achieving better model quality than block-level alternatives. However, this exact selection creates a severe bottleneck for distributed inference: enforcing an exact global Top-$K$ across GPUs incurs either redundant full-context retrieval or costly multi-stage cross-device synchronization, which largely negates the computational advantages of DSA at long context lengths. Motivated by the mathematical properties of the softmax function, we hypothesize that including additional, marginally relevant context has negligible impact on the attention output. Building on this hypothesis, we propose Interleaved DeepSeek Sparse Attention (IDSA), which distributes tokens across GPUs in an interleaved layout so that each device performs only a relaxed local Top-$m$ selection. Under this layout, the union of the independent per-GPU Top-$m$ selections covers nearly all of the globally most relevant Top-$K$ tokens. Each device can therefore proceed with its local selection at minimal cross-GPU overhead, avoiding both expensive full-context Top-$K$ computation and multi-stage cross-GPU merging, and yielding an inference pipeline that is both distributed and synchronization-efficient.
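
The coverage idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`interleaved_shards`, `local_top_m`, `coverage`), the round-robin assignment of token `i` to GPU `i mod G`, and the use of a plain score vector in place of DSA's relevance scores are all assumptions made for illustration. The sketch measures what fraction of the exact global Top-$K$ tokens appears in the union of per-GPU local Top-$m$ selections.

```python
import numpy as np


def interleaved_shards(num_tokens, num_gpus):
    """Assumed layout: token i is placed on GPU (i mod num_gpus)."""
    return [np.arange(g, num_tokens, num_gpus) for g in range(num_gpus)]


def local_top_m(scores, shard, m):
    """Global indices of the m highest-scoring tokens held by one shard."""
    m = min(m, shard.size)
    local_scores = scores[shard]
    # argpartition places the m largest entries in the last m slots.
    top_local = np.argpartition(local_scores, -m)[-m:]
    return shard[top_local]


def coverage(scores, num_gpus, k, m):
    """Fraction of the exact global Top-k found in the union of local Top-m."""
    global_top = set(np.argsort(scores)[-k:].tolist())
    selected = set()
    for shard in interleaved_shards(scores.size, num_gpus):
        selected.update(local_top_m(scores, shard, m).tolist())
    return len(global_top & selected) / k


# Illustrative case: the 8 most relevant tokens sit in one contiguous span.
scores = np.zeros(64)
scores[20:28] = np.arange(1, 9)
print(coverage(scores, num_gpus=4, k=8, m=2))
```

In this toy case the relevant tokens are contiguous, so the interleaved layout spreads them evenly and each of the 4 shards holds exactly 2 of them; a local budget of $m = K/G = 2$ already recovers the full global Top-$K$. For less evenly spread scores, a larger $m$ would be needed, and $m = K$ always gives full coverage when scores are distinct.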