Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

  • Yifan Pu ,
  • Jixuan Ying ,
  • Tianzhu Ye ,
  • Dongchen Han ,
  • Ziyi Wang ,
  • Qixiu Li ,
  • Xinyu Shao ,
  • Xiaochen Wang ,
  • Gao Huang ,
  • Xiu Li

NeurIPS 2025

Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation, yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of its computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from quadratic to linear in the number of tokens. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and a learnable negative stream whose differential interaction highlights what truly separates one region from another. The module adds a negligible number of parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K, improves three strong hierarchical ViTs, and, in class-conditional ImageNet generation, lowers FID-50K across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers.
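
Since the abstract gives only the high-level recipe, the following PyTorch sketch is an assumption-based, single-head illustration of the described pipeline: spatially pool the query field into a few visual-contrast tokens, form a positive and a negative stream via dual positional embeddings, and let their differential interaction re-weight the values. The module name, the pooling scheme, and the way the differential summaries are folded back into the token features are hypothetical choices for illustration, not the paper's exact design.

```python
# Hypothetical sketch of Visual-Contrast Attention (VCA); details are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualContrastAttention(nn.Module):
    """Single-head sketch: pool queries into n contrast tokens, build a
    positive and a negative stream, and use their differential attention."""

    def __init__(self, dim: int, num_contrast_tokens: int = 4):
        super().__init__()
        self.n = num_contrast_tokens  # n << N, giving O(N * n) interactions
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        # Dual positional embeddings, one per contrastive stream (assumed form).
        self.pos_embed_pos = nn.Parameter(torch.zeros(1, self.n, dim))
        self.pos_embed_neg = nn.Parameter(torch.zeros(1, self.n, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N = H * W spatial tokens (class token omitted).
        B, N, C = x.shape
        H = W = int(math.sqrt(N))
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Spatially pool the dense query field into n visual-contrast tokens.
        q_grid = q.transpose(1, 2).reshape(B, C, H, W)
        side = int(math.sqrt(self.n))
        contrast = F.adaptive_avg_pool2d(q_grid, side)           # (B, C, s, s)
        contrast = contrast.flatten(2).transpose(1, 2)           # (B, n, C)

        # Positive / negative streams differ by their positional embedding.
        c_pos = contrast + self.pos_embed_pos
        c_neg = contrast + self.pos_embed_neg

        scale = C ** -0.5
        attn_pos = (c_pos @ k.transpose(-2, -1)) * scale         # (B, n, N)
        attn_neg = (c_neg @ k.transpose(-2, -1)) * scale

        # Differential interaction: what separates the two streams.
        diff = attn_pos.softmax(dim=-1) - attn_neg.softmax(dim=-1)

        # Aggregate values with the differential weights, then let the N
        # tokens attend back to the n contrast summaries (assumed mixing).
        summaries = diff @ v                                      # (B, n, C)
        token_attn = (q @ summaries.transpose(-2, -1)) * scale    # (B, N, n)
        out = token_attn.softmax(dim=-1) @ summaries              # (B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 192)   # DeiT-Tiny-like: 14x14 tokens, dim 192
    vca = VisualContrastAttention(dim=192, num_contrast_tokens=4)
    print(vca(x).shape)            # torch.Size([2, 196, 192])
```

Because every dense token only interacts with the n pooled contrast tokens (and vice versa), the attention cost in this sketch scales linearly with the token count, matching the "linear" claim in the title under the assumption n stays fixed.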