Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
- Yifan Pu,
- Jixuan Ying,
- Tianzhu Ye,
- Dongchen Han,
- Ziyi Wang,
- Qixiu Li,
- Xinyu Shao,
- Xiaochen Wang,
- Gao Huang,
- Xiu Li
NeurIPS 2025
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for \emph{every} token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce \emph{Visual-Contrast Attention} (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(Nn)$ with $n \ll N$. VCA first distils each head's dense query field into a handful of spatially pooled \emph{visual-contrast tokens}, then splits them into a learnable \emph{positive} and a \emph{negative} stream whose differential interaction highlights what truly separates one region from another. The module adds only a negligible number of extra parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K and improves three strong hierarchical ViTs, while in class-conditional ImageNet generation it lowers FID-50K across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers.
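To make the mechanism concrete, the following is a minimal PyTorch-style sketch of how a VCA-like block could be wired up, based only on the description above. It is an illustration under stated assumptions, not the authors' released implementation: the class name `VisualContrastAttention`, the `num_contrast_tokens` argument, the use of adaptive average pooling to distil the query field, and the softmax-difference formulation of the positive/negative streams are all assumptions made for readability.

```python
# Hypothetical sketch of a Visual-Contrast Attention (VCA) block.
# All names and design details below are illustrative assumptions,
# not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualContrastAttention(nn.Module):
    """A handful of pooled "visual-contrast" tokens mediate all token
    interactions, so no N x N attention map is ever built."""

    def __init__(self, dim, num_heads=3, num_contrast_tokens=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        self.d = dim // num_heads
        self.n = num_contrast_tokens
        self.scale = self.d ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Dual positional embeddings define the positive / negative streams.
        self.pe_pos = nn.Parameter(torch.randn(1, self.h, self.n, self.d) * 0.02)
        self.pe_neg = nn.Parameter(torch.randn(1, self.h, self.n, self.d) * 0.02)

    def _pool(self, t):
        # Spatially pool the per-head token field down to n contrast tokens.
        B, H, N, d = t.shape
        t = t.reshape(B * H, N, d).transpose(1, 2)             # (B*H, d, N)
        t = F.adaptive_avg_pool1d(t, self.n).transpose(1, 2)   # (B*H, n, d)
        return t.reshape(B, H, self.n, d)

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)

        # Stage 1: contrast tokens distilled from the query field gather global
        # context from all keys/values via a differential (pos - neg) attention.
        c = self._pool(q)                                       # (B, H, n, d)
        a_pos = ((c + self.pe_pos) @ k.transpose(-2, -1)) * self.scale
        a_neg = ((c + self.pe_neg) @ k.transpose(-2, -1)) * self.scale
        ctx = (a_pos.softmax(-1) - a_neg.softmax(-1)) @ v       # (B, H, n, d)

        # Stage 2: every original query reads from the n context tokens only,
        # so both stages cost O(N * n) rather than O(N^2).
        attn = (q @ ctx.transpose(-2, -1)) * self.scale         # (B, H, N, n)
        out = attn.softmax(-1) @ ctx                            # (B, H, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 192)   # DeiT-Tiny-like token sequence
    blk = VisualContrastAttention(dim=192, num_heads=3)
    print(blk(x).shape)            # -> torch.Size([2, 196, 192])
```

Because the queries only ever interact with $n \ll N$ pooled contrast tokens, every attention map in this sketch is $N \times n$ or $n \times N$ rather than $N \times N$, which is where the linear-in-$N$ complexity claimed in the abstract comes from.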