Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
- Yifan Pu,
- Jixuan Ying,
- Tianzhu Ye,
- Dongchen Han,
- Ziyi Wang,
- Qixiu Li,
- Xinyu Shao,
- Xiaochen Wang,
- Gao Huang,
- Xiu Li
NeurIPS 2025
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for \emph{every} token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce \emph{Visual-Contrast Attention} (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(Nn)$ with $n \ll N$. VCA first distils each head's dense query field into a handful of spatially pooled \emph{visual-contrast tokens}, then splits them into a learnable \emph{positive} and a \emph{negative} stream whose differential interaction highlights what truly separates one region from another. The module adds only a negligible number of extra parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K and improves three strong hierarchical ViTs, while in class-conditional ImageNet generation it lowers FID-50K across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers.
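To make the mechanism concrete, the following is a minimal PyTorch-style sketch of how a VCA-like block could be wired up, based only on the description above. It is an illustration under stated assumptions, not the authors' released implementation: the class name `VisualContrastAttention`, the `num_contrast_tokens` argument, the use of adaptive average pooling to distil the query field, and the softmax-difference formulation of the positive/negative streams are all assumptions made for readability.

```python
# Hypothetical sketch of a Visual-Contrast Attention (VCA) block.
# All names and design details below are illustrative assumptions,
# not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualContrastAttention(nn.Module):
    """A handful of pooled "visual-contrast" tokens mediate all token
    interactions, so no N x N attention map is ever built."""

    def __init__(self, dim, num_heads=3, num_contrast_tokens=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        self.d = dim // num_heads
        self.n = num_contrast_tokens
        self.scale = self.d ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Dual positional embeddings define the positive / negative streams.
        self.pe_pos = nn.Parameter(torch.randn(1, self.h, self.n, self.d) * 0.02)
        self.pe_neg = nn.Parameter(torch.randn(1, self.h, self.n, self.d) * 0.02)

    def _pool(self, t):
        # Spatially pool the per-head token field down to n contrast tokens.
        B, H, N, d = t.shape
        t = t.reshape(B * H, N, d).transpose(1, 2)             # (B*H, d, N)
        t = F.adaptive_avg_pool1d(t, self.n).transpose(1, 2)   # (B*H, n, d)
        return t.reshape(B, H, self.n, d)

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)

        # Stage 1: contrast tokens distilled from the query field gather global
        # context from all keys/values via a differential (pos - neg) attention.
        c = self._pool(q)                                       # (B, H, n, d)
        a_pos = ((c + self.pe_pos) @ k.transpose(-2, -1)) * self.scale
        a_neg = ((c + self.pe_neg) @ k.transpose(-2, -1)) * self.scale
        ctx = (a_pos.softmax(-1) - a_neg.softmax(-1)) @ v       # (B, H, n, d)

        # Stage 2: every original query reads from the n context tokens only,
        # so both stages cost O(N * n) rather than O(N^2).
        attn = (q @ ctx.transpose(-2, -1)) * self.scale         # (B, H, N, n)
        out = attn.softmax(-1) @ ctx                            # (B, H, N, d)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 196, 192)   # DeiT-Tiny-like token sequence
    blk = VisualContrastAttention(dim=192, num_heads=3)
    print(blk(x).shape)            # -> torch.Size([2, 196, 192])
```

Because the queries only ever interact with $n \ll N$ pooled contrast tokens, every attention map in this sketch is $N \times n$ or $n \times N$ rather than $N \times N$, which is where the linear-in-$N$ complexity claimed in the abstract comes from.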