Improving Vision Transformers with Nested Multi-head Attentions

  • Jiquan Peng,
  • Chaozhuo Li,
  • Yi Zhao,
  • Yuting Lin,
  • Xiaohan Fang,
  • Jibing Gong

ICME

Vision transformers have significantly advanced the field of computer vision in recent years. The cornerstone of vision transformers is the multi-head attention mechanism, which models the interactions between visual elements within a feature map. However, the vanilla multi-head attention paradigm learns the parameters of each head independently. The crucial interactions across different attention heads are ignored, leading to redundancy and under-utilization of the model's capacity. To enhance model expressiveness, we propose a novel nested attention mechanism, Ne-Att, which explicitly models cross-head interactions via a hierarchical variational distribution. Extensive experiments are conducted on image classification, and the results demonstrate the superiority of Ne-Att. Our code is available at \url{https://anonymous.4open.science/r/Anonymization-EEBD/}.
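To make the contrast concrete, the sketch below shows vanilla multi-head self-attention, in which each head is computed from its own projections with no cross-head coupling, plus an optional `head_mix` matrix as a toy stand-in for cross-head interaction. This is a simplified illustration only; the names `multi_head_attention` and `head_mix` are our assumptions, and the actual Ne-Att method uses a hierarchical variational distribution rather than a fixed mixing matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, head_mix=None):
    """Simplified multi-head self-attention (illustrative, not Ne-Att).

    x: (seq, d_model) token features.
    Wq, Wk, Wv: (heads, d_model, d_head) per-head projection weights,
        learned independently in the vanilla paradigm.
    head_mix: optional (heads, heads) matrix that lets each head's output
        depend on the others -- a toy stand-in for cross-head interaction.
    """
    heads, d_model, d_head = Wq.shape
    outs = []
    for h in range(heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))  # (seq, seq) per head
        outs.append(attn @ v)                      # (seq, d_head)
    outs = np.stack(outs)                          # (heads, seq, d_head)
    if head_mix is not None:
        # Mix information across heads before concatenation.
        outs = np.einsum('gh,hsd->gsd', head_mix, outs)
    # Concatenate head outputs along the feature dimension.
    return outs.transpose(1, 0, 2).reshape(x.shape[0], heads * d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(2, 8, 4)) for _ in range(3))
y = multi_head_attention(x, Wq, Wk, Wv, head_mix=np.eye(2))
```

With an identity `head_mix` the result reduces to vanilla multi-head attention, which makes the independence of the heads in the standard formulation explicit.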