Improving Vision Transformers with Nested Multi-head Attentions
- Jiquan Peng ,
- Chaozhuo Li ,
- Yi Zhao ,
- Yuting Lin ,
- Xiaohan Fang ,
- Jibing Gong
ICME
Vision transformers have significantly advanced the field of computer vision in recent years. Their cornerstone is the multi-head attention mechanism, which models the interactions among the visual elements within a feature map. However, the vanilla multi-head attention paradigm learns the parameters of each head independently, ignoring the crucial interactions across attention heads and leading to redundancy and under-utilization of the model's capacity. To enhance model expressiveness, we propose Ne-Att, a novel nested attention mechanism that explicitly models cross-head interactions via a hierarchical variational distribution. Extensive experiments on image classification demonstrate the superiority of Ne-Att. Our code is available at \url{https://anonymous.4open.science/r/Anonymization-EEBD/}.
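To make the motivating observation concrete, the sketch below shows vanilla multi-head attention in plain NumPy: each head owns its own query/key/value projections, and no term in the computation couples the parameters of one head to another. This is an illustrative sketch of the standard mechanism the abstract critiques, not the authors' Ne-Att method; all array shapes and variable names here are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    """Vanilla multi-head attention over tokens x of shape (n, d).

    Wq, Wk, Wv have shape (heads, d, d_head): each head's projections
    are separate, independently learned parameters -- nothing in this
    computation shares or couples information across heads, which is
    the redundancy Ne-Att targets.
    """
    heads = []
    for q_w, k_w, v_w in zip(Wq, Wk, Wv):
        q, k, v = x @ q_w, x @ k_w, x @ v_w
        scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product
        heads.append(softmax(scores) @ v)          # per-head output
    # heads interact only through this final output projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d, h, dh = 4, 8, 2, 4   # tokens, model dim, heads, head dim
x = rng.normal(size=(n, d))
Wq = rng.normal(size=(h, d, dh))
Wk = rng.normal(size=(h, d, dh))
Wv = rng.normal(size=(h, d, dh))
Wo = rng.normal(size=(h * dh, d))
out = multi_head_attention(x, Wq, Wk, Wv, Wo)
print(out.shape)  # (4, 8)
```

Note that the only place the heads meet is the final concatenation and output projection; Ne-Att's hierarchical variational distribution instead ties the per-head parameters together explicitly.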