Improving Vision Transformers with Nested Multi-head Attentions

  • Jiquan Peng,
  • Chaozhuo Li,
  • Yi Zhao,
  • Yuting Lin,
  • Xiaohan Fang,
  • Jibing Gong

ICME

Vision transformers have significantly advanced the field of computer vision in recent years. The cornerstone of vision transformers is the multi-head attention mechanism, which models the interactions between visual elements within a feature map. However, the vanilla multi-head attention paradigm learns the parameters of each head independently. The crucial interactions across different attention heads are ignored, leading to redundancy and under-utilization of the model's capacity. To enhance model expressiveness, we propose a novel nested attention mechanism, Ne-Att, which explicitly models cross-head interactions via a hierarchical variational distribution. Extensive experiments are conducted on image classification, and the results demonstrate the superiority of Ne-Att. Our code is available at \url{https://anonymous.4open.science/r/Anonymization-EEBD/}.
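To make the contrast concrete, the sketch below shows vanilla multi-head self-attention, in which each head is computed from its own projections with no cross-head coupling, plus an optional `head_mix` matrix as a toy stand-in for cross-head interaction. This is a simplified illustration only; the names `multi_head_attention` and `head_mix` are our assumptions, and the actual Ne-Att method uses a hierarchical variational distribution rather than a fixed mixing matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, head_mix=None):
    """Simplified multi-head self-attention (illustrative, not Ne-Att).

    x: (seq, d_model) token features.
    Wq, Wk, Wv: (heads, d_model, d_head) per-head projection weights,
        learned independently in the vanilla paradigm.
    head_mix: optional (heads, heads) matrix that lets each head's output
        depend on the others -- a toy stand-in for cross-head interaction.
    """
    heads, d_model, d_head = Wq.shape
    outs = []
    for h in range(heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))  # (seq, seq) per head
        outs.append(attn @ v)                      # (seq, d_head)
    outs = np.stack(outs)                          # (heads, seq, d_head)
    if head_mix is not None:
        # Mix information across heads before concatenation.
        outs = np.einsum('gh,hsd->gsd', head_mix, outs)
    # Concatenate head outputs along the feature dimension.
    return outs.transpose(1, 0, 2).reshape(x.shape[0], heads * d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(2, 8, 4)) for _ in range(3))
y = multi_head_attention(x, Wq, Wk, Wv, head_mix=np.eye(2))
```

With an identity `head_mix` the result reduces to vanilla multi-head attention, which makes the independence of the heads in the standard formulation explicit.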