

ICLR 2022 Spotlight: Demystifying local attention and dynamic depth-wise convolution


In the past two years, numerous papers have been written on Transformers, and researchers have designed Transformer models for all kinds of tasks. But is attention, the core module of the Transformer, really stronger than convolution? This paper may bring you a new perspective. Researchers from Microsoft Research Asia have looked into local attention and dynamic depth-wise convolution and found that a common convolution structure is in fact no worse than the Transformer. The related paper, “On the Connection between Local Attention and Dynamic Depth-wise Convolution,” was published at ICLR 2022, and the code is available on GitHub.

Paper: https://arxiv.org/abs/2106.04263

Code: https://github.com/Atten4Vis/DemystifyLocalViT

What is local attention?

ViT emerged in 2020 and quickly took over the field of model design. Various Transformer-based structures were proposed, and prior knowledge that had proven successful in convolutional neural networks, such as local operations, multi-scaling, and shuffling, as well as other inductive biases, was introduced into the Transformer. One of the more successful cases has been Microsoft Research Asia’s Swin Transformer. By introducing local operations into ViT, Swin Transformer used shifted windows to obtain SOTA results on multiple tasks and won the 2021 ICCV Best Paper Award, the Marr Prize. So what is the mystery behind local attention, the most central module in Swin Transformer?

Local attention is essentially a feature aggregation module operating within a 2D local window. The aggregation weight for each position is obtained by computing attention similarity between queries and keys (mainly dot product, scaling, and SoftMax). It is a dynamic local feature aggregation module whose aggregation weights are computed on the fly rather than stored as model parameters.

$$y_{i} = \sum_{j \in W_i} a_{ij}\, x_{j}, \qquad a_{ij} = \operatorname{SoftMax}_{j \in W_i}\!\left(\frac{q_i^{\top} k_j}{\sqrt{d}}\right)$$

where $a_{ij}$ is the aggregation weight and $x_{j}$ is the input feature within the local window $W_i$ around position $i$.
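To make the aggregation concrete, here is a minimal PyTorch sketch of window-based local attention. This is our illustration rather than the Swin Transformer implementation: `w_qkv`, the window size, and the head count are placeholder choices, and details such as relative position bias and window shifting are omitted.

```python
import torch
import torch.nn.functional as F

def local_window_attention(x, w_qkv, window=7, heads=4):
    """Aggregate features inside each non-overlapping window with
    SoftMax(QK^T / sqrt(d)) weights (dynamic, not model parameters)."""
    B, H, W, C = x.shape
    d = C // heads
    # partition the feature map into (window x window) local windows
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)  # (B*nW, N, C)
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)                           # each (B*nW, N, C)
    # split channels into heads: one set of aggregation weights is shared per head
    q, k, v = (t.view(-1, window * window, heads, d).transpose(1, 2) for t in (q, k, v))
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)     # dynamic weights a_ij
    out = (attn @ v).transpose(1, 2).reshape(-1, window * window, C) # aggregate values with a_ij
    return out

# toy usage: a 14x14 feature map with 96 channels, split into 7x7 windows
x = torch.randn(1, 14, 14, 96)
w_qkv = torch.randn(96, 3 * 96) * 96 ** -0.5
y = local_window_attention(x, w_qkv)   # shape: (num_windows, 49, 96)
```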

The previous version of this ICLR paper, “Demystifying Local Vision Transformer,” was first published on arXiv in June 2021 and analyzed three powerful design principles behind local attention:

(1) Sparse connectivity – This means that some output variables are not connected to some input variables, which effectively reduces the complexity of the model without reducing the number of input and output variables. In local attention, sparse connectivity shows up in two ways. First, in the spatial dimension, each output value is connected only to the inputs within its local window, unlike the full pixel (token) connectivity of ViT. Second, in the channel dimension, each output channel is connected only to a single input channel with no cross-channel connections, unlike group convolution and normal convolution.

(2) Weight sharing – This means that some connections share the same weights, which reduces the number of parameters in the model and strengthens the model without requiring more training data. When a weight is shared within a model, each shared use can be regarded as an additional training sample for that weight, which helps optimization. In local attention, weight sharing is realized through multi-head self-attention: the channels are divided into heads (groups), and a single set of aggregation weights is shared within each head, reducing the number of aggregation weights (which are not model parameters).

(3) Dynamic weight – Connection weights are generated dynamically according to the characteristics of each sample, which increases the capacity of the model. If connection weights are viewed as hidden-layer variables, then these dynamic weights can be viewed as a second-order operation that increases model capacity. In local attention, the dynamic weight is reflected in the fact that the aggregation weight of each connection is computed from sample features using a dot-product-based method.

Local attention is able to yield excellent results because of the three model design principles listed above. However, these characteristics also exist naturally in CNN structures, especially in dynamic depth-wise convolution.

The connection between dynamic depth-wise convolution and local attention

If you gradually dismantle the operations of local attention, you’ll find that, along the three dimensions of sparse connectivity, weight sharing, and dynamic weight, it is very similar to depth-wise convolution, a long-standing component of high-performing CNN structures. Depth-wise convolution has been in use for a long time, so how does it measure up against these design principles?

(1) Sparse connectivity. It’s easy to see that the sparse connectivity characteristics of depth-wise convolution are exactly the same as those of local attention: it is locally connected in the spatial dimension and sparsely connected in the channel dimension.

(2) Weight sharing. The concept of weight sharing originated with the convolution operation. Depth-wise convolution also benefits from weight sharing, though in a slightly different way from local attention: it shares weights in the spatial dimension, where every position is aggregated with the same convolution kernel weights, while in the channel dimension each channel uses an independent aggregation weight (see the code sketch after this list).

(3) Dynamic weight. Dynamic weights were not used in the original depth-wise convolution, but dynamic convolution can easily introduce this characteristic into depth-wise convolution to form feature-dependent aggregation weights.
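For comparison, a depth-wise convolution is a one-liner in PyTorch: setting `groups` equal to the number of channels yields the channel-wise sparse connectivity, and the same kernel is reused at every spatial position. The sizes below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

C = 96  # illustrative channel count

# groups=C: each output channel connects to exactly one input channel
# (sparse connectivity); the 7x7 kernel is shared across all spatial
# positions (weight sharing), with an independent kernel per channel.
dw_conv = nn.Conv2d(C, C, kernel_size=7, padding=3, groups=C, bias=False)

x = torch.randn(2, C, 56, 56)   # (batch, channels, height, width)
y = dw_conv(x)                  # -> torch.Size([2, 96, 56, 56])
```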

Although the two differ in how weights are shared, experimental results show that sharing weights along either the channel or the spatial dimension has a relatively small impact; Local MLP (local attention without the dynamic characteristic) serves as an example of this. Weight sharing reduces the number of parameters in the model and helps with model optimization. As for dynamic weights, although the two differ, depth-wise convolution can still easily be equipped with dynamic characteristics.

Table 1: Comparison of different structures on sparse connectivity, weight sharing, and dynamic weight. D-DW-Conv. refers to dynamic depth-wise convolution.

The performance of depth-wise convolution

If the design principles of depth-wise convolution and local attention are so similar, then why does local attention achieve such high performance while depth-wise convolution does not? To answer this question, the researchers replaced all of the local attention modules in Swin Transformer with depth-wise convolution, while keeping the other structures unchanged (pre-LN was changed to post-BN). At the same time, in order to verify the effect of dynamic depth-wise convolution, the researchers at Microsoft Research Asia constructed two dynamic depth-wise convolution variants:

(1) D-DW-Conv. The first dynamic depth-wise convolution adopts the same weight sharing scheme as ordinary depth-wise convolution: the convolution kernel is shared across spatial positions, and each channel has an independent kernel. The input features are processed with Global Average Pooling, and the dynamic convolution kernels are then predicted by an FC layer (see the sketch after this list).

(2) I-D-DW-Conv. The second dynamic depth-wise convolution adopts the same weight sharing scheme as local attention: each pixel (token) has its own aggregation weights, and the weights are shared within each channel head (group). This is called inhomogeneous dynamic depth-wise convolution.
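The following PyTorch sketch shows how the first variant (D-DW-Conv) can be read from the description above: Global Average Pooling summarizes the input, an FC layer predicts one k×k kernel per channel, and the predicted kernels are applied as a depth-wise convolution. The class name and details are our own simplification, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDepthwiseConv(nn.Module):
    """D-DW-Conv sketch: kernels shared over spatial positions, independent
    per channel, and predicted from the input via GAP + FC."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.channels, self.kernel_size = channels, kernel_size
        # FC layer mapping the pooled descriptor to one kernel per channel
        self.fc = nn.Linear(channels, channels * kernel_size * kernel_size)

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        k = self.kernel_size
        pooled = x.mean(dim=(2, 3))                     # Global Average Pooling -> (B, C)
        kernels = self.fc(pooled).view(B * C, 1, k, k)  # per-sample, per-channel kernels
        # grouped convolution applies each sample's own dynamic kernels
        out = F.conv2d(x.view(1, B * C, H, W), kernels, padding=k // 2, groups=B * C)
        return out.view(B, C, H, W)

layer = DynamicDepthwiseConv(channels=96)
y = layer(torch.randn(2, 96, 56, 56))                   # -> torch.Size([2, 96, 56, 56])
```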

Experimental results are shown below:

Table 2: Comparison of results on ImageNet1k, COCO, ADE20K

For the experiments covered in this paper, we used exactly the same training parameters and network structure as Swin Transformer. Depth-wise convolution achieved the same performance as Swin Transformer on ImageNet classification, COCO detection, and ADE20K semantic segmentation, and it required less computational cost.

Depth-wise convolution truly isn’t so bad! Some may ask whether local attention would have more of an advantage on larger models and larger datasets. Due to limited computing resources, the researchers only carried out experiments on the base model:

Table 3: ImageNet22k Pre-training

Results of pre-training on the large-scale ImageNet22k dataset show that depth-wise convolution is still comparable to local attention. Recent work such as ConvNeXt [1] and RepLKNet [2] has further confirmed this.

Why does modern convolution perform well, and how can we design a better model?

If depth-wise convolution performs so well, why did it not attract widespread attention for such a long time? Comparing against traditional CNNs, the researchers found that modern convolution designs generally satisfy the three design principles indicated in this paper. Meanwhile, Swin Transformer and other new structures use a larger kernel size, such as 7×7 or even 12×12, which is much larger than the 3×3 convolutions long used in CNNs.

Depth-wise convolution, combined with suitable dynamic characteristics and a large kernel size, coupled with modern network training configurations (data augmentation, optimization, and regularization strategies), is the power behind modern convolution.
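As a rough illustration of this recipe (our own sketch under these assumptions, not a reproduction of Swin Transformer, ConvNeXt, or RepLKNet), a modern convolutional block might combine a large-kernel depth-wise convolution with post-normalization and a point-wise MLP:

```python
import torch
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):
    """Illustrative block: 7x7 depth-wise conv for spatial mixing,
    BatchNorm, and a 1x1-conv MLP for channel mixing."""
    def __init__(self, dim, kernel_size=7, mlp_ratio=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),   # point-wise expansion
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),   # point-wise projection
        )

    def forward(self, x):
        x = x + self.norm(self.dwconv(x))         # large-kernel spatial aggregation
        x = x + self.mlp(x)                       # channel mixing
        return x

block = LargeKernelDWBlock(dim=96)
y = block(torch.randn(2, 96, 56, 56))             # -> torch.Size([2, 96, 56, 56])
```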

As for how to design better models, we should first analyze the commonalities among strong existing models. Figure 1 shows the sparsity characteristics of different model structures. The sparser the model, the easier it is to optimize during training, which leads to better performance, and the lower its computational complexity, which gives more freedom in building and designing network modules.

Figure 1: (a) convolution (b) global attention (c) local attention, DW convolution (d) 1×1 convolution (e) fully-connected MLP

In addition, the paper also presents a relation graph to illustrate how certain design principles evolved during the process of model structure design:

Figure 2: ViT and Local ViT refer to the attention building blocks in those models; PVT is a transformer with a pyramid structure that introduces spatial low rank; Dim. Sep. refers to sparsity in the channel dimension; Locality Sep. refers to sparsity in spatial position; LR refers to low rank; MS Conv. refers to multi-scale convolution.

In the relation graph, the sparsity regularization and the introduction of dynamic weights are successively strengthened from top to bottom. As the regularization and dynamic weights increase, the inductive bias of the network also increases. This brings optimization benefits, making the network easier to train and yielding better results, as verified by the existing experimental results. In the end, this sparse and dynamic evolution points toward dynamic depth-wise convolution, which, combined with modern large-kernel training principles, can achieve better performance.

[1] Liu Z, Mao H, Wu C Y, et al. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545, 2022.

[2] Ding X, Zhang X, Zhou Y, et al. Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs. CVPR, 2022.