Factorized layers revisited: Compressing deep networks without playing the lottery


By , PhD student , Principal Research Engineer , Principal Researcher , Senior Principal Research Manager

From BiT (928 million parameters) to GPT-3 (175 billion parameters), state-of-the-art machine learning models are rapidly growing in size. With the greater expressivity and easier trainability of these models come skyrocketing training costs, deployment difficulties, and even climate impact. As a result, we’re witnessing exciting and emerging research into compressing these models to make them less expensive, small enough to store on any device, and more energy efficient. Perhaps the most popular approach to model compression is pruning, in which redundant model parameters are removed, leaving only a small subset of parameters, or a subnetwork. A major drawback of pruning, though, is it requires training a large model first, which is expensive and resource intensive.

Recent research around pruning has focused on the lottery ticket hypothesis (opens in new tab), which suggests that most parameters of large models are redundant even during training and that there exists a subnetwork responsible for most of the model’s performance. In other words, you can train this subnetwork alone and still obtain the same accuracy as if you had trained all the parameters—provided you are able to identify the subnetwork first. While such subnetworks—called lottery tickets—have been shown to exist by pruning large networks after training, intensive efforts to design methods that guess them before training, such as GraSP (opens in new tab)and hybrid tickets (opens in new tab), haven’t achieved the same performance.

Are sparsity-based methods such as pruning and guessing lottery tickets the right way to obtain compressed models? As part of our paper “Initialization and Regularization of Factorized Neural Layers (opens in new tab),” which we’re presenting at the International Conference on Learning Representations (opens in new tab) (ICLR 2021), we revisit the alternative compression approach of factorized neural layers (opens in new tab). While sparsity-based approaches remove parameters from the weight matrices of a network incrementally, factorization replaces the matrices with products of smaller matrices that are more efficient to store and compute. Despite the fact that factorized layers are more amenable to deployment on deep learning accelerators such as GPUs and software frameworks such as PyTorch, sparsity-based methods remain more popular because they’re seen as better at maintaining high accuracy at high compression rates (opens in new tab) (for example, 10 percent or less parameters remaining).

Spotlight: On-demand video

AI Explainer: Foundation models ​and the next era of AI

Explore how the transformer architecture, larger models and more data, and in-context learning have helped advance AI from perception to creation.

Our results contradict this position: we show that if we use the right initialization scheme (spectral initialization, or SI) and the right regularization penalty (Frobenius decay, or FD), we can achieve higher accuracy on three benchmark datasets by training a factorized ResNet from scratch than by pruning or guessing lottery tickets. The key principle underlying these two natural methods, neither of which requires extra hyperparameters, is that the training behavior of a factorized model should mimic that of the original (unfactorized) network.

We further demonstrate the usefulness of these schemes in two settings beyond model compression where factorized neural layers are applied. The first is an exciting new area of knowledge distillation in which an overcomplete factorization is used to replace the complicated and expensive student-teacher training phase with a single matrix multiplication at each layer. The second is for training Transformer-based architectures such as BERT (opens in new tab), which are popular models for learning over sequences like text and genomic data and whose multi-head self-attention mechanisms are also factorized neural layers.

Our work is part of Microsoft Research New England’s (opens in new tab) AutoML research efforts (opens in new tab), which seek to make the exploration and deployment of state-of-the-art machine learning easier through the development of models that help automate the complex processes involved. The code to our work is available at our GitHub repository (opens in new tab).

Factorized neural layers and where to find them

Deep networks are function approximators in which inputs are passed through a sequence of nonlinear transformations \(g(W,x)\), each modifying the previous layer’s output \(x\) using some mapping specified by a weight matrix \(W\)∈ \(\mathbb{R}\)\(^m\)\(^x\)\(^n\). The space and time complexity of each layer is typically tied directly to the dimensionality of this weight matrix; for example, standard implementations of fully connected and convolutional layers require \(O(mn)\) operations to compute \(g(W, x)\).

Two side-by-side depictions of a convolution applied to a 5-by-5 input. On the left, a 3-by-3 filter labeled “k-by-k convolution (c-by-c channels)” is applied to the 5-by-5 input to produce a 3-by-3 output. Next to it is a square labeled “W” described as a “weight tensor (reshaped) (c squared k squared parameters)”; it has side lengths ck and ck. An arrow points to two rectangles: a vertical rectangle labeled “U” and a horizontal rectangle labeled “V superscript T” described as a “factorized weight tensor (2ckr parameters)”; each rectangle has side lengths ck and r. To the right, a 1-by-3 filter labeled “1-by-k convolution (r-by-c channels)” is applied to the 5-by-5 input to produce a 5-by-3 output; applied to this input is a 3-by-1 filter labeled “k-by-1 convolution (c-by-r channels)” to produce a 3-by-3 output.
Figure 1: An example of how to factorize a two-dimensional convolution layer. Here the weight tensor of a regular k × k convolution (left) is reshaped into a matrix and factorized into two smaller weight matrices, each corresponding to a one-dimensional convolution performed along one dimension of an image (right).

Factorized neural layers replace the matrix with a product of matrices; in the simplest case, we have the standard low-rank decomposition \(W\)=\(UV\)\(^T\) for matrices \(U\)∈ \(\mathbb{R}\)\(^m\)\(^x\)\(^r\), \(V\)∈ \(\mathbb{R}\)\(^n\)\(^x\)\(^r\). In the model compression case, we can set \(r\)≪\(m,n\) to improve the complexity of fully connected and convolutional layers from \(O(mn)\) to \(O(r(m+n))\); as shown in Figure 1, for two-dimensional convolutions, this speedup is achieved using two one-dimensional convolutions, one along each spatial dimension. In the case of Transformers, which take as input a sequence \(x\) with hidden dimension \(d\), we can express the action of a multi-head attention (MHA) operation parameterized by \(H\) query \((Q_h)\), key \((K_h)\), value \((V_h)\), and output \((O_h)\) projection matrices as the following summation over its heads \(h\)=1,…,\(H\):

This formulation makes clear that MHA is an aggregation over \(2H\) factorized neural layers with rank \(r=d/H\), half of them parameterized by \(Q_h\) \(Kh^T\) and the other half by \(V_h\) \(O_h^T\).

How should training routines handle factorized neural layers?

It’s straightforward to apply standard deep network training algorithms such as stochastic gradient descent (SGD) to networks with factorized layers. However, modern optimization procedures have many critical aspects, such as initialization and regularization, that significantly influence the performance of the final model. When we factorize a network, the effect of these components can change substantially, so we often can’t use the same settings as for the unfactorized model. For example, in the unfactorized case, the standard weight decay regularization penalizes the squared Frobenius norm (opens in new tab) ||\(W||_{Fro}^{2}\) of each weight matrix in the model; however, if we directly apply this to the factors \(U\) and \(V\) in a factorized layer, we end up penalizing an upper bound on twice the nuclear norm (opens in new tab) of their product:

As indicated in the figure below, this upper bound is tight when training a factorized ResNet using regular weight decay, showing that it effectively penalizes the nuclear norm rather than the Frobenius norm.

Three line plots side by side, each with “training epoch” on the x-axis and “average across layers” on the y-axis. The first, titled “Kaiming init, different regularizers,” has three solid lines representing nuclear norm: one in black and labeled “no-decay,” one in green and labeled “weight decay,” and one in blue and labeled “Frobenius decay,” with the first above the second and the second above the third. It also has three unlabeled dotted lines representing the nuclear norm upper bound corresponding to each solid line. Those corresponding to “no decay” and “weight decay” are very close to their respective solid lines, while the dotted line corresponding to “Frobenius” decay is not. The second plot, titled “Spectral init, different regularizers,” shows the same result. The last plot, titled “Weight decay, different rank scales,” has six pairs of solid and dotted lines labeled and color coordinated by powers of three from –3 to 2; in all cases, the dotted lines are very close to their respective solid lines.
Figure 2: Comparison of nuclear norm (solid line) and nuclear norm upper bound penalized by weight decay on individual factors (dotted line) during the training of ResNet20 on CIFAR-10, showing that for most of training, weight decay is effectively penalizing the nuclear norm.

This example demonstrates how factorizing the model without modifying the training routine can lead to substantial changes in training behavior. Thus, following the principle that the training behavior of a factorized model should mimic that of the original model to recover the latter’s performance, we argue in favor of using spectral initialization (SI) and Frobenius decay (FD) when training factorized neural layers. SI initializes \(U\) and \(V\) by applying a rank-deficient singular value decomposition (SVD) to the standard random initialization of the corresponding full-rank weight matrix \(W\) in the original model, while FD replaces weight decay by penalizing the squared Frobenius norm ‖\(UV\)\(^T\)‖\(_{Fro}^{2}\) rather than the sum of squared norms of the individual factors.

In our paper, we extend a recent approximate analysis of weight decay (opens in new tab) to show that these two schemes allow SGD to maintain a high effective step size when training factorized neural layers, mimicking the role of regular weight decay applied to the original models. More informally, SI follows the mimicking principle since the SVD provably returns the best factorized approximation of initialization used by the original model. FD follows the mimicking principle by regularizing the squared Frobenius norm of the product, as is done by weight decay applied to the original model; in contrast, applying weight decay to the individual factors implicitly regularizes the nuclear norm, as shown in the figure above. Together, these two simple, efficient modifications lead to substantial improvements in model compression, knowledge distillation, and the training of Transformer-based architectures.

Factorization vs. sparsity-based methods for model compression

As mentioned above, sparsity-based approaches such as pruning and guessing lottery tickets represent the prevailing trend in model compression. However, in this section, we show that low-rank factorization is competitive with these methods and can even attain better accuracy at the same parameter count. In particular, the figure below demonstrates that, in the case of low-memory ResNets, factorized neural layers outperform not only sparse (low-memory) training methods like guessing lottery tickets and dynamic sparsity (opens in new tab) but often even full model training followed by pruning. The model—a modified ResNet32—and datasets—CIFAR-10, CIFAR-100, and Tiny-ImageNet—we use reflect the same standard setup considered by the lottery ticket papers GraSP and hybrid tickets.

Three bar graphs side by side, each with “accuracy” on the y-axis and the following data categories corresponding to a specific color bar on the x-axis: original model (black); pruning (blue with black dots); lottery ticket (blue with black dots); dynamic sparsity (blue); fixed sparsity (guessing tickets) (blue); low-rank (orange); and low-rank plus spectral initialization and Frobenius decay (orange).

In the first, titled “CIFAR-10,” the original model is 94.49 percent accurate and then, in order of highest accuracy: low-rank plus spectral initialization and Frobenius decay, 94.34; pruning, 94.21; lottery ticket, 94.14; low-rank, 93.59; dynamic sparsity, 92.97; and fixed sparsity (guessing tickets), 92.97.

In the second, titled “CIFAR-100,” the original model is 75.41 percent accurate and then, in order of highest accuracy: low-rank plus spectral initialization and Frobenius decay, 74.41; low-rank, 72.71; lottery ticket, 72.41; pruning, 72.34; fixed sparsity (guessing tickets), 69.70; and dynamic sparsity, 69.66.

In the third, titled “Tiny-ImageNet,” the original model is 63.02 percent accurate and then, in order of highest accuracy: low-rank plus spectral initialization and Frobenius decay, 60.25; low-rank, 58.72; lottery ticket, 57.77; pruning, 57.62; dynamic sparsity, 57.19; and fixed sparsity (guessing tickets), 55.53.
Figure 3: Comparison of the accuracy of several training methods: full model training (black); the sparsity-based approaches full model pruning, in which the full model is trained and then pruned (blue with black dots), and sparse training, in which a sparse model is trained (blue); and low-rank factorized layers (orange). When properly initialized and regularized, low-rank training, particularly with spectral initialization and Frobenius decay, achieves better performance at 10 percent compression than sparsity-based methods.

This superior performance depends critically on the use of spectral initialization and Frobenius decay in tandem; interestingly, in the paper, we find that both schemes decrease accuracy when used independently. While factorization isn’t always best at very high compression rates, the figure above shows that it’s clearly superior in the standard lottery ticket regime, when the compressed model accuracy is close to that of the original model (usually this means the compressed model has 10 percent or more of the original parameters). Notably, our approach doesn’t require any additional tuning, as the decay coefficient used for the uncompressed model is the same one used by FD. Thus, factorized neural layers serve as a strong, simple baseline regardless of whether we’re targeting memory savings or fast computation.

Teacher-free teaching with overcomplete knowledge distillation

While our motivating application is model compression, factorized neural layers can also be found in other applications, such as knowledge distillation, a field that studies how to obtain a better small model by “teaching” it using a more powerful large model. Recent work on using overparameterization to train compact networks (opens in new tab) suggests that factorized neural layers can help us avoid the (expensive) two-stage student-teacher training process of knowledge distillation by using an overcomplete factorization of each of the student’s weight matrices \(W\)=\(UV\)\(^T\), where matrix factors \(U\)∈ \(\mathbb{R}\)\(^m\)\(^x\)\(^r\),\(V\)∈ \(\mathbb{R}\)\(^n\)\(^x\)\(^r\) have inner dimension \(r\)≥\(m,n\), making the factors wider than the original weight matrix, or even by using a deep factorization \(W\)=\(UMV\)\(^T\). The goal is to take advantage of the training benefits of deeper or wider networks suggested by recent theoretical work (opens in new tab) while not actually increasing the model capacity. While this increases the parameter count during training, afterward we can directly obtain a network of the original size by multiplying the factors together. We refer to this technique as overcomplete knowledge distillation.

A flow chart of low-rank model compression and overcomplete knowledge distillation applied to a convolutional neural network. On the left, an image of a frog is passed through three weight matrices, each followed by a nonlinearity, and then on to a classifier that labels the image a “frog.” The second weight matrix has two arrows pointing from it, both labeled “factorize during training.” The first points to the low-rank model compression pipeline. Two rectangles represent weight matrices: one labeled “U” and the other “V superscript T,” with U being tall and narrow and V superscript T being short and wide. Below the image is the label “($\ll mn$ parameters).” Another arrow labeled “keep factorized during inference” points to the same image. The second arrow labeled “factorize during training” points to the overcomplete knowledge distillation pipeline. Two rectangles represent weight matrices: one labeled “U” and the other “V superscript T,” with U being short and wide and V superscript T being tall and narrow. Below the image is the label ($>mn$ parameters). An arrow labeled “multiply back before inference” points to a square labeled “UV superscript T” with the caption “(mn parameters).”
Figure 4: Comparison between the model compression (top) and knowledge distillation (bottom) pipelines. While both train a factorized model, in overcomplete knowledge distillation, the factors are remultiplied after training to bring the parameter count back to that of a normal model. This is a teacher-free way of taking advantage of the training benefits of larger models without suffering their larger inference cost.

Since spectral initialization isn’t applicable when the decomposition is overcomplete, we investigate the effect of Frobenius decay alone. We find that this regularization is critical for overcomplete knowledge distillation using ResNets trained on CIFAR data; in fact, overparameterizing and applying regular weight decay decreases model accuracy. On the other hand, using FD, we can train an overparameterized ResNet56 (1.7 million parameters during training/850,000 parameters at inference time) that matches the performance of ResNet110 (1.7 million parameters at both training and inference). Furthermore, training the overcomplete ResNet56 is 1.5 times faster than training the regular ResNet110. These results are the first successful application of overcomplete knowledge distillation for large-depth neural networks on CIFAR.

Factorization-aware training of Transformers

Finally, we apply spectral initialization and Frobenius decay in the MHA module of Transformer-based architectures. While we can show that this indeed helps when training such models using regular SGD, large-scale unsupervised models, such as BERT, are usually trained using variants of adaptive algorithms such as Adam (opens in new tab)or LAMB (opens in new tab)that “decouple (opens in new tab)” weight decay by defining it as an operation that subtracts a small constant times the parameter from itself before taking a gradient step. While the equivalence breaks for adaptive methods, in the case of SGD, this is equivalent to adding a squared Frobenius penalty term to the objective. Thus for adaptive algorithms, we can devise a similar “decoupled” FD scheme that subtracts a small constant times \(UV\)\(^T\) \(V\)=∇\(_U\)\(_{2}^{1/}\) ‖\(UV\)\(^T\) ‖ \(_{Fro}^{2}\) and \(VU\)\(^T\) \(U\)=∇\(_V\)\(_{2}^{1/}\) ‖\(UV\)\(^T\) ‖\(_{Fro}^{2}\) from the factors \(U\) and \(V\), respectively.

Figure 5: A diagram of how multi-head self-attention implicitly consists of 2H factorized neural layers. Specifically, multi-head attention is a sum over H attention heads (orange), each a matrix product of two terms: the first produced by the low-rank bilinear form Qh KhT and the second by the low-rank linear transform Vh OhT

Applying FD in this manner to the MHA module when pretraining BERT-Base (110 million parameters) on unsupervised data yields a better downstream evaluation on the SQuAD task. When the MHA embedding dimension is halved (14.3 million fewer parameters), the advantage of FD continues to hold. On BERT-Large (340 million parameters), we’re able to halve the MHA embedding dimension (44.2 million fewer parameters) while losing less than a point in terms of F1-score.

Next steps in efficient model training

By studying how a standard initialization scheme and standard regularization scheme need to be modified to handle factorized layers, we were able to obtain better algorithms for learning efficient models, knowledge distillation, and training Transformers. We hope these results spur more investigation into how different components of standard training pipelines behave when training efficient models from scratch. For example:

Virtual speaker series
Would you like to know more about AutoML research and its community at Microsoft? Join us for our free virtual speaker series “Directions in ML: AutoML and Automating Algorithms.” Learn more and register.

We’re excited by how this work can advance cost-effective, high-performing models capable of running on any device. We’re also exploring the link between training compute time and energy consumption. Compute power for model training has been doubling at a rapid pace (opens in new tab). By reducing the required training time, our method for directly training smaller models raises the possibility of reducing energy consumption, as well. To learn more about the connections between compute and model architectures and power demands, check out the panel “Frontiers in Machine Learning: Climate Impact of Machine Learning (opens in new tab).”

This work was led by Carnegie Mellon University PhD student Misha Khodak, in collaboration with Neil Tenenholtz, Lester Mackey, and Nicolò Fusi, during a Microsoft Research internship.

Related publications

Continue reading

See all blog posts