Microsoft Research Blog

Microsoft Research Blog

The Microsoft Research blog provides in-depth views and perspectives from our researchers, scientists and engineers, plus information about noteworthy events and conferences, scholarships, and fellowships designed for academic and scientific communities.

A Deep Learning Theory: Global minima and over-parameterization

December 10, 2018 | By Zeyuan Allen-Zhu, Senior Researcher; Yuanzhi Li, Microsoft Research Intern; Zhao Song, Microsoft Research Intern


One empirical finding in deep learning is that simple methods such as stochastic gradient descent (SGD) have a remarkable ability to fit training data. From a capacity perspective, this may not be surprising— modern neural networks are heavily over-parameterized, with the number of parameters much larger than the number of training samples. In principle, there exist parameters to achieve 100% accuracy. Yet, from a theory perspective, why and how SGD finds global minima over the set of non-convex and non-smooth neural networks remains somewhat mysterious.

Our team of researchers at Microsoft Research in Redmond, Washington resolved to shed some light on this mystery. Our findings are available on arXiv.

Our main finding demonstrates that, for state-of-the-art network architectures such as fully-connected neural networks, convolutional networks (CNN), or residual networks (Resnet), assuming there are n training samples without duplication, as long as the number of parameters is polynomial in n, first-order methods such as SGD can find global optima of the training objective efficiently, that is, with running time only polynomially dependent on the total number of parameters of the network.

To better understand this finding, the following picture illustrates the main technical contribution of this our work. With the help of over-parameterization, surprisingly, there won’t be local minima blocking the SGD training; this means that as long as the training objective is large, its gradient—that is, backpropagation—also is large. However, this theorem alone does not give us the convergence result, because a neural network with ReLU activation is not even smooth. This suggests that myopically following the (opposite) gradient direction can, in principle, even increase the objective value. Fortunately, the next theorem in our paper showed that over-parameterization also makes the network “semi-smooth.” This implies if one moves in this opposite gradient direction, the objective value can indeed decrease. These two theorems together imply that SGD finds global minima on over-parameterized neural networks.

Landscape of neural network objective near the SGD training curve.

So, what else can we train using SGD, provably? This article only covers feedforward neural networks, such as CNN or ResNet. What about other more complex deep learning architectures? How about recurrent neural networks (RNNs), widely used in natural language processing? RNN is more difficult to analyze than DNN; indeed, we devoted a paper to addressing its training convergence.

Finally, what about getting good accuracy on testing data? This article only covers finding global optimization for the training data. How does the trained network generalize to test data? To properly answer this question, we teamed up with University of Wisconsin–Madison Professor Yingyu Liang to work out a generalization theory for such over-parameterized neural networks trained by SGD. The theory currently only works for neural networks with less than four layers, but more discoveries in deep learning are awaiting!

We would like to express thanks to Microsoft Research colleagues Harry Shum, Ofer Dekel, Sébastien Bubeck and Greg Yang for helpful conversations that made this work possible.

Up Next

animation of reinforcement learning agents beating human competitors in Atari

Artificial intelligence

Finding the best learning targets automatically: Fully Parameterized Quantile Function for distributional RL

Reinforcement learning has achieved great success in game scenarios, with RL agents beating human competitors in such games as Go and poker. Distributional reinforcement learning, in particular, has proven to be an effective approach for training an agent to maximize reward, producing state-of-the-art results on Atari games, which are widely used as benchmarks for testing […]

Li Zhao

Senior Researcher

Artificial intelligence

Metalearned Neural Memory: Teaching neural networks how to remember

Memory is an important part of human intelligence and the human experience. It grounds us in the current moment, helping us understand where we are and, consequently, what we should do next. Consider the simple example of reading a book. The ultimate goal is to understand the story, and memory is the reason we’re able […]

Tsendsuren Munkhdalai

Senior Researcher

Artificial intelligence

Efficient inference for dynamical models using variational autoencoders

Dynamical systems theory provides a mathematical framework for studying how complex systems evolve over time, such as the neurons in our brains, the global climate system, or engineered cells. But predicting how these systems will behave in the future or how they might respond to disruption requires an accurate model. Even writing down a model […]

Neil Dalchau

Principal Scientist