A Deep Learning Theory: Global minima and over-parameterization
One empirical finding in deep learning is that simple methods such as stochastic gradient descent (SGD) have a remarkable ability to fit training data. From a capacity perspective, this may not be surprising: modern neural networks are heavily over-parameterized, with far more trainable parameters than training samples, so in principle they have enough capacity to memorize the data.
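To make this finding concrete, here is a minimal sketch (using PyTorch; the architecture, width, and hyperparameters are illustrative choices, not from the original) of the phenomenon: an over-parameterized network trained with plain SGD drives training error to zero even on randomly labeled data, where there is no signal to learn at all.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, classes = 200, 32, 10          # 200 samples of dimension 32
X = torch.randn(n, d)                # random inputs
y = torch.randint(0, classes, (n,))  # random labels: nothing to "learn"

# Hidden width 2048 gives far more parameters than training samples,
# i.e. the over-parameterized regime.
model = nn.Sequential(
    nn.Linear(d, 2048), nn.ReLU(),
    nn.Linear(2048, classes),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Plain full-batch SGD on the training objective.
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"final loss {loss.item():.4f}, training accuracy {train_acc:.2f}")
# Typically reaches 100% training accuracy: SGD finds a (near-)global
# minimum of the training objective despite the labels being pure noise.
```

Capacity alone explains why such a memorizing solution *exists*; the theoretical puzzle this line of work addresses is why a simple first-order method like SGD reliably *finds* a global minimum of this highly non-convex objective.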