Neural networks have made a startling comeback during the past decade, rebranded as “deep learning.” The empirical success of neural networks is phenomenal but poorly understood. The state of the art seems to change every few months, prompting some to call it alchemy and others to suggest that wholly new mathematical approaches are required to understand neural networks. By contrast, I argue that deep learning can be understood by adapting standard nonparametric statistical theory and methods to the neural network setting. Our main result is this: neural networks are exact solutions to nonparametric estimation problems in “mixed variation” function spaces. These spaces, characterized by notions of total variation in the Radon (transform) domain, include multivariate functions that are very smooth in all but a small number of directions. Spatial inhomogeneity of this sort leads to a fundamental gap between the performance of neural networks and that of linear methods (which include kernel methods), explaining why neural networks can outperform classical methods for high-dimensional function estimation. Our theory provides new insights into the practices of “weight decay” and “overparameterization,” and into the practice of adding linear connections and layers to network architectures. It yields a deeper understanding of the role of sparsity and of how the curse of dimensionality can be avoided. Finally, the theory leads to new and improved neural network architectures and regularization methods.
This talk is based on joint work with Rahul Parhi.