Convergence Analysis of Two-layer Neural Networks with ReLU Activation



Yang Yuan

Monday, April 17th
4:00pm 122 Gates Hall




In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, a formal theoretical understanding of when and why SGD can train neural networks in practice is largely missing.

In this paper, we shed light on this mystery by providing a convergence analysis for SGD on two-layer feedforward networks with ReLU activations. We prove that, with standard O(1/\sqrt{d}) weight initialization and a "residual link", SGD converges to the global minimum in a polynomial number of steps. Unlike traditional theorems, our convergence proof has two phases. In phase I, a potential function g gradually decreases. Then in phase II, SGD enters a nice one-point convex region and converges. We also show that the residual link is necessary for convergence, as it moves the initial point to a better place for optimization. Experiments verify our claims.
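To make the setting concrete, the following is a minimal sketch of the kind of model and training loop the abstract describes: a two-layer ReLU network with O(1/\sqrt{d}) weight initialization, a residual (identity) link on the hidden layer, and plain one-sample-per-step SGD on a squared loss against a teacher network. The exact architecture, loss, and data distribution in the paper may differ; the teacher setup, the fixed all-ones second layer, and all hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20           # input dimension (hidden width equals d so the identity link fits)
lr = 0.002       # conservative step size for the sketch
steps = 3000

# Hypothetical "teacher" network with the same architecture (assumed, not from the paper).
W_star = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))

def forward(W, x):
    # Hidden pre-activation uses a residual link: (I + W) x instead of W x.
    h = np.maximum(0.0, x @ (np.eye(d) + W).T)
    return h.sum(axis=-1)  # fixed second layer of all-ones weights

# Standard O(1/sqrt(d)) initialization for the trainable first-layer weights.
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))

for _ in range(steps):
    x = rng.normal(size=d)                       # one fresh sample per SGD step
    err = forward(W, x) - forward(W_star, x)     # residual of the squared loss
    # Subgradient of 0.5 * err**2 w.r.t. W (ReLU subgradient at 0 taken as 0):
    # the gradient row i is active only if hidden unit i fires.
    active = ((np.eye(d) + W) @ x > 0).astype(float)
    W -= lr * err * np.outer(active, x)

# Held-out error after training, for inspection.
xs = rng.normal(size=(200, d))
err_after = np.mean((forward(W, xs) - forward(W_star, xs)) ** 2)
```

The residual link keeps the hidden pre-activations close to the raw input at initialization, which (per the abstract) places the starting point in a region from which SGD can make progress.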

To the best of our knowledge, this is the first convergence result for SGD on neural networks with nonlinear activations.

This is joint work with Yuanzhi Li (Princeton).