Lecture 5: Stochastic gradient descent

CS4787 — Principles of Large-Scale Machine Learning Systems

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$

In [1]:
import numpy
import scipy
import matplotlib
from matplotlib import pyplot
import time

Where we left off...

Gradient descent converges. For strongly convex loss functions, with an appropriate step size setting, it is guaranteed to reach an objective gap of no more than $\epsilon$ (i.e. $f(w_T) - f^* \le \epsilon$) after $T$ iterations if

$$T \ge \kappa \cdot \log\left( \frac{f(w_0) - f^*}{\epsilon} \right)$$

where $\kappa$ is the condition number, which measures how hard the problem is to solve. We also saw that GD converges even under weaker conditions, where all we have is an $L$-smoothness bound; in that case, with the largest allowable step size $\alpha = 1/L$, we get

$$\min_{t \in \{0,\ldots,T-1\}} \| \nabla f(w_t) \|^2 \le \frac{2L (f(w_0) - f^*)}{T}.$$
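As a quick refresher before we modify it, here is a minimal sketch of the plain gradient descent loop in numpy. The name `grad_f` and the hyperparameters are placeholders for illustration, not code from the course.

In [ ]:
import numpy

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Plain gradient descent: w_{t+1} = w_t - alpha * grad f(w_t).

    grad_f(w) should return the gradient of the full empirical risk at w.
    """
    w = numpy.copy(w0)
    for t in range(num_iters):
        w = w - alpha * grad_f(w)  # one full-gradient step with constant step size
    return w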

Stochastic Gradient Descent

Basic idea: in gradient descent, just replace the full gradient (which is a sum over all training examples) with the gradient of a single example. Initialize the parameters at some value $w_0 \in \R^d$, and decrease the value of the empirical risk iteratively by sampling a random index $\tilde i_t$ uniformly from $\{1, \ldots, n\}$ and then updating

$$w_{t+1} = w_t - \alpha_t \cdot \nabla f_{\tilde i_t}(w_t) = w_t - \alpha_t \cdot \nabla \ell(w_t; x_{\tilde i_t}, y_{\tilde i_t})$$

where as usual $w_t$ is the value of the parameter vector at time $t$, $\alpha_t$ is the learning rate or step size, and $\nabla f_i$ denotes the gradient of the loss on the $i$th training example. Compared with gradient descent and Newton's method, SGD is simple to implement and runs each iteration faster.
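Here is a minimal sketch of this update in numpy, assuming the per-example gradient is available as a function `grad_fi(w, i)` (a placeholder name, not part of the course code):

In [ ]:
import numpy

def sgd(grad_fi, w0, alpha, n, num_iters, seed=0):
    """Stochastic gradient descent with a constant step size alpha.

    grad_fi(w, i) should return the gradient of the loss on training example i.
    """
    rng = numpy.random.default_rng(seed)
    w = numpy.copy(w0)
    for t in range(num_iters):
        i = rng.integers(n)            # sample an index uniformly from {0, ..., n-1}
        w = w - alpha * grad_fi(w, i)  # step using only this one example's gradient
    return w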

A potential objection!

This update is not necessarily going to decrease the loss at every step!

  • We're just moving in a direction that decreases the loss on one particular example, which won't necessarily decrease the total loss!

  • So we can't demonstrate convergence by using a proof like the one we used for gradient descent, where we showed that the loss decreases at every iteration of the algorithm.

  • The fact that SGD doesn't always improve the loss at each iteration motivates the question: does SGD even work? And if so, why does SGD work?
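To make the first point concrete, here is a small experiment on a toy least-squares problem (made up here just for illustration; it is not part of the course materials). It plots the full empirical risk along the SGD trajectory: individual steps sometimes increase it, but it still trends downward.

In [ ]:
# Toy least-squares problem: f_i(w) = (x_i^T w - y_i)^2 / 2
import numpy
from matplotlib import pyplot

rng = numpy.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def full_loss(w):
    # full empirical risk: average of the per-example losses
    return numpy.mean((X @ w - y) ** 2) / 2

w = numpy.zeros(d)
alpha = 0.05
losses = []
for t in range(300):
    i = rng.integers(n)                # pick one example uniformly at random
    grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of that single example's loss
    w = w - alpha * grad_i
    losses.append(full_loss(w))

pyplot.plot(losses)
pyplot.xlabel("iteration")
pyplot.ylabel("full empirical risk")
pyplot.show()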

This time, let's do the derivation on the whiteboard!

If you want the full thing in text form, it's in the notes.
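As a preview of the key step (stated here informally; the full derivation is in the notes), the standard argument rests on the fact that the stochastic gradient is an unbiased estimate of the full gradient. If the empirical risk is $f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)$ and $\tilde i_t$ is sampled uniformly from $\{1, \ldots, n\}$, then

$$\mathbb{E}\left[ \nabla f_{\tilde i_t}(w_t) \,\middle|\, w_t \right] = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w_t) = \nabla f(w_t),$$

so in expectation each SGD step moves in the same direction as a gradient descent step.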

In [ ]: