Lecture 5: Scaling to complex models by learning with optimization algorithms¶

CS4787/5777 — Principles of Large-Scale Machine Learning Systems¶

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$

Gradient descent
Convex optimization as a simple model
Conditioning

In [1]:

import numpy
import scipy
import matplotlib
import time

Review: The Empirical Risk¶

Suppose we have a dataset $\mathcal{D} = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \}$, where $x_i \in \mathcal{X}$ is an example and $y_i \in \mathcal{Y}$ is a label. Let $h: \mathcal{X} \rightarrow \mathcal{Y}$ be a hypothesized model (mapping from examples to labels) we are trying to evaluate. Let $L: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}_+$ be a non-negative loss function which measures how different two labels are.

The empirical risk is $$ R(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i). $$ Most notions of error or accuracy in machine learning can be captured with an empirical risk. A simple example: measure the error rate on the dataset with the 0-1 Loss Function $$ L(\hat y, y) = 0 \text{ if } \hat y = y \text{ and } 1 \text{ if } \hat y \ne y. $$

What are other examples you've seen of empirical risk and loss functions?¶

Loss function:

Mean squared error
Hinge loss
Cross Entropy loss
Absolute error
Negative log likelihood
ROC

Regularizers:

$\ell_2$ and $\ell_1$ regularization

The Cost of Computing the Empirical Risk¶

$$ R(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i). $$

We need to compute the empirical risk a lot during training, both during validation (and hyperparameter optimization) and testing, so it's nice if we can do it fast.

Question: how does computing the empirical risk scale? Things affect the cost of computation.

The Cost of Computing the Empirical Risk¶

$$ R(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i). $$

We need to compute the empirical risk a lot during training, both during validation (and hyperparameter optimization) and testing, so it's nice if we can do it fast.

Question: how does computing the empirical risk scale? Three things affect the cost of computation.

The number of training examples $n$. The cost will certainly be proportional to $n$.
The cost to compute the loss function $L$.
The cost to evaluate the hypothesis $h$.

Review: Empirical Risk Minimization¶

We don't just want to compute the empirical risk: we also want to minimize it. To do so, we parameterize the hypotheses using some parameters $w \in \R^d$. That is, we assign each hypothesis a $d$-dimensional vector of parameters and vice versa and solve the optimization problem \begin{align*} \mbox{minimize: }& R(h_w) = \frac{1}{n} \sum_{i=1}^n L(h_w(x_i), y_i) = f(w) = \frac{1}{n} \sum_{x \in \mathcal{D}} f(w; x) \\ \mbox{ over }& w \in \R^d \end{align*} where $h_w$ denotes the hypothesis associated with the parameter vector $w$ and $\mathcal{D}$ denotes the dataset. This is an instance of a principle of scalable ML:

Write your learning task as an optimization problem, then solve it with an optimization algorithm.

Gradient descent (GD).¶

Initialize the parameters at some value $w_0 \in \R^d$, and decrease the value of the empirical risk iteratively by running $$ w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t) = w_t - \alpha_t \cdot \frac{1}{n} \sum_{x \in \mathcal{D}} \nabla f(w; x) $$ where $w_t$ is the value of the parameter vector at time $t$, $\alpha_t$ is a parameter called the learning rate or step size, and $\nabla f$ denotes the gradient (vector of partial derivatives) of $f$.

Gradient descent is not the only classic optimization method though...

Another iterative method of optimization: Newton's method.¶

Initialize the parameters at some value $w_0 \in \R^d$, and decrease the value of the empirical risk iteratively by running $$ w_{t+1} = w_t - \left( \nabla^2 f(w_t) \right)^{-1} \nabla f(w_t). $$ Newton's method can converge more quickly than gradient descent, because it is a second-order optimization method (it uses the second derivative of $f$) whereas gradient descent is only a first-order method (it uses the first derivative of $f$).

Question: what is the computational cost of running a single step of gradient descent and Newton's method? Not including the training set, how much memory is required?

Suppose that computing gradients of one example takes $\Theta(d)$ time and computing the Hessian of one example takes $\Theta(d^2)$ time, and express your answer in terms $n$ and $d$.

Gradient descent: $ w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t) = w_t - \alpha_t \cdot \frac{1}{n} \sum_{x \in \mathcal{D}} \nabla f(w; x) $ Time? Memory?

Time: $O(nd)$

Memory: $O(d)$

Newton's method: $ w_{t+1} = w_t - \left( \nabla^2 f(w_t) \right)^{-1} \nabla f(w_t). $ Time? Memory?

Time: $O(nd^2 + d^3) = O(nd + nd^2 + d^3 + d^2 + d)$

Memory: $O(d^2)$

Why does gradient descent work?¶

...and what does it mean for it to work?

Intuitively, GD decreases the value of the objective at each iteration, as long as the learning rate is small enough and the value of the gradient is nonzero. Eventually, this should result in gradient descent coming close to a point where the gradient is zero. We can prove that gradient descent works under the assumption that the second derivative of the objective is bounded (this is sometimes called an $L$-smoothness bound).

Why does this make sense?
- The second derivative being bounded means that our evaluation of the gradient at $w$ is a good approximation for the gradient nearby $w$
There are other assumptions we could use here also, but this is the simplest one.

Lipschitz continuous gradient assumption (or the $L$-smoothness assumption)¶

Suppose that for some constant $L > 0$, for all $x$ and $y$ in $\R^d$,

$$ \| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|. $$

Here, $\| \cdot \|$ denotes the $\ell_2$ norm: $\|u\| = \sqrt{\sum_{i=1}^d u_i^2}$.

It's saying the the gradient can’t change arbitrarily fast
This is equivalent to the condition that $\| \nabla^2 f(x) \|_2 \le L$ (where the norm here denotes the induced norm) if the function is twice-differentiable.

Starting from this condition, let's look at how the objective changes over time as we run gradient descent with a fixed step size.

We will first prove the descent lemma, that is, for all $x,y$,

$$ f(y) \leq f(x) + \langle \nabla f(x),\, y-x \rangle + \frac{L}{2} \norm{y-x}^2 $$

Consider the univariate function and its derivative $$ \begin{align} \rho(\tau) &= f(x+\tau(y-x)) \\ \rho'(\tau) &= \langle \nabla f(x+\tau(y-x)),\, y-x\rangle \end{align} $$

By the Fundamental Theorem of Calculus,

$$ \rho(1) - \rho(0) = \int_0^1 \rho'(\tau) \, d\tau $$

If we substitute back in our definition of $\rho$, we get $$ \begin{align} f(y) - f(x) &= \int_0^1 \langle \nabla f(x+\tau(y-x)),\, y-x> \,d\tau \\ f(y) &= f(x) + \langle \nabla f(x),\, y-x \rangle + \int_0^1 \langle \nabla f(x+\tau(y-x)) - \nabla f(x) ,\, y-x> \,d\tau \end{align} $$ where in the last step we rearranged, followed by an add and subtract of $\langle \nabla f(x),\, y-x \rangle$.

We can now simplify this as follows. Keep in mind: we assume that for all $x,y$, $\norm{\nabla f(x) - \nabla f(y)} \leq L \norm{x-y}$. $$ \begin{align} f(y) &= f(x) + \langle \nabla f(x),\, y-x \rangle + \int_0^1 \langle \nabla f(x+\tau(y-x)) - \nabla f(x) ,\, y-x> \,d\tau \\ &\leq f(x) + \langle \nabla f(x),\, y-x \rangle + \int_0^1 \norm{ \nabla f(x+\tau(y-x)) - \nabla f(x)} \norm{y-x} \,d\tau \\ &\leq f(x) + \langle \nabla f(x),\, y-x \rangle +\int_0^1 L \norm{ x + \tau(y-x) - x } \norm{y-x} \,d\tau \\ &= f(x) + \langle \nabla f(x),\, y-x \rangle + L\norm{y-x}^2 \int_0^1 \tau \,d\tau \\ \end{align} $$ where the first inequality comes from Cauchy-Schwarz inequality $a^\top b \leq \norm{a}\norm{b}$, and the second comes from our $L$-Lipschitz continuous gradient assumption. which gives us the descent lemma $$ f(y) \leq f(x) + \langle \nabla f(x),\, y-x \rangle + \frac{L}{2} \norm{y-x}^2. $$

Descent lemma: Lipschitz conitnuous gradient implies a global quadratic upper bound¶

$$ f(y) \leq f(x) + \langle \nabla f(x),\, y-x \rangle + \frac{L}{2} \norm{y-x}^2. $$ No description has been provided for this image

(Figure shamelessly stolen from Mark Schmidt)

Why the descent lemma?¶

$$ f(y) \leq f(x) + \langle \nabla f(x),\, y-x \rangle + \frac{L}{2} \norm{y-x}^2. $$

Since it holds for all $x,y$, we can plug in $w_t$ for $x$ and $w_{t+1}$ for $y$, and plug in the GD update $w_{t+1} = w_t - \alpha \nabla f(w_t)$, $$ \begin{align} f(w_{t+1}) &\leq f(w_t) + \langle \nabla f(w_t), \, w_{t+1} - w_t \rangle + \frac{L}{2} \norm{w_{t+1} - w_t}^2 \\ &= f(w_t) - \alpha \langle \nabla f(w_t), \, \nabla f(w_t) \rangle + \frac{L}{2} \alpha^2 \norm{\nabla f(w_t)}^2 \\ &= f(w_t) - \alpha\left(1-\frac{1}{2} \alpha L \right) \norm{\nabla f(w_t)}^2 \end{align} $$

Why the descent lemma?¶

So, we got to $$ \begin{align} f(w_{t+1}) &\leq f(w_t) - \alpha\left(1-\frac{1}{2} \alpha L \right) \norm{\nabla f(w_t)}^2 \end{align} $$ If we choose our step size $\alpha$ to be smalle nough such that $\alpha L \leq 1$, then $$f(w_{t+1}) \le f(w_t)

\frac{\alpha}{2} \cdot | \nabla f(w_t) |^2$$

That is, the objective is guaranteed to decrease at each iteration! (Hence, descent)

This matches our intuition for why GD should work.

Convergence of GD¶

$$f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \cdot \| \nabla f(w_t) \|^2$$

Now, if we sum this up across $T$ iterations of gradient descent, we get

$$ \frac{1}{2} \alpha \sum_{t = 0}^{T-1} \norm{ \nabla f(w_t) }^2 \le \sum_{t = 0}^{T-1} \left( f(w_t) - f(w_{t+1}) \right) = f(w_0) - f(w_T) \le f(w_0) - f^* $$

where $f^*$ is the global minimum value of the loss function $f$. (Usually $f^* \ge 0$.) From here, we can get

$$ \min_{t \in \{0, \ldots, T\}} \norm{ \nabla f(w_t) }^2 \le \frac{1}{T} \sum_{t = 0}^{T-1} \norm{ \nabla f(w_t) }^2 \le \frac{2 (f(w_0) - f^*)}{\alpha T}. $$

This means that the smallest gradient we observe after $T$ iterations is getting smaller proportional to $1/T$. So gradient descent converges...

as long as we look at the smallest observed gradient
and we care about finding a point where the gradient is small

Question: does this ensure that gradient descent converges to the global optimum (i.e. the value of $w$ that minimizes $f$ over all $w \in \R^d$)?

...

Question: does this ensure that gradient descent converges to the global optimum (i.e. the value of $w$ that minimizes $f$ over all $w \in \R^d$)?

No, because $f$ can have multiple local minima, and $\norm{\nabla f(w)} = 0$ only implies a first-order stationary point. It can even be a local maximum or saddle point!

When can we ensure that gradient descent does converge to the unique global optimum?¶

Certainly, we can do this when there is only one global optimum.

The simplest case of this is the case of convex objectives.

Convex Functions¶

A function $f: \R^d \rightarrow \R$ is convex if for all $x, y \in \R^d$, and all $\eta \in [0,1]$

$$f(\eta x + (1 - \eta) y) \le \eta f(x) + (1 - \eta) f(y).$$

What this means graphically is that if we draw a line segment between any two points in the graph of the function, that line segment will lie above the function. There are a bunch of equivalent conditions that work if the function is differentiable.

(Figure shamelessly stolen from Mark Schmidt)

Convex Functions¶

In terms of the gradient, for all $x, y \in \R^d$,

$$(x - y)^T \left( \nabla f(x) - \nabla f(y) \right) \ge 0,$$

and in terms of the Hessian, for all $x \in \R^d$ and all $u \in \R^d$

$$u^T \nabla^2 f(x) u \ge 0;$$

this is equivalent to saying that the Hessian is positive semidefinite.

Question: What are some examples of hypotheses and loss functions that result in a convex objective??

$$ R(w) = \frac{1}{n} \sum_{i=1}^n L(h(w;x_i), y_i). $$

...

Question: What are some examples of hypotheses and loss functions that result in a convex objective??

$$ R(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i). $$

Linear regression $R(w) = \frac{1}{n} \sum_{i=1}^n (w^T x_i - y_i)^2$
Lasso $R(w) = \frac{1}{n} \sum_{i=1}^n |w^T x_i - y_i|$
SVM $R(w) = \frac{1}{n}\sum_{i=1}^n \max\{0,\, 1-y_iw^Tx_i \} + \frac{\lambda}{2}\norm{w}^2$
...

An even easier case than convexity: strong convexity.

A function is strongly convex with parameter $\mu > 0$ if for all $x, y \in \R^d$,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2} \norm{x - y}^2$$ No description has been provided for this image

A global quadratic lower bound!

(Figure shamelessly stolen from Mark Schmidt)

An even easier case than convexity: strong convexity.

or equivalently, for all $x \in \R^d$ and all $u \in \R^d$

$$u^T \nabla^2 f(x) u \ge \mu \norm{u}^2.$$

One (somewhat weaker) condition that is implied by strong convexity is

$$\norm{\nabla f(x)}^2 \ge 2 \mu \left( f(x) - f^* \right);$$

this is sometimes called the Polyak-Łojasiewicz condition.

A useful note: you can make any convex objective strongly convex by adding $\ell_2$ regularization.

Gradient descent on strongly convex objectives.¶

As before, let's look at how the objective changes over time as we run gradient descent with a fixed step size.

This is a standard approach when analyzing an iterative algorithm like gradient descent.

From our descent lemma earlier, we had (as long as $1 \ge \alpha L$)

$$f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \cdot \| \nabla f(w_t) \|^2.$$

Now applying the Polyak-Lojasiewicz condition condition to this gives us

$$ f(w_{t+1}) \le f(w_t) - \alpha \mu \left( f(w_t) - f^* \right).$$

Subtracting the global optimum $f^*$ from both sides produces

$$ f(w_{t+1}) - f^* \le f(w_t) - f^* - \alpha \mu \left( f(w_t) - f^* \right) = (1 - \alpha \mu) \left( f(w_t) - f^* \right). $$

So we got that

$$f(w_{t+1}) - f^* \le (1 - \alpha \mu) \left( f(w_t) - f^* \right).$$

But this means that

\begin{align*} f(w_{t+2}) - f^* &\le (1 - \alpha \mu) \left( f(w_{t+1}) - f^* \right) \\&\le (1 - \alpha \mu) \left( (1 - \alpha \mu) \left( f(w_t) - f^* \right) \right) \\&\le (1 - \alpha \mu)^2 \left( f(w_t) - f^* \right). \end{align*}

Applying this inductively, after $K$ total steps

$$f(w_K) - f^* \le (1 - \alpha \mu)^K \left( f(w_0) - f^* \right) \le \exp(-\alpha \mu K) \cdot \left( f(w_0) - f^* \right).$$

This shows that, for strongly convex functions, gradient descent with a constant step size converges exponentially quickly to the optimum. This is sometimes called convergence at a linear rate.

If we use the largest step size that satisfies our earlier assumption that $1 \ge \alpha L$ (i.e. $\alpha = 1/L$), then this rate becomes

$$ f(w_K) - f^* \le \exp\left(-\frac{\mu K}{L} \right) \cdot \left( f(w_0) - f^* \right).$$

Equivalently, in order to ensure that $f(w_K) - f^*$ for some target error $\epsilon > 0$, it suffices to set the number of iterations $K$ large enough that

$$K \ge \frac{L}{\mu} \cdot \log\left( \frac{f(w_0) - f^*}{\epsilon} \right).$$

The main thing that affects how large $K$ needs to be here is the ratio $L / \mu$, which is a function just of the problem itself and not of how we initialize or how accurate we want our solutions to be. We call the ratio

$$\kappa = \frac{L}{\mu}$$

the condition number of the problem. The condition number encodes how hard a strongly convex problem is to solve.

Observe that this ration is invariant to scaling of the objective function $f$.

When the condition number is very large, the problem can be very hard to solve, and take many iterations of gradient descent to get close to the optimum. Even if the cost of running gradient descent is tractible as we scale, if scaling a problem causes the condition number to blow up, this can cause issues!

An aside...why do we need sufficiently small step size?¶

With too large a step size, we can overshoot the optimum!

Consider the following case of $f(w) = \frac{1}{2} w^2$. Here $f'(w) = \nabla f(w) = w$ (and $f''(w) = 1$), so it's $L$-smooth with $L = 1$. Suppose we're at $w_t = 2$ as shown here:

In [2]:

%matplotlib inline

x = numpy.arange(-2,3,0.01)
y = x**2 / 2
pyplot.plot(x,y, label=r"$f(w) = \frac{1}{2}w^2$");
pyplot.scatter([2.0],[2.0**2/2], c="r", zorder=10)
pyplot.legend()

Out[2]:

<matplotlib.legend.Legend at 0x12f9e4fd0>

Here our GD update will be $w_{t+1} = w_t - \alpha f(w_t) = 2 - \alpha \cdot 2$. If we step with $\alpha = 1$, we go directly to the optimum at $w = 0$. But! If we step with a larger $\alpha$, we overshoot, and for $\alpha > 2$, our loss $f(w)$ actually increases. This illustrates why having sufficiently small steps is necessary for our proof.

Why is our L-smoothness assumption necessary?¶

We can also use a simple example to illustrate why assuming some sort of smoothness is necessary to show this result. Consider the (dumb) example of $f(w) = \sqrt{|w|}$. It is differentiable everywhere except at $w=0$.

In [3]:

x = numpy.arange(-3,3,0.01)
y = numpy.sqrt(numpy.abs(x))
w = 2;
prev_w = numpy.array(w)
alpha = 0.5;

for it in range(100):
    w = w - alpha * numpy.sign(w)/(2*numpy.sqrt(numpy.abs(w)))
    prev_w = numpy.append(prev_w,w)

In [4]:

pyplot.plot(x,y, label=r"$f(w) = \sqrt{|w|}$");
pyplot.scatter(prev_w, numpy.sqrt(numpy.abs(prev_w)), s=list(numpy.arange(1,102)*0.5), c="g", zorder=100, alpha=0.5)
pyplot.scatter([w], [numpy.sqrt(numpy.abs(w))], [100], c="r", zorder=100, alpha=0.5)
pyplot.legend()

Out[4]:

<matplotlib.legend.Legend at 0x12fb31090>

In [5]:

prev_w = numpy.append(prev_w,w)
pyplot.plot(prev_w, c="g", label=r"iterates $w_t$")
pyplot.plot(numpy.arange(len(prev_w)), numpy.zeros(len(prev_w)), c="black", linestyle="dotted", label=r"$w^*$")
pyplot.xlabel(r"iterations $t$")
pyplot.ylabel(r"$w_t$")
pyplot.legend()

Out[5]:

<matplotlib.legend.Legend at 0x12fd8a950>

Gradients around $w=0$ changes too fast! For $w>0$, $f(w) = \sqrt{w}$ with $f'(w) = \frac{1}{2\sqrt{w}}$, $|f''(w)| = \frac{1}{4w^{3/2}}$.
As $w\downarrow 0$, $|f''(w)| \to \infty$ so there is no global upper bound for the second derivative, which means we do not have Lipschitz continuous gradients: $$\norm{f'(y)-f'(x)} \leq L \norm{y-x}$$ for all $x,y$
GD diverged once we got close to $0$

Stochastic Gradient Descent¶

Basic idea: in gradient descent, just replace the full gradient (which is a sum) with a single gradient example. Initialize the parameters at some value $w_0 \in \R^d$, and decrease the value of the empirical risk iteratively by sampling a random example $x_{(t)}$ uniformly from the training set and then updating

$w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t; x_{(t)})$

where as usual $w_t$ is the value of the parameter vector at time $t$, $\alpha_t$ is the learning rate or step size, and $\nabla f_i$ denotes the gradient of the loss function of the $i$th training example. Compared with gradient descent and Newton's method, SGD is simple to implement and runs each iteration faster.

A potential objection!¶

This is not necessarily going to be decreasing the loss at every step!

Because we're just moving in a direction that will decrease the loss for one particular example: this won't necessarily decrease the total loss!
So we can't demonstrate convergence by using a proof like the one we used for gradient descent, where we showed that the loss decreases at every iteration of the algorithm.
The fact that SGD doesn't always improve the loss at each iteration motivates the question: does SGD even work? And if so, why does SGD work?

Question: Why might it be fine to get an approximate solution to an optimization problem for training?¶

Because we don't always have to minimize the training loss exactly, as long as the performance on the test set is good enough. In machine learning, generalization matters more than optimization, sadly (for an optimizer).

Understanding why SGD converges¶

Assumption: $f$ is $L$-smooth, i.e. for any $x$ and $y$ $$\| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|.$$

Assumption: loss $f$ is non-negative. (This is without loss of generality if $f$ is bounded from below, since we can always add a constant to $f$ in that case to make it non-negative.)

New Assumption for SGD: the variance of the gradients is bounded. For some constant $\sigma > 0$, if $x$ is drawn uniformly at random from the training set, for any $w$ $$\mathbf{E}_x\left[ \| \nabla f(w; x) - \nabla f(w) \|^2 \right] \le \sigma^2,$$ or equivalently, $$\frac{1}{n} \sum_{x \in \mathcal{D}} \| \nabla f(w; x) - \nabla f(w) \|^2 \le \sigma^2,$$ or also equivalently, $$\mathbf{E}_x\left[ \| \nabla f(w; x) \|^2 \right] \le \| \nabla f(w) \|^2 + \sigma^2.$$

Let's derive the quadratic upper bound again (the "descent" lemma)¶

that is, for all $x,y$,

$$ f(y) \leq f(x) + \langle \nabla f(x),\, y-x \rangle + \frac{L}{2} \norm{y-x}^2 $$

To begin, consider the univariate function and its derivative $$ \begin{align} \rho(\tau) &= f(x+\tau(y-x)) \\ \rho'(\tau) &= \langle \nabla f(x+\tau(y-x)),\, y-x\rangle \end{align} $$

By the Fundamental Theorem of Calculus,

$$ \rho(1) - \rho(0) = \int_0^1 \rho'(\tau) \, d\tau $$

We can now simplify this as follows. Keep in mind: we assume that for all $x,y$, $\norm{\nabla f(x) - \nabla f(y)} \leq L \norm{x-y}$. $$ \begin{align} f(y) &= f(x) + \langle \nabla f(x),\, y-x \rangle + \int_0^1 \langle \nabla f(x+\tau(y-x)) - \nabla f(x) ,\, y-x> \,d\tau \\ &\leq f(x) + \langle \nabla f(x),\, y-x \rangle + \int_0^1 \norm{ \nabla f(x+\tau(y-x)) - \nabla f(x)} \norm{y-x} \,d\tau \\ &\leq f(x) + \langle \nabla f(x),\, y-x \rangle +\int_0^1 L \norm{ x + \tau(y-x) - x } \norm{y-x} \,d\tau \\ &= f(x) + \langle \nabla f(x),\, y-x \rangle + L\norm{y-x}^2 \int_0^1 \tau \,d\tau \\ \end{align} $$ where the first inequality comes from Cauchy-Schwarz inequality $a^\top b \leq \norm{a}\norm{b}$, and the second comes from our $L$-Lipschitz continuous gradient assumption. which gives us the "descent" lemma (although, for SGD, we no longer guarantee descent) $$ f(y) \leq f(x) + \langle \nabla f(x),\, y-x \rangle + \frac{L}{2} \norm{y-x}^2. $$

Questions?¶

$$ f(y) \leq f(x) + \langle \nabla f(x),\, y-x \rangle + \frac{L}{2} \norm{y-x}^2. $$

Since it holds for all $x,y$, we can plug in $w_t$ for $x$ and $w_{t+1}$ for $y$, and plug in the SGD update $$w_{t+1} = w_t - \alpha_t \nabla f(w_t;x_{(t)})$$ we have $$ \begin{align} f(w_{t+1}) &\leq f(w_t) + \langle \nabla f(w_t), \, w_{t+1} - w_t \rangle + \frac{L}{2} \norm{w_{t+1} - w_t}^2 \\ &= f(w_t) - \alpha_t \langle \nabla f(w_t), \, f(w_t;x_{(t)}) \rangle + \frac{\alpha_t^2 L}{2} \norm{\nabla f(w_t;x_{(t)})}^2 \\ \end{align} $$ But the inner product term can be positive or negative, so we are not guaranteed to decrease the value of $f$ after each step!

$$f(w_{t+1}) \le f(w_t) - \alpha_t \left \langle \nabla f(w_t), \, f(w_t;x_{(t)}) \right \rangle+ \frac{\alpha_t^2 L}{2} \| \nabla f(w_t; x_{(t)}) \|^2.$$

Although this isn't a descent method, but this is going to tend to be decaying in expectation. Let's take the expectation of both sides conditioned on $w_t$. The randomness here is over the random choice of $x_{(t)}$.

$$ \mathbf{E}[ f(w_{t+1}) \mid w_t] \le \mathbf{E}\left[ f(w_t) - \alpha_t \left \langle \nabla f(w_t), \, f(w_t;x_{(t)}) \right \rangle+ \frac{\alpha_t^2 L}{2} \| \nabla f(w_t; x_{(t)}) \|^2 \middle |\, w_t\right] $$

By linearity of expectation and by factoring out constants that don't depend on the random choice of $x_{(t)}$, $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) -\alpha_t \left \langle \mathbf{E}[ \nabla f(w_t; x_{(t)}) \mid w_t ],\, \nabla f(w_t) \right\rangle + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right].$$

$$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) -\alpha_t \left \langle \mathbf{E}[ \nabla f(w_t; x_{(t)}) \mid w_t ],\, \nabla f(w_t) \right\rangle + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right]$$

Now, the first conditional expectation is $$\mathbf{E}[ \nabla f(w_t; x_{(t)}) \mid w_t ] = \frac{1}{n} \sum_{x \in \mathcal{D}} \nabla f(w_t; x) = \nabla f(w_t)$$ Althought $w_t$ depends on the randomness in all previous iterations, because of the conditioning on $w_t$, the only randomness here is the selection of the example $x_{(t)}$, which is uniformly at random from the training set.

So we get $$ \begin{align} \mathbf{E}[ f(w_{t+1}) \mid w_t] &\le f(w_t) - \alpha_t \left\langle \nabla f(w_t),\, \nabla f(w_t) \right\rangle + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right] \\ &= f(w_t) - \alpha_t\norm{\nabla f(w_t)}^2 + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right] \end{align} $$

$$ \mathbf{E}[ f(w_{t+1}) \mid w_t] \leq f(w_t) - \alpha_t\norm{\nabla f(w_t)}^2 + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right] $$

Recall our bounded variance assumption of the stochastic gradients: for all $w$,

$$\mathbf{E}_x\left[ \| \nabla f(w; x) \|^2 \right] \le \| \nabla f(w) \|^2 + \sigma^2.$$

Applying this to the second expecation gives us $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \alpha_t \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 L}{2} \left( \| \nabla f(w_t) \|^2 + \sigma^2 \right).$$

This simplifies to $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \alpha_t \left( 1 - \frac{\alpha_t L}{2} \right) \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 \sigma^2 L}{2}.$$

$$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \alpha_t \left( 1 - \frac{\alpha_t L}{2} \right) \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 \sigma^2 L}{2}.$$

If we constrain the step size to be sufficiently small, $\alpha_t L \le 1$, then this can be further simplified to $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \frac{\alpha_t}{2} \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 \sigma^2 L}{2}.$$ Let's call this the progress bound of SGD

Finally, let's take the full expected value over all the randomness in the algorithm, not just this conditional expectation. By the law of total expectation, this yields $$ \begin{align} \mathbf{E}[ f(w_{t+1}) ] &= \mathbf{E}[ \mathbf{E}[ f(w_{t+1}) \mid w_t] ] \\ &\le \mathbf{E}[ f(w_t) ] - \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] + \frac{\alpha_t^2 \sigma^2 L}{2} \end{align}$$

Rearranging the above expression, we can get $$\frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le \mathbf{E}[ f(w_t) ] - \mathbf{E}[ f(w_{t+1}) ] + \frac{\alpha_t^2 \sigma^2 L}{2}.$$

Now, let's imagine that we run $K$ total iterations of SGD, and we sum both sides of this expression going from $t$ from $0$ to $K-1$. This yields $$\sum_{t=0}^{K-1} \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le \left( \sum_{t=0}^{K-1} \left( \mathbf{E}[ f(w_t) ] - \mathbf{E}[ f(w_{t+1}) ] \right) \right) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$

$$\sum_{t=0}^{K-1} \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le \left( \sum_{t=0}^{K-1} \left( \mathbf{E}[ f(w_t) ] - \mathbf{E}[ f(w_{t+1}) ] \right) \right) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$

Observe that this sum telescopes!

So we get $$ \begin{align} \sum_{t=0}^{K-1} \left( \mathbf{E}[ f(w_t) ] - \mathbf{E}[ f(w_{t+1}) ] \right) &= \mathbf{E}[ f(w_0) ] - \mathbf{E}[ f(w_K) ] \\ &= f(w_0) - \mathbf{E}[ f(w_K) ] \\ &\le f(w_0), \end{align}$$ where this last inequality follows from our assumption that the loss is non-negative. Thus, $$\sum_{t=0}^{K-1} \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$

$$\sum_{t=0}^{K-1} \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$

Now, let $$\rho_t = \frac{\alpha_t}{\sum_{k = 0}^{K-1} \alpha_t}.$$

Multiply by $1=\frac{\sum_{k = 0}^{K-1} \alpha_t}{\sum_{k = 0}^{K-1} \alpha_t}$ on the LHS, we have $$\left( \sum_{t=0}^{K-1} \frac{\alpha_t}{2} \right) \sum_{t=0}^{K-1} \rho_t \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$

$$\left( \sum_{t=0}^{K-1} \frac{\alpha_t}{2} \right) \sum_{t=0}^{K-1} \rho_t \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$

But if we let $\tau$ be a random varible with distribution given by $\rho$ (this is a distribution because it sums to $1$), then we can write this left side in expected-value form as $$\left( \sum_{t=0}^{K-1} \frac{\alpha_t}{2} \right) \cdot \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2,$$ where now the expected value is taken over both the algorthmic randomness and the choice of $\tau$. We can think of $w_{\tau}$ as the output of SGD run for a random number of steps.

Questions?¶

Constant Step Size¶

Here, we have $\alpha_t = \alpha \leq 1/L$

$$ \begin{align} \left( \sum_{t=0}^{K-1} \frac{\alpha_t}{2} \right) \cdot \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] &\le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2 \\ \frac{\alpha K}{2} \cdot \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] &\le f(w_0) + \frac{\alpha^2 \sigma^2 L K}{2} \end{align} $$

Multiplying both sides by $\frac{2}{\alpha K}$, $$\mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L$$

How should we interpret this?¶

$$\mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L$$

SGD with constant step size converges to a "noise ball" (a.k.a. "noise floor")! Gradient magnitude doesn't necessarily go to zero

Even if we run for a very large number of iterations, $$\lim_{K \rightarrow \infty} \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L = \alpha \sigma^2 L \ne 0.$$ For many applications this is fine...but it seems somehow lacking.

How should we interpret this?¶

$$\mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L$$ No description has been provided for this image

(Figure shamelessly stolen from Mark Schmidt)

Question: how can we make the noise ball smaller?¶

We can make this arbitrarily small¶

$$\mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L$$ No description has been provided for this image

(Figure shamelessly stolen from Mark Schmidt)

We can make this arbitrarily small¶

By choosing the step size constant as a function of the number of steps $K$ we plan to run. If we minize the RHS of $$\mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L$$ with respect to $\alpha$, we will get $$\alpha = \min\left( \sqrt{\frac{2 f(w_0)}{\sigma^2 L K}}, \frac{1}{L} \right)$$ where the $1/L$ came from the earlier requirement of $\alpha L \leq 1$ in order to get to this point (the progress bound).

Substitute this in, $$ \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le\frac{2 f(w_0)}{K} \max\left( \sqrt{\frac{\sigma^2 L K}{2 f(w_0)}}, L \right) + \sigma^2 L \min\left( \sqrt{\frac{2 f(w_0)}{\sigma^2 L K}}, \frac{1}{L} \right) $$

Then since $\min(x,y) \le x$ and $\max(x,y) \le x + y$, \begin{align*} \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] &\le \frac{2 f(w_0)}{K} \left( \sqrt{\frac{\sigma^2 L K}{2 f(w_0)}} + L \right) + \sigma^2 L \cdot \frac{2 f(w_0)}{\sigma^2 L K} \\&= \sqrt{\frac{2 f(w_0) \sigma^2 L}{K}} + \frac{2 L f(w_0)}{K} + \sqrt{\frac{2 f(w_0)\sigma^2 L}{K}} \\&= \sqrt{\frac{8 f(w_0)\sigma^2 L}{K}} + \frac{2 L f(w_0)}{K}. \end{align*} This is decreasing to zero as $K\to\infty$.

Under strong convexity, we can do better¶

Recall the progress bound of SGD $$\mathbf{E}[ f(w_{t+1}) ] \le \mathbf{E}[ f(w_t) ] - \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] + \frac{\alpha_t^2 \sigma^2 L}{2}.$$

If $f$ is $\mu$-strongly convex, we have that $\norm{\nabla f(x)}^2 \ge 2 \mu \left( f(x) - f^* \right)$, and so $$\mathbf{E}[ f(w_{t+1}) ] \le \mathbf{E}[ f(w_t) ] - \frac{\alpha_t}{2} \mathbf{E}[ 2 \mu \left( f(w_t) - f^* \right) ] + \frac{\alpha_t^2 \sigma^2 L}{2}$$

$$\mathbf{E}[ f(w_{t+1}) ] \le \mathbf{E}[ f(w_t) ] - \frac{\alpha_t}{2} \mathbf{E}[ 2 \mu \left( f(w_t) - f^* \right) ] + \frac{\alpha_t^2 \sigma^2 L}{2}$$

Subtracting the global minimum $f^*$ from both sides, $$\mathbf{E}[ f(w_{t+1}) - f^* ] \le \mathbf{E}[ f(w_t) - f^* ] - \alpha_t \mu \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha_t^2 \sigma^2 L}{2},$$ which simplifies to $$\mathbf{E}[ f(w_{t+1}) - f^* ] \le (1 - \alpha_t \mu) \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha_t^2 \sigma^2 L}{2}.$$

Again supposing a constant step size, $$\mathbf{E}[ f(w_{t+1}) - f^* ] \le (1 - \alpha \mu) \cdot \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha^2 \sigma^2 L}{2}.$$

Subtracting $\frac{\alpha \sigma^2 L}{2 \mu}$ from both sides, \begin{align*} \mathbf{E}\left[ f(w_{t+1}) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right] &\le (1 - \alpha \mu) \cdot \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha^2 \sigma^2 L}{2} - \frac{\alpha \sigma^2 L}{2 \mu} \\&\le (1 - \alpha \mu) \cdot \mathbf{E}\left[ f(w_t) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right]. \end{align*}

\begin{align*} \mathbf{E}\left[ f(w_{t+1}) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right]\le (1 - \alpha \mu) \cdot \mathbf{E}\left[ f(w_t) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right]. \end{align*}

We can now apply this recursively! Over $K$ total iterations, $$\begin{align} \mathbf{E}\left[ f(w_K) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right] &\le (1 - \alpha \mu)^K \cdot \left( f(w_0) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right) \\ &\le \exp(- \alpha \mu K) \cdot \left( f(w_0) - f^* \right) \end{align}$$ where we were able to drop the expected value on the right because $w_0$ is not a random variable.

Note that we again used the inequality $(1-x) \leq \exp(-x)$ for all $x$,

In [7]:

from labellines import labelLine, labelLines
x = numpy.linspace(-2,2,100); y = 1-x; z = numpy.exp(-x)
fig = pyplot.figure(figsize=(3,3)); ax = fig.add_subplot(111)

ax.plot(x,y, linewidth=3, label=r"$1-x$");ax.plot(x,z, linewidth=3, label=r"$\exp(-x)$");labelLines(ax.get_lines(), xvals=(4, -4), fontsize=9);

This yields a final bound of $$\mathbf{E}\left[ f(w_K) - f^* \right] \le \exp(- \alpha \mu K) \cdot \left( f(w_0) - f^* \right) + \frac{\alpha \sigma^2 L}{2 \mu}$$

$$\mathbf{E}\left[ f(w_K) - f^* \right] \le \exp(- \alpha \mu K) \cdot \left( f(w_0) - f^* \right) + \frac{\alpha \sigma^2 L}{2 \mu}$$

Again we have the same issue of convergence to a noise ball for constant $\alpha$. We can minimize this over $\alpha$ to pick a number-of-steps-dependent step size. $$0 = -\mu K \exp(- \alpha \mu K) \cdot \left( f(w_0) - f^* \right) + \frac{\sigma^2 L}{2 \mu} \rightarrow \alpha = \frac{1}{\mu K} \log\left( \frac{2 \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right).$$

This yields \begin{align*} \mathbf{E}\left[ f(w_K) - f^* \right] &\le \frac{\sigma^2 L}{2 \mu^2 \left( f(w_0) - f^* \right) K} \cdot \left( f(w_0) - f^* \right) + \frac{\sigma^2 L}{2 \mu} \cdot \frac{1}{\mu K} \log\left( \frac{2 \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right) \\&= \frac{\sigma^2 L}{2 \mu^2 K} + \frac{\sigma^2 L}{2 \mu^2 K} \log\left( \frac{2 \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right) \\&= \frac{\sigma^2 L}{2 \mu^2 K} \log\left( \frac{2e \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right). \end{align*} This is indeed approaching $0$ as $K$ becomes large!

Another setting in which the noise ball is small: when all the examples agree about where $w^*$ is.¶

In [ ]: