import numpy
import scipy
import matplotlib
from matplotlib import pyplot
import time
Consider the empirical risk minimization problem
$$\ell(w) = \frac{1}{n} \sum_{i=1}^n \left(\frac{1}{2} w^T A_i w + b_i^T w + \frac{1}{2} c_i \right)$$where $w \in \R^d$, $A_i \in \R^{d \times d}$ is a symmetric positive semidefinite matrix (meaning $A_i = A_i^T$ and $w^T A_i w \ge 0$ for all $w$), $b_i \in \R^d$, and $c_i \in \R$. That is, each example loss function is just some quadratic function.
- Typical example: linear regression where $A_i = x_i x_i^T$ and $b_i = -y_i x_i$ for an example-label pair $(x_i, y_i)$.
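To make this concrete, here's a quick numerical check (a minimal sketch with made-up data; the constant term works out to $c_i = y_i^2$) that the average of these per-example quadratics really is the usual squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))                      # examples x_i as rows
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Per-example quadratic pieces: (1/2) w^T A_i w + b_i^T w + (1/2) c_i
# matches (1/2) (x_i^T w - y_i)^2 when A_i = x_i x_i^T, b_i = -y_i x_i, c_i = y_i^2.
A_list = [np.outer(x, x) for x in X]
b_list = [-yi * xi for xi, yi in zip(X, y)]
c_list = [yi**2 for yi in y]

w = rng.standard_normal(d)
quad = np.mean([0.5 * w @ Ai @ w + bi @ w + 0.5 * ci
                for Ai, bi, ci in zip(A_list, b_list, c_list)])
sq_err = np.mean(0.5 * (X @ w - y)**2)
print(np.allclose(quad, sq_err))                     # True: the two forms agree
```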
If we let $A$, $b$, and $c$ denote
$$A = \frac{1}{n} \sum_{i=1}^n A_i \hspace{4em}b = \frac{1}{n} \sum_{i=1}^n b_i \hspace{4em}c = \frac{1}{n} \sum_{i=1}^n c_i$$then we can write the loss as
$$\ell(w) = \frac{1}{2} w^T A w + b^T w + \frac{1}{2} c.$$We'll assume here that $A$ is positive definite (which in particular means $A$ is invertible). The gradient is
$$\nabla \ell(w) = Aw + b$$(note: we used the fact that $A = A^T$ to derive this gradient!). The "solution" can now be obtained by setting this gradient to zero, yielding $w_* = -A^{-1} b$. Observe also that
$$\ell(w) = \frac{1}{2} (w - w_*)^T A (w - w_*) - \frac{1}{2} (w_*)^T A w_* + \frac{1}{2} c = \frac{1}{2} (w - w_*)^T A (w - w_*) + \ell_*,$$where $\ell_* = \ell(w_*)$ is the minimum value of the loss function. Here, gradient descent would do
$$w_{t+1} = w_t - \alpha (A w_t + b).$$If we subtract the "solution" $w_*$ from both sides, we get
$$w_{t+1} - w_* = w_t - w_* - \alpha (A w_t + b).$$If we observe that $b = -A w_*$,
$$w_{t+1} - w_* = w_t - w_* - \alpha (A w_t - A w_*) = (I - \alpha A) (w_t - w_*).$$Now applying this inductively, we conclude that after $t$ total steps of GD
$$w_t - w_* = (I - \alpha A)^t (w_0 - w_*),$$where here $(I - \alpha A)^t$ denotes the $t$th power of the matrix: this matrix multiplied with itself $t$ times. And of course the loss is $$\ell(w_t) = \frac{1}{2} (w_0 - w_*)^T (I - \alpha A)^t A (I - \alpha A)^t (w_0 - w_*) + \ell^* = \frac{1}{2} (w_0 - w_*)^T A (I - \alpha A)^{2t} (w_0 - w_*) + \ell^*$$ (where this last equality holds because all the matrices commute). Now if we let $u_1, \ldots, u_d \in \R^d$ be a full complement of eigenvectors of $A$ with corresponding eigenvalues $\lambda_1, \ldots, \lambda_d$, such that $A = \sum_{i=1}^d \lambda_i u_i u_i^T$, then $$\ell(w_t) = \ell^* + \frac{1}{2} \sum_{i=1}^d \lambda_i (1 - \alpha \lambda_i)^{2t} \left( u_i^T (w_0 - w_*) \right)^2 \approx \mathcal{O}\left( \left( \max_{i} \; |1 - \alpha \lambda_i| \right)^{2t} \right).$$
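Here's a small sanity check of this closed form (the dimension, step size, and iteration count below are arbitrary choices): run plain gradient descent and compare against $(I - \alpha A)^t (w_0 - w_*)$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)                    # symmetric positive definite
b = rng.standard_normal(d)
w_star = -np.linalg.solve(A, b)            # w_* = -A^{-1} b

alpha, T = 0.05, 50
w = w0 = rng.standard_normal(d)
for _ in range(T):
    w = w - alpha * (A @ w + b)            # gradient descent step

# closed form: w_T = (I - alpha A)^T (w_0 - w_*) + w_*
pred = np.linalg.matrix_power(np.eye(d) - alpha * A, T) @ (w0 - w_star) + w_star
print(np.allclose(w, pred))                # True
```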
How will this behave asymptotically? What does this suggest about how we should set the step size?
It's going to tend to be dominated by whatever the largest value of $|1 - \alpha \lambda_i|$ is! For this to converge as fast as possible, we want that maximum to be as small as possible.
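To see this visually, here's a quick illustrative plot (with a made-up set of eigenvalues) of $\max_i |1 - \alpha \lambda_i|$ as a function of $\alpha$; the dashed line marks the step size $2/(\lambda_{\max} + \lambda_{\min})$ that we're about to derive as the minimizer.

```python
import numpy as np
from matplotlib import pyplot

lambdas = np.array([0.5, 1.0, 3.0, 10.0])            # assumed eigenvalues of A
alphas = np.linspace(0.0, 0.25, 200)
rate = np.max(np.abs(1 - np.outer(alphas, lambdas)), axis=1)

pyplot.plot(alphas, rate)
pyplot.axvline(2 / (lambdas.max() + lambdas.min()), linestyle="--")
pyplot.xlabel("step size $\\alpha$")
pyplot.ylabel("$\\max_i |1 - \\alpha \\lambda_i|$")
pyplot.show()
```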
Let the largest eigenvalue of $A$ be called $\lambda_1 = L$ and the smallest one $\lambda_d = \mu$. Then
$$1 - \alpha L \le 1 - \alpha \lambda_i \le 1 - \alpha \mu.$$For the bound on the max of the absolute values of these terms to be minimized, we want
$$-(1 - \alpha L) = 1 - \alpha \mu$$which happens when we set
$$\alpha = \frac{2}{L + \mu}.$$In this case,
$$| 1 - \alpha \lambda_i | \le \frac{L - \mu}{L + \mu}$$and so
$$\ell(w_t) - \ell^* \approx \mathcal{O}\left( \left( \frac{L - \mu}{L + \mu} \right)^{2t} \right).$$If we define the condition number of this problem to be $\kappa = L / \mu$, then
$$\ell(w_t) - \ell^* \approx \mathcal{O}\left( \left( \frac{\kappa - 1}{\kappa + 1} \right)^{2t} \right).$$$$\ell(w_t) - \ell^* \approx \mathcal{O}\left( \left( 1 - \frac{2}{\kappa + 1} \right)^{2t} \right).$$$$\ell(w_t) - \ell^* \approx \mathcal{O}\left( \exp\left( - \frac{4t}{\kappa + 1} \right) \right).$$How many steps do we need to run to achieve $\ell(w_t) - \ell^* \le \epsilon$? About
$$t \approx \mathcal{O}\left( \kappa \log\left( \frac{1}{\epsilon} \right) \right).$$
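A rough empirical check of this scaling (the problem construction, dimension, and $\epsilon$ below are arbitrary): build diagonal quadratics with $\mu = 1$ and $L = \kappa$, run GD with $\alpha = 2/(L+\mu)$, and count iterations until the loss gap drops below $\epsilon$.

```python
import numpy as np

def gd_iters_to_eps(kappa, eps=1e-6, d=20, seed=0):
    # Diagonal quadratic with eigenvalues spread between mu = 1 and L = kappa.
    rng = np.random.default_rng(seed)
    lam = np.linspace(1.0, kappa, d)
    A = np.diag(lam)
    w_star = rng.standard_normal(d)
    b = -A @ w_star
    ell_star = 0.5 * w_star @ A @ w_star + b @ w_star    # minimum loss (taking c = 0)
    alpha = 2 / (lam.max() + lam.min())                  # optimal step size 2/(L + mu)
    w, t = np.zeros(d), 0
    while 0.5 * w @ A @ w + b @ w - ell_star > eps:
        w = w - alpha * (A @ w + b)
        t += 1
    return t

for kappa in [10, 100, 1000]:
    print(kappa, gd_iters_to_eps(kappa))   # iteration count grows roughly linearly in kappa
```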
Stochastic Gradient Descent
Say we have the same objective. Stochastic gradient descent will do
$$w_{t+1} = w_t - \alpha (\tilde A_t w_t + \tilde b_t)$$where $\tilde A_t$ and $\tilde b_t$ are from the example chosen at random to use to update the weights at step $t$. (Observe that $\mathbf{E}[\tilde A_t] = A$ and $\mathbf{E}[\tilde b_t] = b$.) Let's simplify even further and assume that all $A_i = A$ and that $d = 1$ (that is, we're optimizing a 1-dimensional problem). Under this assumption, SGD looks like
$$w_{t+1} = w_t - \alpha (A w_t + \tilde b_t) = w_t - \alpha (A w_t + b) - \alpha (\tilde b_t - b).$$This means that this is just the GD update step with an added $ - \alpha (\tilde b_t - b)$ at the end, and so (using the same simplification as we used for GD) we can derive that for SGD
$$w_{t+1} - w_* = (1 - \alpha A) (w_t - w_*) - \alpha (\tilde b_t - b).$$Here, the loss is just $\ell(w) = \frac{1}{2} A (w - w_*)^2 + \ell_*$, so we really just care about $(w - w_*)^2$. But observe that
$$\left(w_{t+1} - w_* \right)^2 = (1 - \alpha A)^2 (w_t - w_*)^2 - 2 \alpha (\tilde b_t - b) (1 - \alpha A) (w_t - w_*) + \alpha^2 (\tilde b_t - b)^2.$$Taking the expected value,
$$\Exv{ \left(w_{t+1} - w_* \right)^2 } = \Exv{ (1 - \alpha A)^2 (w_t - w_*)^2 } - 2 \Exv{ \alpha (\tilde b_t - b) (1 - \alpha A) (w_t - w_*) } + \alpha^2 \Exv{ (\tilde b_t - b)^2 }.$$Because $\Exv{\tilde b_t - b} = 0$ (since $\Exv{\tilde b_t} = b$), this becomes
$$\Exv{ \left(w_{t+1} - w_* \right)^2 } = (1 - \alpha A)^2 \Exv{ (w_t - w_*)^2 } + \alpha^2 \Exv{ (\tilde b_t - b)^2 }.$$Now, let's define
$$\sigma^2 = \Exv{ (\tilde b_t - b)^2 }.$$This represents the noisiness of our dataset: how much the examples we have tend to "disagree" from each other as to what the gradient "should" be. From this, we get
$$\Exv{ \left(w_{t+1} - w_* \right)^2 } = (1 - \alpha A)^2 \Exv{ (w_t - w_*)^2 } + \alpha^2 \sigma^2.$$And if we let $\rho_t = \Exv{ (w_t - w_*)^2 }$ this simplifies to
$$\rho_{t+1} = (1 - \alpha A)^2 \rho_t + \alpha^2 \sigma^2.$$This recurrence has fixed point $\rho = \frac{\alpha \sigma^2}{2 A - \alpha A^2}$ (solve $\rho = (1 - \alpha A)^2 \rho + \alpha^2 \sigma^2$ for $\rho$). Subtracting this fixed point from both sides yields
$$\rho_{t+1} - \frac{\alpha}{2 A - \alpha A^2} \sigma^2 = (1 - \alpha A)^2 \rho_t - \frac{\alpha \sigma^2}{2 A - \alpha A^2} + \alpha^2 \sigma^2 = (1 - \alpha A)^2 \left( \rho_t - \frac{\alpha \sigma^2}{2 A - \alpha A^2} \right).$$It follows that, applying this recursively,
$$\rho_t = (1 - \alpha A)^{2t} \left( \rho_0 - \frac{\alpha \sigma^2}{2 A - \alpha A^2} \right) + \frac{\alpha \sigma^2}{2 A - \alpha A^2}.$$For sufficiently small $\alpha$, this means the loss (which is just a multiple of $(w_t - w_*)^2$ plus the minimum loss) scales like
$$\Exv{ \ell(w) - \ell^* } \approx \mathcal{O}\left( (1 - \alpha A)^{2t} \rho_0 + \frac{\alpha \sigma^2}{2 A} \right).$$$$\Exv{ \ell(w) - \ell^* } \approx \mathcal{O}\left( \exp(- 2 \alpha A t) \rho_0 + \frac{\alpha \sigma^2}{2 A} \right).$$Observe that one of these terms goes down with $t$ and another doesn't! This is the noise ball phenomenon.
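Here's a small 1-D simulation of the noise ball (all constants are made up for illustration): with a constant step size, the empirical $\Exv{(w_t - w_*)^2}$ decays at first and then plateaus at the predicted level $\frac{\alpha \sigma^2}{2A - \alpha A^2} \approx \frac{\alpha \sigma^2}{2A}$.

```python
import numpy as np

rng = np.random.default_rng(2)
A, w_star, sigma = 2.0, 1.5, 1.0
alpha, T, runs = 0.05, 2000, 500

w = np.zeros(runs)                                # many independent SGD runs, all from w_0 = 0
for _ in range(T):
    b_noise = sigma * rng.standard_normal(runs)   # tilde b_t - b: mean zero, variance sigma^2
    w = w - alpha * A * (w - w_star) - alpha * b_noise

rho_T = np.mean((w - w_star) ** 2)                # empirical E[(w_T - w_*)^2]
print(rho_T, alpha * sigma**2 / (2 * A - alpha * A**2))   # both sit near the noise-ball level
```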
What if we want to make this loss gap less than $\epsilon$? Well, one way that (roughly) suffices is to make each of these two terms less than $\epsilon/2$.
Setting $$\alpha = \frac{\epsilon A}{\sigma^2}$$makes the second (noise ball) term at most $\epsilon/2$, and then
$$t = \frac{1}{2 \alpha A} \log\left( \frac{2 \rho_0}{\epsilon} \right) = \frac{\sigma^2}{2 \epsilon A^2} \log\left( \frac{2 \rho_0}{\epsilon} \right)$$suffices for the first, so we get
$$t \approx \tilde{\mathcal{O}}\left( \frac{\sigma^2}{\epsilon A^2}\right),$$where the $\tilde{\mathcal{O}}$ notation hides log terms. (We can make this rigorous, but we're just building intuition here!) Compare gradient descent with
$$t \approx \mathcal{O}\left( \kappa \log\left( \frac{1}{\epsilon} \right) \right).$$
What's the tradeoff?
Gradient descent: better scaling with $\epsilon$ when we want very accurate solutions.
Stochastic gradient descent: each example costs less.
This is an example of a systems-cost vs. statistical-cost tradeoff. The first of many we'll see in this class!
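To get a feel for the systems side (sizes below are arbitrary, and this is only a rough timing sketch): one full-gradient step touches all $n$ examples, while one SGD step touches just one.

```python
import numpy as np
import time

rng = np.random.default_rng(3)
n, d = 50_000, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = np.zeros(d)
alpha = 0.01

start = time.time()
grad = X.T @ (X @ w - y) / n               # full gradient: touches all n examples
w_gd = w - alpha * grad
print("one GD step: ", time.time() - start, "s")

start = time.time()
i = rng.integers(n)
grad_i = X[i] * (X[i] @ w - y[i])          # stochastic gradient: touches one example
w_sgd = w - alpha * grad_i
print("one SGD step:", time.time() - start, "s")
```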
Minibatching
Very simple idea. Instead of using just one example, use many examples in each step.
Actually, it's quite easy to analyze! Everything is basically the same as for single-example SGD, up through
$$\Exv{ \left(w_{t+1} - w_* \right)^2 } = (1 - \alpha A)^2 \Exv{ (w_t - w_*)^2 } + \alpha^2 \Exv{ (\tilde b_t - b)^2 }.$$except with minibatching, $\tilde b_t$ is the average of $B$ randomly chosen examples ($B$ is the batch size), instead of one example. Because this is an average of $B$ independent and identically distributed random variables, its variance is $1/B$ of the variance of one, i.e.
$$\Exv{ (\tilde b_t - b)^2 } = \frac{\sigma^2}{B}.$$All the rest of the analysis goes through identically, with $\sigma^2$ replaced by $\sigma^2 / B$. So the final number of steps needed becomes
$$t \approx \tilde{\mathcal{O}}\left( \frac{\sigma^2}{\epsilon A^2 B}\right).$$What is the effect on the cost of each iteration?
What is the effect on the number of iterations we need to run to achieve a target error?
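Here's a sketch (extending the 1-D noise-ball simulation above, again with made-up constants) showing the noise-ball level shrinking like $1/B$ as the batch size grows, while each iteration now costs roughly $B$ times as much gradient computation.

```python
import numpy as np

rng = np.random.default_rng(4)
A, w_star, sigma = 2.0, 1.5, 1.0
alpha, T, runs = 0.05, 2000, 500

for B in [1, 4, 16, 64]:
    w = np.zeros(runs)
    for _ in range(T):
        # minibatch gradient noise: average of B i.i.d. terms has variance sigma^2 / B
        b_noise = sigma * rng.standard_normal((runs, B)).mean(axis=1)
        w = w - alpha * A * (w - w_star) - alpha * b_noise
    # empirical E[(w_T - w_*)^2] vs. predicted noise-ball level with sigma^2 / B
    print(B, np.mean((w - w_star) ** 2), alpha * sigma**2 / (B * (2 * A - alpha * A**2)))
```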