Lecture 10: Accelerating SGD with Averaging and Variance Reduction¶

CS4787 — Principles of Large-Scale Machine Learning Systems¶

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$ $\newcommand{\Exv}[1]{\mathbf{E}\left[#1\right]}$ $\newcommand{\Prob}[1]{\mathbf{P}\left(#1\right)}$ $\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}$ $\newcommand{\Abs}[1]{\left|#1\right|}$

import numpy
import scipy
import matplotlib
from matplotlib import pyplot
import time

matplotlib.rcParams.update({'font.size': 14, 'figure.figsize': (10.0, 8.0)})

Announcements¶

Prelim is graded. The grades are up on Gradescope.

Third pset with released tonight.

Recall¶

Over the past couple of lectures, we've been talking about how to accelerate gradient-based training. Specifically, we've been focusing on the condition number $\kappa$ as a source of difficulty in solving large-scale optimization problems from empirical risk minimization.|

The number of steps of gradient descent required to guarantee a certain objective gap $\epsilon$ is proportional to the condition number, and so if an objective has a widely varying range of curvatures, it will take more time to solve.

However, we also said that we usually want to run SGD instead of GD to remove the dependence on $n$, the size of the dataset. And there was another significant term that affects the convergence of stochastic gradient descent: the variance of the noise $\sigma^2$.

The further our stochastic gradient samples are from the "true" full gradient, the larger our "noise ball" will be.
So, to achieve a target objective gap $\epsilon$, the larger $\sigma^2$ is, the smaller we'll need to set the fixed step size $\alpha$...and the more iterations we'll need to run.

We saw one way to address this when we looked at diminishing step size schemes. Today we'll look at a few more methods of doing this in more detail: increasing minibatch size schemes, averaging the iterates, and variance reduction.

A Warm-up: Reducing the dependence on the variance by increasing the minibatch size¶

Recall that, in the non-strongly-convex case, we derived the convergence rate for minibatch with-replacement SGD

$$\frac{1}{T} \sum_{t = 0}^{T - 1} \Exv{ \norm{ \nabla f(w_t) }^2 } \le \underbrace{\frac{ 2 \left( f(w_0) - f^* \right) }{\alpha T}}_{\text{like gradient descent}} + \underbrace{\frac{\alpha \sigma^2 L}{B}}_{\text{"noise ball" term}}.$$

We saw a few lectures ago that we could achieve asymptotic convergence (rather than convergence to a noise ball) by decreasing the step size as iterations increase. The motivation: we can decrease the noise ball term to 0 as we train by decreasing $\alpha$.

Another way we can decrease this noise ball term is to increase the batch size $B$. Let's look at what happens when we do that.

Recall: The assumptions we've been using¶

Assume the loss function is $L$-smooth ($\norm{\nabla f(w) - \nabla f(v)} \le L \cdot \norm{w - v}$, or equivalently for twice-differentiable functions $\Abs{ \frac{\partial^2}{\partial \eta^2} f(w + \eta v) } \le L \norm{v}^2$, i.e. the curvature in any direction is no larger in magnitude than $L$), and that the stochastic gradient variance at any $w$ is bounded by

$$\Exv{ \norm{ \nabla f_i(w) - \nabla f(w) }^2 } = \frac{1}{n} \sum_{i=1}^n \norm{ \nabla f_i(w) - \nabla f(w) }^2 \le \sigma^2.$$

Also suppose that the step size satisfies $\alpha \le 1/L$ (we used this in some of our proofs to simplify expressions by hiding $\alpha^2$ terms) and that $f$ actually has a global minimum $f^*$.

Recall that, from the analysis we did in Lecture 5 of the loss-at-the-next-timestep, we got

$$\Exv{ f(w_{t+1}) } \le \Exv{ f(w_t) } - \frac{\alpha}{2} \Exv{ \norm{ \nabla f(w_t) }^2 } + \frac{\alpha^2 \sigma^2 L}{2}.$$

Now, this is just for plain SGD (with no minibatching). If we used a with-replacement minibatch of size $B_t$ at time $t$, what would we get here instead? (Hint: think about how changing the minibatch size affects the variance of the stochastic gradient estimator.)

Using a with-replacement minibatch of size $B_t$ will decrease the variance by a factor of $B_t$. So we'd get:

$$\Exv{ f(w_{t+1}) } \le \Exv{ f(w_t) } - \frac{\alpha}{2} \Exv{ \norm{ \nabla f(w_t) }^2 } + \frac{\alpha^2 \sigma^2 L}{2 B_t}.$$

Continuing the analysis¶

Starting from

$$\Exv{ f(w_{t+1}) } \le \Exv{ f(w_t) } - \frac{\alpha}{2} \Exv{ \norm{ \nabla f(w_t) }^2 } + \frac{\alpha^2 \sigma^2 L}{2 B_t}.$$

We can move the third term to the left side and the first term to the right, and then sum up over $T$ iterations to get

$$\frac{\alpha}{2} \sum_{t=0}^{T-1} \Exv{ \norm{ \nabla f(w_t) }^2 } \le \sum_{t=0}^{T-1} \left( \Exv{ f(w_t) } - \Exv{ f(w_{t+1}) } + \frac{\alpha^2 \sigma^2 L}{2 B_t} \right).$$

Most of this sum telescopes, so it simplifies to

\begin{align*}\frac{\alpha}{2} \sum_{t=0}^{T-1} \Exv{ \norm{ \nabla f(w_t) }^2 } &\le \Exv{ f(w_0) } - \Exv{ f(w_T) } + \frac{\alpha^2 \sigma^2 L}{2} \sum_{t=0}^{T-1} \frac{1}{B_t} \\&\le f(w_0) - f^* + \frac{\alpha^2 \sigma^2 L}{2} \sum_{t=0}^{T-1} \frac{1}{B_t},\end{align*}

where this last line follows since we let $f^* = \min_w f(w)$, so for any $w$, $-f(w) \le -f^*$.

Dividing both sides by $\alpha T / 2$ produces

$$\frac{1}{T} \sum_{t=0}^{T-1} \Exv{ \norm{ \nabla f(w_t) }^2 } \le \frac{2 ( f(w_0) - f^* ) }{\alpha T} + \frac{\alpha \sigma^2 L}{T} \sum_{t=0}^{T-1} \frac{1}{B_t}.$$

What does this tell us about how we need to set the batch size scheme $B_t$ for the left side of this inequality to be guaranteed to converge to 0?

Note that in practice if we're doing this sort of increasing batch-size scheme, we'd always want to do sampling without replacement, to take advantage of its lower variance as the batch size approaches the dataset size.

Questions?

Polyak–Ruppert Averaging¶

Intuition: SGD is converging to a "noise ball" where the iterates are randomly jumping around some space surrounding the optimum. We can think about these iterates as random samples that approximate the optimum.

What can we do when we have a bunch of random samples that approximate something to improve the precision of our estimate?

Technique: run regular SGD and just average the iterates. That is,

\begin{align*} w_{t+1} &= w_t - \alpha_t \nabla f_{\tilde i_t}(w_t) \\ \bar w_{t+1} &= \frac{t}{t + 1} \cdot \bar w_t + \frac{1}{t+1} \cdot w_{t+1}; \end{align*}

eventually we output the average $\bar w_T$ at the end of execution. This is equivalent to writing

$$\bar w_T = \frac{1}{T} \sum_{t=1}^T w_t.$$

The main idea here is that we're averaging out a bunch of iterations $w_t$ that are all in the noise ball. When you average out a bunch of noisy things, often you are able to reduce the noise.

Polyak Averaging: Intuition¶

We could prove Polyak averaging works for general convex problems. But that would be kinda involved, and doesn't deliver much intuition for why it makes sense. So, to maximize the relatability, we're going to go back to our old friend the one-dimensional quadratic, to get a sense of why this works. Suppose we have some dataset $u_1, \ldots, u_n \in \R$, and define the loss

$$f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} (w + u_i)^2.$$

Here, $\nabla f_i(w) = w + u_i$. For simplicity, suppose that

$$\frac{1}{n} \sum_{i=1}^n u_i = 0 \;\;\text{ and }\;\; \frac{1}{n} \sum_{i=1}^n u_i^2 = \sigma^2,$$

so that

$$f(w) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} (w^2 + 2 u_i w + u_i^2) = \frac{w^2 + \sigma^2}{2}.$$

Suppose that we run SGD on this with a constant learning rate $0 < \alpha \le 1$. Then,

\begin{align*}w_{t+1} &= w_t - \alpha \nabla f_{i_t}(w_t) \\&= w_t - \alpha w_t - \alpha u_{i_t} \\&= (1 - \alpha) w_t - \alpha u_{i_t}.\end{align*}

Applying this recursively, we get

\begin{align*}w_{t+1} &= (1 - \alpha) w_t - \alpha u_{i_t} \\&= (1 - \alpha)^2 w_{t-1} - \alpha (1 - \alpha) u_{i_{t-1}} - \alpha u_{i_t} \\&= (1 - \alpha)^3 w_{t-2} - \alpha (1 - \alpha)^2 u_{i_{t-2}} - \alpha (1 - \alpha) u_{i_{t-1}} - \alpha u_{i_t}, \end{align*}

so

$$w_T = \underbrace{(1 - \alpha)^T w_0}_{\text{GD-like term}} - \underbrace{\alpha \sum_{t = 0}^{T - 1} (1 - \alpha)^{T - 1 - t} u_{i_t}}_{\text{SGD noise term}}.$$

To tell how basic SGD compares with averaged SGD, we want to compare the expected loss-gap $f - f^*$ evaluated at both these points. For regular SGD:

\begin{align*} \Exv{f(w_T)} - f^* &= \frac{1}{2} \Exv{w_T^2} + \frac{\sigma^2}{2} - \frac{\sigma^2}{2} \\ &= \frac{1}{2} \Exv{ \left( (1 - \alpha)^T w_0 - \alpha \sum_{t = 0}^{T - 1} (1 - \alpha)^{T - 1 - t} u_{i_t} \right)^2 }\\ &= \frac{1}{2} \Var{ (1 - \alpha)^T w_0 - \alpha \sum_{t = 0}^{T - 1} (1 - \alpha)^{T - 1 - t} u_{i_t} } + \frac{1}{2} \Exv{ (1 - \alpha)^T w_0 - \alpha \sum_{t = 0}^{T - 1} (1 - \alpha)^{T - 1 - t} u_{i_t} }^2. \end{align*}

Since we assumed that $\Exv{u_{i_t}} = \frac{1}{n} \sum_{i=1}^n u_i = 0$, and since $w_0$ is a constant (not a random variable) and adding a constant to something doesn't change its variance, this reduces to

\begin{align*} \Exv{f(w_T)} - f^* &= \frac{1}{2} \Var{ - \alpha \sum_{t = 0}^{T - 1} (1 - \alpha)^{T - 1 - t} u_{i_t} } + \frac{1}{2} \Exv{ (1 - \alpha)^T w_0 }^2 \\ &= \frac{\alpha^2}{2} \sum_{t = 0}^{T - 1} (1 - \alpha)^{2(T - 1 - t)} \Var{ u_{i_t} } + \frac{1}{2} (1 - \alpha)^{2T} w_0^2 \\ &= \frac{\alpha^2}{2} \sum_{t = 0}^{T - 1} (1 - \alpha)^{2(T - 1 - t)} \sigma^2 + \frac{1}{2} (1 - \alpha)^{2T} w_0^2. \end{align*}

Next, if we let $k = T - 1 - t$, then summing from $t = 0$ to $T - 1$ is equivalent to summing from $k = 0$ to $T - 1$, so

\begin{align*} \Exv{f(w_T)} - f^* &= \frac{\alpha^2}{2} \sum_{k = 0}^{T - 1} (1 - \alpha)^{2k} \sigma^2 + \frac{1}{2} (1 - \alpha)^{2T} w_0^2 \\&= \frac{\alpha^2 \sigma^2}{2} \sum_{k = 0}^{T - 1} (1 - 2\alpha + \alpha^2)^{k} + \frac{1}{2} (1 - \alpha)^{2T} w_0^2 \\&= \frac{\alpha^2 \sigma^2}{2} \cdot \frac{1 - (1 - 2\alpha + \alpha^2)^T}{1 - (1 - 2\alpha + \alpha^2)} + \frac{1}{2} (1 - \alpha)^{2T} w_0^2 \\&= \frac{\alpha \sigma^2}{2 (2 - \alpha)} \left( 1 - (1 - 2\alpha + \alpha^2)^T \right) + \frac{1}{2} (1 - \alpha)^{2T} w_0^2. \end{align*}

Observe that in the limit as $T$ approaches infinity...

\begin{align*} \lim_{T \rightarrow \infty} \; \Exv{f(w_T)} - f^* &= \lim_{T \rightarrow \infty} \; \left( \frac{\alpha \sigma^2}{2 (2 - \alpha)} \left( 1 - (1 - 2\alpha + \alpha^2)^T \right) + \frac{1}{2} (1 - \alpha)^{2T} w_0^2 \right) \\ &= \frac{\alpha \sigma^2}{2 (2 - \alpha)}. \end{align*}

So this converges to a noise ball...which is what we expect for SGD.

Now, let's do the same analysis for the Polyak-averaged updates! With averaging,

\begin{align*} \bar w_T &= \frac{1}{T} \sum_{k = 0}^{T - 1} \left( (1 - \alpha)^k w_0 - \alpha \sum_{t = 0}^{k - 1} (1 - \alpha)^{k - 1 - t} u_{i_t} \right) \\&= \frac{1}{T} \left( \sum_{k = 0}^{T - 1} (1 - \alpha)^k w_0 - \alpha \sum_{k = 0}^{T - 1} \sum_{t = 0}^{k - 1} (1 - \alpha)^{k - 1 - t} u_{i_t} \right). \end{align*}

The pair of sums in the second term here results in the value $(1 - \alpha)^{k - 1 - t} u_{\tilde i_t}$ being summed up for all pairs $(k, t)$ that satisfy

$$0 \le k < T \;\;\text{ and }\;\; 0 \le t < k.$$

This condition is equivalent to $0 \le t < k < T$, and so we could also write it as

$$0 \le t < T - 1 \;\;\text{ and }\;\; t < k < T.$$

That is, these inequalities correspond to the same ordered pairs of $(k,t)$. As a result, we can re-order our summations above into \begin{align*} \bar w_T &= \frac{1}{T} \left( \sum_{k = 0}^{T - 1} (1 - \alpha)^k w_0 - \alpha \sum_{t = 0}^{T - 2} \sum_{k = t+1}^{T - 1} (1 - \alpha)^{k - 1 - t} u_{i_t} \right), \end{align*}

Next, if we substitute $l = k - 1 - t$ in the inner sum, then \begin{align*} \bar w_T &= \frac{1}{T} \left( \sum_{k = 0}^{T - 1} (1 - \alpha)^k w_0 - \alpha \sum_{t = 0}^{T - 2} \sum_{l = 0}^{T - t - 2} (1 - \alpha)^l u_{i_t} \right) \\ &= \frac{1}{T} \left( \frac{1 - (1 - \alpha)^T}{1 - (1 - \alpha)} \cdot w_0 - \alpha \sum_{t = 0}^{T - 2} \frac{1 - (1 - \alpha)^{T - t - 1}}{1 - (1 - \alpha)} \cdot u_{i_t} \right) \\ &= \frac{1}{T} \left( \frac{1 - (1 - \alpha)^T}{\alpha} \cdot w_0 - \sum_{t = 0}^{T - 2} \left( 1 - (1 - \alpha)^{T - t - 1} \right) \cdot u_{i_t} \right). \end{align*}

If we compute the expected loss gap of $f$ evaluated at this point, then as before,

\begin{align*} \Exv{ f(\bar w_T) } - f^* &= \frac{1}{2} \Exv{\bar w_T^2} = \frac{1}{2 T^2} \left( \left( \frac{1 - (1 - \alpha)^T}{\alpha} \right)^2 w_0^2 + \sum_{t = 0}^{T - 2} \left( 1 - (1 - \alpha)^{T - t - 1} \right)^2 \cdot \sigma^2 \right). \end{align*}

Now, since $1 - (1 - \alpha)^K \le 1$ and , this can be bounded by

\begin{align*} \Exv{ f(\bar w_T) } - f^* &\le \frac{1}{2 T^2} \left( \left( \frac{1}{\alpha} \right)^2 w_0^2 + \sum_{t = 0}^{T - 2} \sigma^2 \right) \\&= \frac{w_0^2}{2 \alpha^2 T^2} + \frac{\sigma^2 (T - 1)}{2T^2} \\&\le \frac{f(w_0) - f^*}{\alpha^2 T^2} + \frac{\sigma^2}{2T}. \end{align*}

Now, let's compare our rates!¶

For plain SGD, we had

$$\lim_{T \rightarrow \infty} \; \Exv{f(w_T)} - f^* = \frac{\alpha \sigma^2}{2 (2 - \alpha)}.$$

But for Polyak-averaged SGD, we had

$$\Exv{ f(\bar w_T) } - f^* \le \frac{f(w_0) - f^*}{\alpha^2 T^2} + \frac{\sigma^2}{2T}.$$

The averaged version is asymptotically going to 0 as $T$ increases!

That is, even with a constant learning rate $\alpha$, we can get loss values that are arbitrarily close to the actual optimal loss.
- Unfortunately, this is not guaranteed to happen for general convex losses, and only happens here because our loss is a quadratic function.
- But this does hint at an improvement from averaging that depends on how "similar" our loss function is to a quadratic.
On the other hand, the term that depends on the initialization $w_0$ (the first term) is no longer decaying exponentially.

Variants of Polyak Averaging¶

One issue with the simple setup we've used above: we are averaging with equal weight iterates from the very start of training, when we have not reached the noise ball.
To address this, we often run averaged SGD by first using an initial warm-up period during which we do not average, and then only starting to average after the warm-up period is over.
- Long warm-up periods (such as half of the total number of epochs ran) usually produce better results in practice than averaged SGD without any warmup.

For non-convex problems, we still can use averaging, but we run into the same issue that we ran into with AdaGrad: as we move through a non-convex landscape, we may get very far away from the parameter values we were at during previous iterations.

If this happens, averaging together with those parameter values doesn't really make sense, and can hurt the performance of our algorithm.

So, instead, a standard approach is to use an exponential moving average, just like was done to transform AdaGrad into RMSProp. That is, we pick some decay factor $0 < \rho < 1$ and run

\begin{align*} w_{t+1} &= w_t - \alpha_t \nabla f_{i_t}(w_t) \\ \bar w_{t+1} &= \rho \cdot \bar w_t + (1 - \rho) \cdot w_{t+1}. \end{align*}

This running average approach often outperforms other averaging methods in non-convex settings.

What is the computational overhead of running Polyak averaging? What is the memory overhead?

extra compute: O(d) per iteration, O(Td) total

extra memory: O(d)

Demo¶

Simple synthetic logistic regression demo.

# generate a naive bayes model
d = 10;
N = 10000;
wtrue = numpy.random.randn(d);
X = numpy.random.randn(N, d);
y = (1.0 / (1.0 + numpy.exp(-X @ wtrue)) >= numpy.random.rand(N)) * 2.0 - 1.0;

def grad(w, xi, yi):
    return -xi * yi / (1.0 + numpy.exp(yi * numpy.dot(xi, w)))

def sgd(w0, alpha, X, y, niters, wopt, warmup_steps):
    w = w0
    wsum = numpy.zeros_like(w0)
    wavg = numpy.zeros_like(w0)
    (N, d) = X.shape
    sgd_dist_to_optimum = numpy.zeros(niters)
    avg_dist_to_optimum = numpy.zeros(niters)
    for k in range(niters):
        i = numpy.random.randint(N)
        xi = X[i,:]
        yi = y[i]
        w = w - alpha * grad(w, xi, yi);
        sgd_dist_to_optimum[k] = numpy.linalg.norm(w - wopt);
        if k >= warmup_steps:
            wsum = wsum + w
            wavg = wsum / (k+1-warmup_steps)
            avg_dist_to_optimum[k] = numpy.linalg.norm(wavg - wopt);
        else:
            avg_dist_to_optimum[k] = numpy.linalg.norm(w0 - wopt);
    return (w, wavg, sgd_dist_to_optimum, avg_dist_to_optimum);

def svrg(w0, alpha, X, y, niters, nepochs, wopt):
    w = w0
    (N, d) = X.shape
    dist_to_optimum = numpy.zeros(niters * nepochs)
    for epi in range(nepochs):
        what = w;
        ghat = numpy.zeros(d);
        for i in range(N):
            xi = X[i,:];
            yi = y[i];
            ghat += grad(what, xi, yi);
        ghat /= N;
        for k in range(niters):
            i = numpy.random.randint(N)
            xi = X[i,:];
            yi = y[i];
            w = w - alpha * (grad(w, xi, yi) - grad(what, xi, yi) + ghat);
            dist_to_optimum[k + epi*niters] = numpy.linalg.norm(w - wopt);
    return (w, dist_to_optimum);

# pick some initial random weights
w0 = numpy.random.randn(d);

# solve the problem to a high degree of accuracy (by running a yet-to-be-seen method)
(wopt, _) = svrg(w0, 0.02, X, y, 10000, 100, wtrue)

# run SGD and Polyak averaging
(w, wavg, sgd_dist_to_optimum, avg_dist_to_optimum) = sgd(w0, 0.005, X, y, 25000, wopt, 10000)

pyplot.plot(sgd_dist_to_optimum, label="SGD")
pyplot.plot(avg_dist_to_optimum, label="Polyak averaging")
pyplot.legend()
pyplot.xlabel("iterations")
pyplot.ylabel("distance to optimum")
pyplot.ylim((-0.1,1.0))

(-0.1, 1.0)

Questions?

Variance Reduction¶

Main idea: combine gradient descent and SGD by using infrequent full-batch gradients to reduce the variance of SGD.

Premise: suppose that for some fixed parameter vector $\hat w \in \R^d$, I already know the full batch gradient $\nabla f(\hat w)$. How might I use this to reduce the variance of the gradient samples I use in SGD?

Observe that the expression $\nabla f_i(\hat w) - \nabla f(\hat w)$ has expected value 0.

This means that if we subtract this from our gradient samples, the resulting samples will still have the same expected value.

Concretely: use the update

$$w_{t+1} = w_t - \alpha \left( \nabla f_i(w_t) - \nabla f_i(\hat w) + \nabla f(\hat w) \right).$$

What can we say about the variance of this stochastic gradient sample? To bound it, we'll strengthen our assumptions a bit. Assume that the component loss functions $f_i$ are all $L$-smooth (rather than just the total loss $f$). Then, using the fact that $\norm{x + y}^2 \le 2 \norm{x}^2 + 2 \norm{y}^2$,

\begin{align*} &\hspace{-2em}\Exv{ \norm{\nabla f_i(w_t) - \nabla f_i(\hat w) + \nabla f(\hat w) - \nabla f(w_t)}^2 } \\&\le \Exv{ 2 \norm{\nabla f_i(w_t) - \nabla f_i(\hat w) }^2 + 2 \norm{ \nabla f(\hat w) - \nabla f(w_t)}^2 } \\&\le \Exv{ 2L \cdot \norm{w_t - \hat w}^2 + 2L \norm{ \hat w - w_t }^2 } \\&\le 4L \cdot \norm{w_t - \hat w}^2. \end{align*}

That is, the variance is proportional to the square of the distance to $\hat w$.

SVRG¶

Intuition: since the variance is small proportional to the distance between $w_t$ and $\hat w$, as long as we can make these closer as the algorithm runs, we'll decrease the variance to zero. One way we can ensure this is by computing new full-batch gradients at updated values of $\hat w$ periodically. If we choose to compute the full-batch gradients at the most recent weights we've computed, we get the stochastic variance reduced gradient algorithm, or SVRG.

Procedure SVRG.

Input: Full-gradient period $m$, initial weights $\hat w_0$.

Loop for $s \in \{0,1,\ldots\}$

Compute full gradient $\hat g_s \leftarrow \frac{1}{n} \sum_{i=1}^n \nabla f_i(\hat w_s) = \nabla f(\hat w_s)$
$w_{s,0} \leftarrow \hat w_s$
Loop for $t \in \{0,1,\ldots,m-1\}$
- Randomly pick $i \in \{1,\ldots,n\}$ uniformly
- Update $w_{s,t+1} \leftarrow w_{s,t} - \alpha \left( \nabla f_i(w_{s,t}) - \nabla f_i(\hat w_s) + g_s \right)$.
set $\hat w_{s+1} \leftarrow w_{s,m}$ in practice
- but, to prove theory, set $\hat w_{s+1} \leftarrow w_{s,t}$ for uniformly randomly chosen $t \in \{0,1,\ldots,m-1\}$

What is the computational cost of running $K$ total outer-loop iterations of SVRG? $O(K(n+m))$

...

Convergence rates of SVRG¶

Actually not too hard to prove, but it's a bit involved so we won't do it here (the proof is in the SVRG paper, if you're curious). But the resulting rate for $L$-smooth (for all $f_i$), $\mu$-strongly-convex (just needs to hold for $f$) objectives is:

$$\Exv{f(\hat w_s) - f^*} \le \left( \frac{1}{\mu \alpha (1 - 2 L \alpha) m} + \frac{2 L \alpha}{1 - 2 L \alpha} \right)^s \cdot \left( f(\hat w_0) - f^* \right).$$

Let's try to make this a little more interpretable by picking specific values for our hyperparameters $\alpha$ and $m$ such that

$$\frac{1}{\mu \alpha (1 - 2 L \alpha) m} \le \frac{2 L \alpha}{1 - 2 L \alpha} = \frac{1}{4}$$

so that the whole term will sum to $1/2$. It's fairly easy to see that this happens when

$$\alpha = \frac{1}{10 L} \;\;\text{and}\;\; m \ge \frac{50 L}{\mu} = 50 \kappa.$$

In this case, we'll have

$$\Exv{f(\hat w_s) - f^*} \le 2^{-s} \cdot \left( f(\hat w_0) - f^* \right).$$

So, in order to achieve a bound of $\Exv{f(\hat w_K) - f^*} \le \epsilon$, it suffices to run for $K$ total outer loop iterations, where

$$K \ge \log_2\left( \frac{f(\hat w_0) - f^*}{\epsilon} \right).$$

What is the big-$\mathcal{O}$ computational cost of running enough total iterations of SVRG to achieve a loss-gap guarantee of $\epsilon$ with these hyperparameter settings?

total compute cost = $O(K(n+m))$ = $\mathcal{O}\left( (n + \kappa) \log\left( \frac{f(\hat w_0) - f^*}{\epsilon} \right) \right)$

Demo¶

Let's look at how SVRG performs compared to SGD.

tic = time.time()
(_, dto_svrg) = svrg(w0, 0.02, X, y, 5000, 20, wopt);
print(f"SVRG time: {time.time() - tic} seconds")

tic = time.time()
(_, _, dto_sgd1, dto_avg1) = sgd(w0, 0.02, X, y, 5000 * 20, wopt, 40000);
print(f"SGD time: {time.time() - tic} seconds")

tic = time.time()
(_, _, dto_sgd2, dto_avg2) = sgd(w0, 0.002, X, y, 5000 * 20, wopt, 40000);
print(f"SGD time: {time.time() - tic} seconds")

SVRG time: 3.4920661449432373 seconds
SGD time: 2.571218252182007 seconds
SGD time: 2.2424581050872803 seconds

pyplot.semilogy(dto_svrg, label="SVRG, $\\alpha = 0.02$")
pyplot.semilogy(dto_sgd1, label="SGD, $\\alpha = 0.02$")
pyplot.semilogy(dto_avg1, label="Avg SGD, $\\alpha = 0.02$")
pyplot.semilogy(dto_sgd2, label="SGD, $\\alpha = 0.002$")
pyplot.semilogy(dto_avg2, label="Avg SGD, $\\alpha = 0.002$")
pyplot.legend()
pyplot.xlabel("inner iterations")
pyplot.ylabel("distance to optimum")
pyplot.title("SVRG vs SGD");

Take-away¶

SVRG can give us the best of both worlds of SGD and GD.

The linear rate of gradient descent on strongly convex problems (putting the target loss gap $\epsilon$ inside a log term).
The dependence on the condition number is no longer multiplied by the dataset size $n$ (from $n \kappa$ to $n + \kappa$).

Unfortunately, not very successful empirically on deep learning and often not a helpful tradeoff compared to plain SGD when we don't want very accurate solutions to the optimization problem (when $\epsilon$ is not very small).

...