import numpy
import scipy
import matplotlib
from matplotlib import pyplot
import time
If we assume that for all $x, y \in \R^d$, $\| \nabla f(x) - \nabla f(y) \| \le L \cdot \| x - y \|$, then
$$f(w_{t+1}) \le f(w_t) - \alpha \left(1 - \frac{1}{2} \alpha L \right) \cdot \| \nabla f(w_t) \|^2.$$ If we choose our step size $\alpha$ to be small enough that $\alpha L \le 1$, then $1 - \frac{1}{2} \alpha L \ge \frac{1}{2}$, and this simplifies to
$$f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \cdot \| \nabla f(w_t) \|^2.$$ That is, the objective is guaranteed to decrease at each iteration! This matches our intuition for why gradient descent should work.
With too large a step size, we can overshoot the optimum!
Consider the following case of $f(w) = \frac{1}{2} w^2$. Here $f'(w) = \nabla f(w) = w$ (and $f''(w) = 1$), so it's $L$-smooth with $L = 1$. Suppose we're at $w_t = 2$ as shown here:
# plot f(w) = w^2 / 2 and mark the current iterate w_t = 2 in red
x = numpy.arange(-2, 3, 0.01)
y = x**2 / 2
pyplot.plot(x, y);
pyplot.scatter([2.0], [2.0**2 / 2], c="r", zorder=10)
Here our GD update will be $w_{t+1} = w_t - \alpha \nabla f(w_t) = 2 - \alpha \cdot 2$. If we step with $\alpha = 1$, we go directly to the optimum at $w = 0$. But! If we step with a larger $\alpha$, we overshoot, and for $\alpha > 2$ our loss $f(w)$ actually increases. This illustrates why having sufficiently small steps is necessary for our proof.
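To make the overshooting concrete, here is a minimal sketch (not from the original notes; the step sizes tried are arbitrary choices) that runs ten GD steps on $f(w) = \frac{1}{2} w^2$ starting from $w = 2$:

# For f(w) = w^2 / 2 the gradient is w, so each GD step is w <- (1 - alpha) * w.
# The step sizes below are arbitrary choices meant to show the three regimes.
for alpha in [0.5, 1.0, 1.5, 2.5]:
    w = 2.0
    for it in range(10):
        w = w - alpha * w
    print(f"alpha = {alpha}: w = {w:.4g}, f(w) = {w**2 / 2:.4g}")

With $\alpha = 0.5$ the iterates converge monotonically, with $\alpha = 1.5$ they converge while oscillating around the optimum, and with $\alpha = 2.5$ they diverge.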
We can also use a simple example to illustrate why assuming some sort of smoothness is necessary to show this result. Consider the (dumb) example of $f(w) = \sqrt{|w|}$. Its derivative, $\operatorname{sign}(w) / (2 \sqrt{|w|})$, blows up as $w$ approaches the optimum at $w = 0$, so no single constant $L$ can bound how fast the gradient changes.
# plot f(w) = sqrt(|w|) and run 100 steps of GD with a fixed step size
x = numpy.arange(-3, 3, 0.01)
y = numpy.sqrt(numpy.abs(x))
w = 2
prev_w = numpy.array(w)
alpha = 0.5
for it in range(100):
    # gradient of sqrt(|w|) is sign(w) / (2 sqrt(|w|))
    w = w - alpha * numpy.sign(w) / (2 * numpy.sqrt(numpy.abs(w)))
    prev_w = numpy.append(prev_w, w)
pyplot.plot(x, y);
pyplot.scatter(prev_w, numpy.sqrt(numpy.abs(prev_w)), [30], c="g", zorder=100)
pyplot.scatter([w], [numpy.sqrt(numpy.abs(w))], [100], c="r", zorder=100)
# plot the sequence of iterates over time: they oscillate around the optimum without converging
pyplot.plot(prev_w)
Basic idea: in gradient descent, just replace the full gradient (which is a sum over the training examples) with the gradient of a single example. Initialize the parameters at some value $w_0 \in \R^d$, and decrease the value of the empirical risk iteratively by sampling a random example $x_{(t)}$ uniformly from the training set and then updating
$w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t; x_{(t)})$
where as usual $w_t$ is the value of the parameter vector at time $t$, $\alpha_t$ is the learning rate or step size, and $\nabla f(w; x)$ denotes the gradient of the loss on training example $x$. Compared with gradient descent and Newton's method, SGD is simple to implement and runs each iteration faster.
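A minimal sketch of this loop in code (illustrative only: the function `grad_f`, the data array `X`, and the calling convention are assumptions, not part of these notes):

import numpy

def sgd(grad_f, X, w0, alpha, num_iters):
    # grad_f(w, x) is assumed to return the gradient of the loss on example x
    w = w0
    n = X.shape[0]
    for t in range(num_iters):
        i = numpy.random.randint(n)      # sample a training example uniformly at random
        w = w - alpha * grad_f(w, X[i])  # step along that one example's gradient
    return w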
This is not necessarily going to decrease the loss at every step!
That's because we're just moving in a direction that will decrease the loss for one particular example: this won't necessarily decrease the total loss!
So we can't demonstrate convergence by using a proof like the one we used for gradient descent, where we showed that the loss decreases at every iteration of the algorithm.
The fact that SGD doesn't always improve the loss at each iteration motivates the question: does SGD even work? And if so, why does SGD work?
Assumption: $f$ is $L$-smooth, i.e. for any $x$ and $y$ $$\| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|.$$
Assumption: loss $f$ is non-negative. (This is without loss of generality if $f$ is bounded from below, since we can always add a constant to $f$ in that case to make it non-negative.)
New Assumption for SGD: the variance of the gradients is bounded. For some constant $\sigma > 0$, if $x$ is drawn uniformly at random from the training set, for any $w$ $$\mathbf{E}_x\left[ \| \nabla f(w; x) - \nabla f(w) \|^2 \right] \le \sigma^2,$$ or equivalently, $$\frac{1}{n} \sum_{x \in \mathcal{D}} \| \nabla f(w; x) - \nabla f(w) \|^2 \le \sigma^2,$$ or also equivalently, $$\mathbf{E}_x\left[ \| \nabla f(w; x) \|^2 \right] \le \| \nabla f(w) \|^2 + \sigma^2.$$
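To see why the first and third forms are equivalent, note the identity $\mathbf{E}[\|g\|^2] = \|\mathbf{E}[g]\|^2 + \mathbf{E}[\|g - \mathbf{E}[g]\|^2]$ applied to $g = \nabla f(w; x)$, using $\mathbf{E}_x[\nabla f(w; x)] = \nabla f(w)$. Here is a quick numerical check on a toy least-squares problem (the problem, sizes, and seed are all illustrative assumptions):

import numpy

# Toy least-squares problem: f(w; a_i, b_i) = (a_i^T w - b_i)^2 / 2, so the
# gradient on example i is a_i (a_i^T w - b_i).
numpy.random.seed(0)
n, d = 100, 5
A = numpy.random.randn(n, d)
b = numpy.random.randn(n)
w = numpy.random.randn(d)

grads = A * (A @ w - b)[:, None]              # row i is the gradient on example i
full_grad = grads.mean(axis=0)                # gradient of the empirical risk
mean_sq_norm = (grads**2).sum(axis=1).mean()  # E[ || grad f(w; x) ||^2 ]
variance = ((grads - full_grad)**2).sum(axis=1).mean()

# these two quantities agree up to floating-point error
print(mean_sq_norm, (full_grad**2).sum() + variance)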
As before, we consider what happens at the next time step. \begin{align*} f(w_{t+1}) &= f(w_t - \alpha_t \nabla f(w_t; x_{(t)})) \\&= f(w_t) + \int_{0}^{\alpha_t} \frac{\partial}{\partial \eta} f(w_t - \eta \nabla f(w_t; x_{(t)})) \; d \eta \\&= f(w_t) + \int_{0}^{\alpha_t} \left( - \nabla f(w_t; x_{(t)}) \right)^T \nabla f(w_t - \eta \nabla f(w_t; x_{(t)})) \; d \eta \\&= f(w_t) -\int_{0}^{\alpha_t} \left( \nabla f(w_t; x_{(t)})^T \nabla f(w_t) - \nabla f(w_t; x_{(t)})^T \left( \nabla f(w_t) - \nabla f(w_t - \eta \nabla f(w_t; x_{(t)})) \right) \right) \; d \eta \\&= f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + \int_{0}^{\alpha_t} \nabla f(w_t; x_{(t)})^T \left( \nabla f(w_t) - \nabla f(w_t - \eta \nabla f(w_t; x_{(t)})) \right) \; d \eta \\&\le f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + \int_{0}^{\alpha_t} \| \nabla f(w_t; x_{(t)}) \| \cdot \| \nabla f(w_t) - \nabla f(w_t - \eta \nabla f(w_t; x_{(t)})) \| \; d \eta \\&\le f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + \int_{0}^{\alpha_t} \| \nabla f(w_t; x_{(t)}) \| \cdot L \| w_t - (w_t - \eta \nabla f(w_t; x_{(t)})) \| \; d \eta \\&\le f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + \int_{0}^{\alpha_t} \| \nabla f(w_t; x_{(t)}) \| \cdot L \| \eta \nabla f(w_t; x_{(t)}) \| \; d \eta \\&\le f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + L \| \nabla f(w_t; x_{(t)}) \|^2 \int_{0}^{\alpha_t} \eta \; d \eta \\&\le f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + \frac{\alpha_t^2 L}{2} \| \nabla f(w_t; x_{(t)}) \|^2. \end{align*}
So, $$f(w_{t+1}) \le f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + \frac{\alpha_t^2 L}{2} \| \nabla f(w_t; x_{(t)}) \|^2.$$ Now, this isn't a descent method, but the loss will still tend to decrease in expectation. Let's take the expected value of both sides conditioned on $w_t$. The randomness here is over the random choice of $x_{(t)}$. $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le \mathbf{E}\left[ f(w_t) -\alpha_t \nabla f(w_t; x_{(t)})^T \nabla f(w_t) + \frac{\alpha_t^2 L}{2} \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right].$$ By linearity of expectation and by factoring out constants that don't depend on the random choice of $x_{(t)}$, $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) -\alpha_t \mathbf{E}[ \nabla f(w_t; x_{(t)}) \mid w_t ]^T \nabla f(w_t) + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right].$$ Now, the first expected value is $$\mathbf{E}[ \nabla f(w_t; x_{(t)}) \mid w_t ] = \frac{1}{n} \sum_{x \in \mathcal{D}} \nabla f(w_t; x) = \nabla f(w_t).$$ So we get $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \alpha_t \nabla f(w_t)^T \nabla f(w_t) + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right]$$ which is equivalent to $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \alpha_t \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 L}{2} \mathbf{E}\left[ \| \nabla f(w_t; x_{(t)}) \|^2 \,\middle|\, w_t \right].$$ We can also apply our assumption of a bound on the variance of the stochastic gradients to this second expected value, which yields $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \alpha_t \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 L}{2} \left( \| \nabla f(w_t) \|^2 + \sigma^2 \right).$$ This simplifies to $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \alpha_t \left( 1 - \frac{\alpha_t L}{2} \right) \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 \sigma^2 L}{2}.$$ If we constrain the step size to be sufficiently small, $\alpha_t L \le 1$, then this can be further simplified to $$\mathbf{E}[ f(w_{t+1}) \mid w_t] \le f(w_t) - \frac{\alpha_t}{2} \| \nabla f(w_t) \|^2 + \frac{\alpha_t^2 \sigma^2 L}{2}.$$ Finally, let's take the full expected value over all the randomness in the algorithm, not just this conditional expectation. By the law of total expectation, this yields $$\mathbf{E}[ f(w_{t+1}) ] = \mathbf{E}[ \mathbf{E}[ f(w_{t+1}) \mid w_t] ] \le \mathbf{E}[ f(w_t) ] - \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] + \frac{\alpha_t^2 \sigma^2 L}{2}.$$
Rearranging the above expression, we can get $$\frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le \mathbf{E}[ f(w_t) ] - \mathbf{E}[ f(w_{t+1}) ] + \frac{\alpha_t^2 \sigma^2 L}{2}.$$ Now, let's imagine that we run $K$ total iterations of SGD, and we sum both sides of this expression for $t$ from $0$ to $K-1$. This yields $$\sum_{t=0}^{K-1} \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le \left( \sum_{t=0}^{K-1} \left( \mathbf{E}[ f(w_t) ] - \mathbf{E}[ f(w_{t+1}) ] \right) \right) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$ Observe that the first sum on the right telescopes! So we get $$\sum_{t=0}^{K-1} \left( \mathbf{E}[ f(w_t) ] - \mathbf{E}[ f(w_{t+1}) ] \right) = \mathbf{E}[ f(w_0) ] - \mathbf{E}[ f(w_K) ] = f(w_0) - \mathbf{E}[ f(w_K) ] \le f(w_0),$$ where this last inequality follows from our assumption that the loss is non-negative. Thus, $$\sum_{t=0}^{K-1} \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$ Now, let $$\rho_t = \frac{\alpha_t}{\sum_{k = 0}^{K-1} \alpha_k}.$$ We can show that $$\left( \sum_{t=0}^{K-1} \frac{\alpha_t}{2} \right) \sum_{t=0}^{K-1} \rho_t \mathbf{E}[ \| \nabla f(w_t) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2.$$ But if we let $\tau$ be a random variable with distribution given by $\rho$ (this is a distribution because it sums to $1$), then we can write this left side in expected-value form as $$\left( \sum_{t=0}^{K-1} \frac{\alpha_t}{2} \right) \cdot \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le f(w_0) + \frac{\sigma^2 L}{2} \sum_{t=0}^{K-1} \alpha_t^2,$$ where now the expected value is taken over both the algorithmic randomness and the choice of $\tau$. We can think of $w_{\tau}$ as the output of SGD run for a random number of steps.
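In code, this "output a random iterate" rule might look like the following sketch (illustrative assumptions throughout: `grad_f`, `X`, and the list of per-step sizes `alphas` are not from these notes):

import numpy

def sgd_random_iterate(grad_f, X, w0, alphas):
    # run SGD with step size alphas[t] at step t, recording every iterate
    w, iterates = w0, [w0]
    n = X.shape[0]
    for alpha in alphas:
        i = numpy.random.randint(n)
        w = w - alpha * grad_f(w, X[i])
        iterates.append(w)
    # output w_tau where P(tau = t) = rho_t = alphas[t] / sum(alphas)
    rho = numpy.asarray(alphas) / numpy.sum(alphas)
    tau = numpy.random.choice(len(alphas), p=rho)
    return iterates[tau]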
Now suppose we use a constant step size $\alpha_t = \alpha$. Then $\sum_{t=0}^{K-1} \alpha_t = \alpha K$ and $\sum_{t=0}^{K-1} \alpha_t^2 = \alpha^2 K$, so the bound becomes $$\frac{\alpha K}{2} \cdot \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le f(w_0) + \frac{\alpha^2 \sigma^2 L K}{2}.$$ Multiplying both sides by $\frac{2}{\alpha K}$, $$\mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] \le \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L.$$
SGD with constant step size converges to a "noise ball" (a.k.a. "noise floor")! Gradient magnitude doesn't necessarily go to zero.
Even if we run for a very large number of iterations, $$\lim_{K \rightarrow \infty} \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L = \alpha \sigma^2 L \ne 0.$$ For many applications this is fine...but it seems somehow lacking.
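We can visualize the noise ball with a small simulation (a sketch under assumed constants: the quadratic objective, the Gaussian noise model, and the values of $\alpha$ and $\sigma$ are all illustrative choices):

import numpy
from matplotlib import pyplot

# SGD on f(w) = w^2 / 2, where each stochastic gradient is the true gradient w
# plus Gaussian noise with standard deviation sigma
numpy.random.seed(0)
alpha, sigma = 0.1, 1.0
w, traj = 2.0, []
for t in range(500):
    g = w + sigma * numpy.random.randn()
    w = w - alpha * g
    traj.append(w)
pyplot.plot(traj);

The iterates rapidly approach $w = 0$ but then keep bouncing around it at a scale set by $\alpha$ and $\sigma$, rather than converging.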
We can fix this by choosing the step size constant as a function of the number of steps $K$ we plan to run. Set $$\alpha = \min\left( \sqrt{\frac{2 f(w_0)}{\sigma^2 L K}}, \frac{1}{L} \right).$$ You can derive this setting for yourself by minimizing the bound with respect to $\alpha$ (the minimization is worked out below). Then since $\min(x,y) \le x$ and $\max(x,y) \le x + y$ for non-negative $x$ and $y$, \begin{align*} \mathbf{E}[ \| \nabla f(w_\tau) \|^2 ] &\le \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L \\&= \frac{2 f(w_0)}{K} \max\left( \sqrt{\frac{\sigma^2 L K}{2 f(w_0)}}, L \right) + \sigma^2 L \min\left( \sqrt{\frac{2 f(w_0)}{\sigma^2 L K}}, \frac{1}{L} \right) \\&\le \frac{2 f(w_0)}{K} \left( \sqrt{\frac{\sigma^2 L K}{2 f(w_0)}} + L \right) + \sigma^2 L \sqrt{\frac{2 f(w_0)}{\sigma^2 L K}} \\&= \sqrt{\frac{2 f(w_0) \sigma^2 L}{K}} + \frac{2 L f(w_0)}{K} + \sqrt{\frac{2 f(w_0)\sigma^2 L}{K}} \\&= \sqrt{\frac{8 f(w_0)\sigma^2 L}{K}} + \frac{2 L f(w_0)}{K}. \end{align*} This decreases to zero as $K$ increases.
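Filling in the minimization that gives this step size: ignoring the cap at $\frac{1}{L}$ (which just enforces our earlier requirement $\alpha L \le 1$), we set the derivative of the bound with respect to $\alpha$ to zero, $$\frac{d}{d\alpha} \left( \frac{2 f(w_0)}{\alpha K} + \alpha \sigma^2 L \right) = -\frac{2 f(w_0)}{\alpha^2 K} + \sigma^2 L = 0 \quad\Longrightarrow\quad \alpha = \sqrt{\frac{2 f(w_0)}{\sigma^2 L K}}.$$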
Recall that we had $$\mathbf{E}[ f(w_{t+1}) ] \le \mathbf{E}[ f(w_t) ] - \frac{\alpha_t}{2} \mathbf{E}[ \| \nabla f(w_t) \|^2 ] + \frac{\alpha_t^2 \sigma^2 L}{2}.$$ If $f$ is $\mu$-strongly convex, we have that $\| \nabla f(x) \|^2 \ge 2 \mu \left( f(x) - f^* \right)$, and so $$\mathbf{E}[ f(w_{t+1}) ] \le \mathbf{E}[ f(w_t) ] - \frac{\alpha_t}{2} \mathbf{E}[ 2 \mu \left( f(w_t) - f^* \right) ] + \frac{\alpha_t^2 \sigma^2 L}{2}$$ and so, subtracting the global minimum $f^*$ from both sides, $$\mathbf{E}[ f(w_{t+1}) - f^* ] \le \mathbf{E}[ f(w_t) - f^* ] - \alpha_t \mu \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha_t^2 \sigma^2 L}{2},$$ which simplifies to $$\mathbf{E}[ f(w_{t+1}) - f^* ] \le (1 - \alpha_t \mu) \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha_t^2 \sigma^2 L}{2}.$$
Again supposing a constant step size, $$\mathbf{E}[ f(w_{t+1}) - f^* ] \le (1 - \alpha \mu) \cdot \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha^2 \sigma^2 L}{2}.$$ Subtracting $\frac{\alpha \sigma^2 L}{2 \mu}$ from both sides, \begin{align*} \mathbf{E}\left[ f(w_{t+1}) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right] &\le (1 - \alpha \mu) \cdot \mathbf{E}[ f(w_t) - f^* ] + \frac{\alpha^2 \sigma^2 L}{2} - \frac{\alpha \sigma^2 L}{2 \mu} \\&\le (1 - \alpha \mu) \cdot \mathbf{E}\left[ f(w_t) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right]. \end{align*} We can now apply this recursively! Over $K$ total iterations, $$\mathbf{E}\left[ f(w_K) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right] \le (1 - \alpha \mu)^K \cdot \left( f(w_0) - f^* - \frac{\alpha \sigma^2 L}{2 \mu} \right) \le \exp(- \alpha \mu K) \cdot \left( f(w_0) - f^* \right),$$ where we were able to drop the expected value on the right because $w_0$ is not a random variable. This yields a final bound of $$\mathbf{E}\left[ f(w_K) - f^* \right] \le \exp(- \alpha \mu K) \cdot \left( f(w_0) - f^* \right) + \frac{\alpha \sigma^2 L}{2 \mu}.$$
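As a sanity check, we can unroll this recursion numerically and compare it against the closed-form bound (all constants below are arbitrary illustrative choices):

import math

# unroll E[f(w_{t+1}) - f*] <= (1 - alpha mu) E[f(w_t) - f*] + alpha^2 sigma^2 L / 2
alpha, mu, L, sigma, gap0, K = 0.05, 1.0, 1.0, 1.0, 4.0, 200
gap = gap0
for t in range(K):
    gap = (1 - alpha * mu) * gap + alpha**2 * sigma**2 * L / 2
# closed-form bound: exp(-alpha mu K) (f(w_0) - f*) + alpha sigma^2 L / (2 mu)
bound = math.exp(-alpha * mu * K) * gap0 + alpha * sigma**2 * L / (2 * mu)
print(gap, bound)  # the unrolled recursion stays below the closed-form bound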
Again we have the same issue of convergence to a noise ball for constant $\alpha$. We can minimize this over $\alpha$ to pick a number-of-steps-dependent step size. $$0 = -\mu K \exp(- \alpha \mu K) \cdot \left( f(w_0) - f^* \right) + \frac{\sigma^2 L}{2 \mu} \rightarrow \alpha = \frac{1}{\mu K} \log\left( \frac{2 \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right).$$ This yields \begin{align*} \mathbf{E}\left[ f(w_K) - f^* \right] &\le \frac{\sigma^2 L}{2 \mu^2 \left( f(w_0) - f^* \right) K} \cdot \left( f(w_0) - f^* \right) + \frac{\sigma^2 L}{2 \mu} \cdot \frac{1}{\mu K} \log\left( \frac{2 \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right) \\&= \frac{\sigma^2 L}{2 \mu^2 K} + \frac{\sigma^2 L}{2 \mu^2 K} \log\left( \frac{2 \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right) \\&= \frac{\sigma^2 L}{2 \mu^2 K} \log\left( \frac{2e \mu^2 \left( f(w_0) - f^* \right) K}{\sigma^2 L} \right). \end{align*} This is indeed approaching $0$ as $K$ becomes large!