Lecture 3: Scaling to complex models by learning with optimization algorithms.

Gradient descent, convex optimization and conditioning.

CS4787 — Principles of Large-Scale Machine Learning Systems

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\norm}[1]{\left\|#1\right\|}$

In [1]:
import numpy
import scipy
import matplotlib
import time

def hide_code_in_slideshow():   
    from IPython import display
    import binascii
    import os
    uid = binascii.hexlify(os.urandom(8)).decode()    
    html = """<div id="%s"></div>
    <script type="text/javascript">
            var p = $("#%s");
            if (p.length==0) return;
            while (!p.hasClass("cell")) {
                if (p.prop("tagName") =="body") return;
            var cell = p;
    </script>""" % (uid, uid)
    display.display_html(html, raw=True)

Review: The Empirical Risk

Suppose we have a dataset $\mathcal{D} = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \}$, where $x_i \in \mathcal{X}$ is an example and $y_i \in \mathcal{Y}$ is a label. Let $h: \mathcal{X} \rightarrow \mathcal{Y}$ be a hypothesized model (mapping from examples to labels) we are trying to evaluate. Let $L: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ be a loss function which measures how different two labels are.

The empirical risk is $$ R(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i). $$ Most notions of error or accuracy in machine learning can be captured with an empirical risk. A simple example: measure the error rate on the dataset with the 0-1 Loss Function $$ L(\hat y, y) = 0 \text{ if } \hat y = y \text{ and } 1 \text{ if } \hat y \ne y. $$

What are other examples you've seen of empirical risk and loss functions?

Loss function:

  • Mean squared error
  • Hinge loss
  • Cross Entropy loss
  • Absolute error
  • Negative log likelihood
  • ROC


  • $\ell_2$ and $\ell_1$ regularization

The Cost of Computing the Empirical Risk

We need to compute the empirical risk a lot during training, both during validation (and hyperparameter optimization) and testing, so it's nice if we can do it fast.

Question: how does computing the empirical risk scale? Three things affect the cost of computation.

  • The number of training examples $n$. The cost will certainly be proportional to $n$.
  • The cost to compute the loss function $L$.
  • The cost to evaluate the hypothesis $h$.

Review: Empirical Risk Minimization

We don't just want to compute the empirical risk: we also want to minimize it. To do so, we parameterize the hypotheses using some parameters $w \in \R^d$. That is, we assign each hypothesis a $d$-dimensional vector of parameters and vice versa and solve the optimization problem \begin{align*} \mbox{minimize: }& R(h_w) = \frac{1}{n} \sum_{i=1}^n L(h_w(x_i), y_i) = f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) \\ \mbox{ over }& w \in \R^d \end{align*} where $h_w$ denotes the hypothesis associated with the parameter vector $w$. This is an instance of a principle of scalable ML: Write your learning task as an optimization problem, then solve it with an optimization algorithm.

Most optimization algorithms used in machine learning with with gradients...so let's make sure we're on the same page about gradients before we move on.


What is a gradient?

Suppose I have a function $f$ from $\R^d$ to $\R$. The gradient, $\nabla f$, is a function from $\R^d$ to $\R^d$ such that

$$\left(\nabla f(w) \right)_i = \frac{\partial}{\partial w_i} f(w) = \lim_{\delta \rightarrow 0} \frac{f(w + \delta e_i) - f(w)}{\delta},$$

that is, it is the vector of partial derivatives of the function. Another, perhaps cleaner (and basis-independent), definition is that $\nabla f(w)^T$ is the linear map such that for any $u \in \R^d$

$$\nabla f(w)^T u = \lim_{\delta \rightarrow 0} \frac{f(w + \delta u) - f(w)}{\delta} = \left. \frac{\partial}{\partial \alpha} f(w + \alpha u) \; \right|_{\alpha = 0}.$$

More informally, it is the unique vector such that $f(w) \approx f(w_0) + (w - w_0)^T \nabla f(w_0) + \mathcal{o}( |w - w_0| )$ for $w$ nearby $w_0$.

Let's derive some gradients!

$f(x) = x^T A x$

$f(x) = \sum_{i=1}^d \sum_{j=1}^d A_{ij} x_i x_j$

$\left( \nabla f(x) \right)_k = \frac{\partial}{\partial x_k} f(x) = \sum_{j=1}^d A_{kj} x_j + \sum_{i=1}^d A_{ik} x_i$

$\nabla f(x) = \sum_{k=1}^d \left( \nabla f(x) \right)_k e_k$

$\nabla f(x) = \sum_{k=1}^d \left( \sum_{j=1}^d A_{kj} x_j + \sum_{i=1}^d A_{ik} x_i \right) e_k$

$\nabla f(x) = Ax + A^T x$

$\nabla^2 f(w) = A + A^T$

$(\nabla^2 f(w))^{-1} \nabla f(x) = x$

$$(x + y)^T z = x^T z + y^T z$$\begin{align*} f(x + \delta u) &= (x + \delta u)^T A (x + \delta u) \\&= x A (x + \delta u) + \delta u A (x + \delta u) \\&= x^T A x + \delta (u^T A x + u^T A^T x) + \delta^2 u^T A u \end{align*}
$$\lim_{\delta \rightarrow 0} \frac{x^T A x + \delta (u^T A x + u^T A^T x) + \delta^2 u^T A u - x^T A x}{\delta} = u^T A x + u^T A^T x$$

$f(x) = \| x \|_2^2 = \sum_{i=1}^d x_i^2 = x^T x = x^T I x$

$$\frac{\partial f}{\partial x_k} = \frac{\partial}{\partial x_k} x_k^2 = 2 x_k$$$$\nabla f(x) = 2 x$$

$f(x) = \| x \|_1 = \sum_{i=1}^d | x_i |$

$$\frac{\partial f}{\partial x_k} = \frac{\partial}{\partial x_k} |x_k| = \operatorname{sign}(x_k)$$$$\nabla f(x) = \operatorname{sign}(x),$$

where sign operates elementwise

$f(x) = \| x \|_{\infty} = \max(|x_1|, |x_2|, \ldots, |x_d|)$

$$\frac{\partial f}{\partial x_k} = \begin{cases} 0 & | x_k | < \| x \|_{\infty} \\ \operatorname{sign}(x_k) & | x_k | = \| x \|_{\infty} \end{cases}$$$$\nabla f(x) = \operatorname{sign}(x) \cdot 1_{|x_k| = \|x\|_\infty}$$

If $x = \begin{bmatrix} -2 \\ 1 \end{bmatrix}$

$$\nabla f(x) = \begin{bmatrix} \operatorname{sign}(-2) \\ 0 \end{bmatrix} = \begin{bmatrix} -1 \\ 0 \end{bmatrix}.$$

Higher-order derivatives

Gradients have natural generalizations to second- and higher-order derivatives. The most common one to see in ML is the Hessian, which is the matrix of second-partial derivatives.

Suppose I have a function $f$ from $\R^d$ to $\R$. The Hessian, $\nabla^2 f$, is a function from $\R^d$ to $\R^{d \times d}$ such that

$$\left(\nabla^2 f(w) \right)_{ij} = \frac{\partial^2}{\partial w_i \partial w_j} f(w).$$

More informally, it is the unique matrix such that $$f(w) \approx f(w_0) + (w - w_0)^T \nabla f(w_0) + \frac{1}{2} (w-w_0)^T \nabla^2 f(w_0) (w-w_0) + \mathcal{o}( |w - w_0|^2 )$$ for $w$ nearby $w_0$.

Gradient descent (GD).

Initialize the parameters at some value $w_0 \in \R^d$, and decrease the value of the empirical risk iteratively by running $$ w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t) = w_t - \alpha_t \cdot \frac{1}{n} \sum_{i=1}^n \nabla f_i(w) $$ where $w_t$ is the value of the parameter vector at time $t$, $\alpha_t$ is a parameter called the learning rate or step size, and $\nabla f$ denotes the gradient (vector of partial derivatives) of $f$.

Gradient descent is not the only classic optimization method though...

Another iterative method of optimization: Newton's method.

Initialize the parameters at some value $w_0 \in \R^d$, and decrease the value of the empirical risk iteratively by running $$ w_{t+1} = w_t - \left( \nabla^2 f(w_t) \right)^{-1} \nabla f(w_t). $$ Newton's method can converge more quickly than gradient descent, because it is a second-order optimization method (it uses the second derivative of $f$) whereas gradient descent is only a first-order method (it uses the first derivative of $f$).

Question: what is the computational cost of running a single step of gradient descent and Newton's method? Not including the training set, how much memory is required?

Suppose that computing gradients of the examples takes $\Theta(d)$ time and computing Hessians takes $\Theta(d^2)$ time, and express your answer in terms $n$ and $d$.

Gradient descent: $ w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t) = w_t - \alpha_t \cdot \frac{1}{n} \sum_{i=1}^n \nabla f_i(w) $

Time: $O(nd)$

Memory: $O(d)$

Newton's method: $ w_{t+1} = w_t - \left( \nabla^2 f(w_t) \right)^{-1} \nabla f(w_t). $

Time: $O(nd^2 + d^3) = O(nd + nd^2 + d^3 + d^2 + d)$

Memory: $O(d^2)$

Why does gradient descent work?

...and what does it mean for it to work?

Intuitively, GD decreases the value of the objective at each iteration, as long as the learning rate is small enough and the value of the gradient is nonzero. Eventually, this should result in gradient descent coming close to a point where the gradient is zero. We can prove that gradient descent works under the assumption that the second derivative of the objective is bounded (this is sometimes called an $L$-smoothness bound).

  • There are other assumptions we could use here also, but this is the simplest one.

Suppose that for some constant $L > 0$, for all $x$ and $y$ in $\R^d$,

$$ \| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|. $$

Here, $\| \cdot \|$ denotes the $\ell_2$ norm: $\|u\| = \sqrt{\sum_{i=1}^d u_i^2}$.

  • This is equivalent to the condition that $\| \nabla^2 f(x) \|_2 \le L$ (where the norm here denotes the induced norm) if the function is twice-differentiable.

Starting from this condition, let's look at how the objective changes over time as we run gradient descent with a fixed step size.

\begin{align*} f(w_{t+1}) &= f(w_t - \alpha \nabla f(w_t)). \end{align*}

Let's consider this expression as a function of $\alpha$. Call this $\rho(\alpha) = f(w_t - \alpha \nabla f(w_t))$. Observe that

$$\rho'(\alpha) = \frac{\partial}{\partial \alpha} f(w_t - \alpha \nabla f(w_t)) = -\nabla f(w_t - \alpha \nabla f(w_t))^T \nabla f(w_t).$$

Now, by the fundamental theorem of calculus,

$$ \rho(\alpha) - \rho(0) = \int_0^\alpha \rho'(\eta) \; d \eta. $$

If we substitute back in our definition of $\rho$, we get

$$f(w_{t+1}) = f(w_t - \alpha \nabla f(w_t)) = f(w_t) - \int_0^{\alpha} \nabla f(w_t - \eta \nabla f(w_t))^T \nabla f(w_t) \; d \eta.$$

We can now simplify this as follows. (Keep in mind: $\| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|$).

\begin{align*} f(w_{t+1}) &= f(w_t) - \int_0^{\alpha} \nabla f(w_t - \eta \nabla f(w_t))^T \nabla f(w_t) \; d \eta. \\&= f(w_t) - \alpha \nabla f(w_t)^T \nabla f(w_t) - \int_0^{\alpha} \left( \nabla f(w_t - \eta \nabla f(w_t)) - \nabla f(w_t) \right)^T \nabla f(w_t) \; d \eta. \\&= f(w_t) - \alpha \| \nabla f(w_t) \|^2 + \int_0^{\alpha} \left( \nabla f(w_t) - \nabla f(w_t - \eta \nabla f(w_t)) \right)^T \nabla f(w_t) \; d \eta. \\&\le f(w_t) - \alpha \| \nabla f(w_t) \|^2 + \int_0^{\alpha} \left\| \nabla f(w_t) - \nabla f(w_t - \eta \nabla f(w_t)) \right\| \cdot \left| \nabla f(w_t) \right| \; d \eta. \end{align*}

Where the last line follows from $x^T y \le \| x \| \cdot \| y \|$. Now by our smoothness bound, \begin{align*} f(w_{t+1}) &\le f(w_t)

  • \alpha | \nabla f(w_t) |^2
  • \int_0^{\alpha} L \cdot \left| (w_t) - (w_t - \eta \nabla f(w_t)) \right| \cdot \left| \nabla f(w_t) \right| \; d \eta \&= f(w_t)
  • \alpha | \nabla f(w_t) |^2
  • L | \nabla f(w_t) |^2 \cdot \int_0^{\alpha} \eta \; d \eta \&\le f(w_t)
  • \alpha | \nabla f(w_t) |^2
  • \frac{1}{2} \alpha^2 L | \nabla f(w_t) |^2 \&\le f(w_t)
  • \alpha \left(1 - \frac{1}{2} \alpha L \right) \cdot | \nabla f(w_t) |^2. \end{align*}

Note that there is a somewhat easier way of getting to this result using Taylor's theorem rather than FTC; this easier approach is the one I use in the notes. I'm using this more-long-winded way of getting there to make sure that everyone gets the intuition.

So, we got to

$$f(w_{t+1}) \le f(w_t) - \alpha \left(1 - \frac{1}{2} \alpha L \right) \cdot \| \nabla f(w_t) \|^2.$$

If we choose our step size $\alpha$ to be small enough that $1 \ge \alpha L$, then this simplifies to

$$f(w_{t+1}) \le f(w_t) - \frac{\alpha}{2} \cdot \| \nabla f(w_t) \|^2.$$

That is, the objective is guaranteed to decrease at each iteration! This matches our intuition for why gradient descent should work.

Now, if we sum this up across $T$ iterations of gradient descent, we get

$$ \frac{1}{2} \alpha \sum_{t = 0}^{T-1} \norm{ \nabla f(w_t) }^2 \le \sum_{t = 0}^{T-1} \left( f(w_t) - f(w_{t+1}) \right) = f(w_0) - f(w_T) \le f(w_0) - f^* $$

where $f^*$ is the global minimum value of the loss function $f$. From here, we can get

$$ \min_{t \in \{0, \ldots, T\}} \norm{ \nabla f(w_t) }^2 \le \frac{1}{T} \sum_{t = 0}^{T-1} \norm{ \nabla f(w_t) }^2 \le \frac{2 (f(w_0) - f^*)}{\alpha T}. $$

This means that the smallest gradient we observe after $T$ iterations is getting smaller proportional to $1/T$. So gradient descent converges...

  • as long as we look at the smallest observed gradient
  • and we care about finding a point where the gradient is small
Question: does this ensure that gradient descent converges to the global optimum (i.e. the value of $w$ that minimizes $f$ over all $w \in \R^d$)?


When can we ensure that gradient descent does converge to the unique global optimum?

Certainly, we can do this when there is only one global optimum.

The simplest case of this is the case of convex objectives.

Convex Functions

A function $f: \R^d \rightarrow \R$ is convex if for all $x, y \in \R^d$, and all $\eta \in [0,1]$

$$f(\eta x + (1 - \eta) y) \le \eta f(x) + (1 - \eta) f(y).$$

What this means graphically is that if we draw a line segment between any two points in the graph of the function, that line segment will lie above the function. There are a bunch of equivalent conditions that work if the function is differentiable. In terms of the gradient, for all $x, y \in \R^d$,

$$(x - y)^T \left( \nabla f(x) - \nabla f(y) \right) \ge 0,$$

and in terms of the Hessian, for all $x \in \R^d$ and all $u \in \R^d$

$$u^T \nabla^2 f(x) u \ge 0;$$

this is equivalent to saying that the Hessian is positive semidefinite.

Question: What are some examples of hypotheses and loss functions that result in a convex objective??


An even easier case than convexity: strong convexity.

A function is strongly convex with parameter $\mu > 0$ if for all $x, y \in \R^d$,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2} \norm{x - y}^2$$

or equivalently, for all $x \in \R^d$ and all $u \in \R^d$

$$u^T \nabla^2 f(x) u \ge \mu \norm{u}^2.$$

One (somewhat weaker) condition that is implied by strong convexity is

$$\norm{\nabla f(x)}^2 \ge 2 \mu \left( f(x) - f^* \right);$$

this is sometimes called the Polyak-Lojasiewicz condition.

  • A useful note: you can make any convex objective strongly convex by adding $\ell_2$ regularization.

Gradient descent on strongly convex objectives.

As before, let's look at how the objective changes over time as we run gradient descent with a fixed step size.

  • This is a standard approach when analyzing an iterative algorithm like gradient descent. From our proof before, we had (as long as $1 \ge \alpha L$)
$$f(w_t) \le f(w_t) - \frac{\alpha}{2} \cdot \| \nabla f(w_t) \|^2.$$

Now applying the Polyak-Lojasiewicz condition condition to this gives us

$$ f(w_{t+1}) \le f(w_t) - \alpha \mu \left( f(w_t) - f^* \right).$$

Subtracting the global optimum $f^*$ from both sides produces

$$ f(w_{t+1}) - f^* \le f(w_t) - f^* - \alpha \mu \left( f(w_t) - f^* \right) = (1 - \alpha \mu) \left( f(w_t) - f^* \right). $$

So we got that

$$f(w_{t+1}) - f^* \le (1 - \alpha \mu) \left( f(w_t) - f^* \right).$$

But this means that

\begin{align*} f(w_{t+2}) - f^* &\le (1 - \alpha \mu) \left( f(w_{t+1}) - f^* \right) \\&\le (1 - \alpha \mu) \left( (1 - \alpha \mu) \left( f(w_t) - f^* \right) \right) \\&\le (1 - \alpha \mu)^2 \left( f(w_t) - f^* \right). \end{align*}

Applying this inductively, after $K$ total steps

$$f(w_K) - f^* \le (1 - \alpha \mu)^K \left( f(w_0) - f^* \right) \le \exp(-\alpha \mu K) \cdot \left( f(w_0) - f^* \right).$$

This shows that, for strongly convex functions, gradient descent with a constant step size converges exponentially quickly to the optimum. This is sometimes called convergence at a linear rate.

If we use the largest step size that satisfies our earlier assumption that $1 \ge \alpha L$ (i.e. $\alpha = 1/L$), then this rate becomes

$$ f(w_K) - f^* \le \exp\left(-\frac{\mu K}{L} \right) \cdot \left( f(w_0) - f^* \right).$$

Equivalently, in order to ensure that $f(w_K) - f^*$ for some target error $\epsilon > 0$, it suffices to set the number of iterations $K$ large enough that

$$K \ge \frac{L}{\mu} \cdot \log\left( \frac{f(w_0) - f^*}{\epsilon} \right).$$

The main thing that affects how large $K$ needs to be here is the ratio $L / \mu$, which is a function just of the problem itself and not of how we initialize or how accurate we want our solutions to be. We call the ratio

$$\kappa = \frac{L}{\mu}$$

the condition number of the problem. The condition number encodes how hard a strongly convex problem is to solve.

  • Observe that this ration is invariant to scaling of the objective function $f$.

When the condition number is very large, the problem can be very hard to solve, and take many iterations of gradient descent to get close to the optimum. Even if the cost of running gradient descent is tractible as we scale, if scaling a problem causes the condition number to blow up, this can cause issues!

What affects the condition number? What can we do to make the condition number smaller?
In [ ]: