We want to minimize a loss function $\ell(\mathbf{w})$.

In this section we discuss two of the most popular "hill-climbing" algorithms, gradient descent and Newton's method.

Initialize $\mathbf{w}^0$

Repeat until convergence:

$\mathbf{w}^{t+1} = \mathbf{w}^t + \mathbf{s}$

If $\|\mathbf{w}^{t+1} - \mathbf{w}^t\|_2 < \epsilon$, stop (converged).

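How can you minimize a function $\ell$ if you don't know much about it? The trick is to assume it is much simpler than it really is. This can be done with Taylor's approximation. Provided that the norm $\|\mathbf{s}\|_2$ is small (i.e., $\mathbf{w}+\mathbf{s}$ is very close to $\mathbf{w}$), we can approximate $\ell(\mathbf{w}+\mathbf{s})$ by its first-order or second-order expansion:

$\ell(\mathbf{w} + \mathbf{s}) \approx \ell(\mathbf{w}) + g(\mathbf{w})^\top \mathbf{s}$

$\ell(\mathbf{w} + \mathbf{s}) \approx \ell(\mathbf{w}) + g(\mathbf{w})^\top \mathbf{s} + \frac{1}{2}\mathbf{s}^\top H(\mathbf{w})\mathbf{s}$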

Here, $g(\mathbf{w})=\nabla\ell(\mathbf{w})$ is the gradient and $H(\mathbf{w})=\nabla^{2}\ell(\mathbf{w})$ is the Hessian of $\ell$. Both approximations are valid if $\|\mathbf{s}\|_2$ is small, but the second one assumes that $\ell$ is twice differentiable; it is more expensive to compute, but also more accurate than using the gradient alone.
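To make the two approximations concrete, here is a minimal numerical sketch. The one-dimensional loss $\ell(w)=\log(1+e^{w})$ and the helper names `taylor1` and `taylor2` are assumptions chosen for illustration; they do not come from the notes.

```python
import math

def loss(w):
    # Hypothetical 1-D example loss: l(w) = log(1 + e^w)
    return math.log1p(math.exp(w))

def grad(w):
    # First derivative: l'(w) = 1 / (1 + e^{-w})
    return 1.0 / (1.0 + math.exp(-w))

def hess(w):
    # Second derivative: l''(w) = l'(w) * (1 - l'(w))
    p = grad(w)
    return p * (1.0 - p)

def taylor1(w, s):
    # First-order approximation: l(w) + g(w) * s
    return loss(w) + grad(w) * s

def taylor2(w, s):
    # Second-order approximation: adds 0.5 * s * H(w) * s
    return taylor1(w, s) + 0.5 * hess(w) * s * s

w = 0.5
for s in (1.0, 0.1, 0.01):
    exact = loss(w + s)
    err1 = abs(exact - taylor1(w, s))
    err2 = abs(exact - taylor2(w, s))
    print(f"s={s:<5} first-order error={err1:.2e}  second-order error={err2:.2e}")
```

For this particular function both errors shrink as $s$ gets smaller, and the second-order error shrinks faster.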

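In gradient descent we only use the gradient (first order). In other words, we assume that the function $\ell$ around $\mathbf{w}$ is linear and behaves like $\ell(\mathbf{w}) + g(\mathbf{w})^\top\mathbf{s}$. Our goal is to find a vector $\mathbf{s}$ that minimizes this function. In steepest descent we simply set

$\mathbf{s} = -\alpha g(\mathbf{w})$,

for some small $\alpha > 0$. It is straightforward to show that in this case $\ell(\mathbf{w}+\mathbf{s})<\ell(\mathbf{w})$ whenever $g(\mathbf{w}) \neq \mathbf{0}$ and $\alpha$ is small enough: the linear approximation gives $\ell(\mathbf{w} - \alpha g(\mathbf{w})) \approx \ell(\mathbf{w}) - \alpha\, g(\mathbf{w})^\top g(\mathbf{w}) < \ell(\mathbf{w})$.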
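The steepest-descent update can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the function name `gradient_descent`, the default values for `alpha`, `eps`, and `max_iters`, and the quadratic test problem are all hypothetical.

```python
import math

def gradient_descent(grad, w0, alpha=0.1, eps=1e-8, max_iters=10_000):
    """Minimize a function given its gradient `grad`, starting from `w0`.

    Returns the final iterate and the number of updates performed.
    """
    w = list(w0)
    for t in range(max_iters):
        g = grad(w)
        s = [-alpha * g_d for g_d in g]              # steepest-descent step s = -alpha * g(w)
        w = [w_d + s_d for w_d, s_d in zip(w, s)]    # w^{t+1} = w^t + s
        if math.sqrt(sum(s_d * s_d for s_d in s)) < eps:  # ||w^{t+1} - w^t||_2 < eps
            return w, t + 1
    return w, max_iters

# Hypothetical test problem: l(w) = (w_1 - 3)^2 + 2 (w_2 + 1)^2, minimized at (3, -1).
def example_grad(w):
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

w_star, iters = gradient_descent(example_grad, [0.0, 0.0])
print(w_star, iters)
```

With a much larger `alpha` (for this problem, anything above 0.5) the iterates overshoot and diverge, which is why the step size has to be chosen with care.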

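Setting the learning rate $\alpha > 0$ takes care: if it is too large, the iterates can overshoot and diverge; if it is too small, convergence is slow. The variant below (Adagrad) instead adapts the step size separately for each dimension $d$, using the accumulated squared gradients $z_d$.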

Initialize $\mathbf{w}^0$ and $\mathbf{z}$: $\forall d$: $w^0_d=0$ and $z_d=0$

Repeat until convergence:

$\mathbf{g}=\frac{\partial \ell(\mathbf{w}^t)}{\partial \mathbf{w}}$ # Compute gradient

$\forall d$: $z_{d}\leftarrow z_{d}+g_{d}^2$

$\forall d$: $w_d^{t+1}\leftarrow w_d^t-\alpha \frac{g_d}{\sqrt{z_d+\epsilon}}$

If $\|\mathbf{w}^{t+1} - \mathbf{w}^t\|_2 < \delta$, stop (converged).
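The per-dimension update above translates almost line by line into code. The sketch below is illustrative only: the function name `adagrad`, the default hyperparameters, the non-zero starting point, and the badly scaled quadratic used as a test problem are assumptions rather than part of the notes.

```python
import math

def adagrad(grad, w0, alpha=0.5, eps=1e-8, delta=1e-6, max_iters=100_000):
    """Adagrad-style descent: each dimension gets its own effective step size."""
    w = list(w0)
    z = [0.0] * len(w)                                # running sum of squared gradients
    for t in range(max_iters):
        g = grad(w)                                   # compute gradient
        z = [z_d + g_d * g_d for z_d, g_d in zip(z, g)]
        w_next = [w_d - alpha * g_d / math.sqrt(z_d + eps)
                  for w_d, g_d, z_d in zip(w, g, z)]
        step_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_next, w)))
        w = w_next
        if step_norm < delta:                         # ||w^{t+1} - w^t||_2 < delta
            return w, t + 1
    return w, max_iters

# Hypothetical badly scaled problem: l(w) = 0.5 * (100 w_1^2 + w_2^2), minimized at (0, 0).
def scaled_grad(w):
    return [100.0 * w[0], w[1]]

w_ada, iters_ada = adagrad(scaled_grad, [1.0, 1.0])
print(w_ada, iters_ada)
```

On this example the two coordinates have gradients that differ by a factor of 100, yet dividing by $\sqrt{z_d+\epsilon}$ makes both coordinates take nearly the same step at every iteration.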

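Newton's method assumes that the loss $\ell$ is twice differentiable and uses the second-order Taylor approximation, which involves the Hessian $H(\mathbf{w})$. If $\ell$ is convex, the Hessian is a symmetric, positive semi-definite matrix.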

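It follows that the approximation

$\ell(\mathbf{w} + \mathbf{s}) \approx \ell(\mathbf{w}) + g(\mathbf{w})^\top \mathbf{s} + \frac{1}{2}\mathbf{s}^\top H(\mathbf{w})\mathbf{s}$

describes a convex parabola, and we can find its minimum by solving the following optimization problem:

$\operatorname{argmin}_{\mathbf{s}}\ \ell(\mathbf{w}) + g(\mathbf{w})^\top \mathbf{s} + \frac{1}{2}\mathbf{s}^\top H(\mathbf{w})\mathbf{s}$

To find the minimum of the objective, we take its first derivative with respect to $\mathbf{s}$, set it to zero, and solve for $\mathbf{s}$ (assuming $H(\mathbf{w})$ is invertible):

$g(\mathbf{w}) + H(\mathbf{w})\mathbf{s} = 0$

$\Rightarrow \mathbf{s} = -[H(\mathbf{w})]^{-1}g(\mathbf{w})$.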
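As an illustration of the Newton step $\mathbf{s} = -[H(\mathbf{w})]^{-1}g(\mathbf{w})$, here is a small two-dimensional sketch. The convex test function $\ell(\mathbf{w}) = e^{w_1+w_2} + e^{w_1-w_2} + e^{-w_1}$, the helper `solve_2x2`, and the starting point are assumptions made for this example. Rather than forming the inverse explicitly, the code solves the linear system $H\mathbf{s} = -g$.

```python
import math

def grad(w):
    # Gradient of the hypothetical loss l(w) = e^{w1+w2} + e^{w1-w2} + e^{-w1}
    a, b, c = math.exp(w[0] + w[1]), math.exp(w[0] - w[1]), math.exp(-w[0])
    return [a + b - c, a - b]

def hess(w):
    # Hessian of the same loss (symmetric 2x2 matrix)
    a, b, c = math.exp(w[0] + w[1]), math.exp(w[0] - w[1]), math.exp(-w[0])
    return [[a + b + c, a - b], [a - b, a + b]]

def solve_2x2(H, g):
    """Solve H s = -g for a 2x2 matrix H using Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Hessian is (numerically) singular")
    return [-(d * g[0] - b * g[1]) / det, -(-c * g[0] + a * g[1]) / det]

def newton(grad, hess, w0, eps=1e-10, max_iters=50):
    w = list(w0)
    for t in range(max_iters):
        s = solve_2x2(hess(w), grad(w))               # Newton step s = -H^{-1} g
        w = [w_d + s_d for w_d, s_d in zip(w, s)]
        if math.sqrt(s[0] ** 2 + s[1] ** 2) < eps:    # ||w^{t+1} - w^t||_2 < eps
            return w, t + 1
    return w, max_iters

w_newton, iters_newton = newton(grad, hess, [1.0, 0.5])
print(w_newton, iters_newton)
```

Setting the gradient of this test function to zero gives the minimizer $w_1 = -\tfrac{1}{2}\ln 2$, $w_2 = 0$, which the iterates can be checked against.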

2. To avoid divergence of Newton's method, a good approach is to start with gradient descent (or even stochastic gradient descent) and then finish the optimization with Newton's method. Typically, the second-order approximation used by Newton's method is more likely to be appropriate near the optimum.
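The hybrid strategy can be sketched on a toy problem. The one-dimensional loss $\ell(w)=\sqrt{1+w^2}$ is an assumption chosen because its Newton update simplifies to $w \leftarrow -w^3$, which diverges whenever $|w|>1$ and converges very quickly whenever $|w|<1$. The function names, the step size, and the number of warm-up steps are likewise hypothetical.

```python
import math

def grad(w):
    # l(w) = sqrt(1 + w^2)  =>  l'(w) = w / sqrt(1 + w^2)
    return w / math.sqrt(1.0 + w * w)

def hess(w):
    # l''(w) = (1 + w^2)^(-3/2), always positive
    return (1.0 + w * w) ** -1.5

def newton_step(w):
    # One Newton update; algebraically equal to -w^3 for this loss
    return w - grad(w) / hess(w)

def hybrid(w0, gd_steps=3, alpha=0.5, eps=1e-10, max_iters=50):
    """Take a few gradient descent steps, then switch to Newton's method."""
    w = w0
    for _ in range(gd_steps):                 # warm-up phase: gradient descent
        w = w - alpha * grad(w)
    for t in range(max_iters):                # finishing phase: Newton's method
        w_next = newton_step(w)
        if abs(w_next - w) < eps:
            return w_next, gd_steps + t + 1
        w = w_next
    return w, gd_steps + max_iters

print(newton_step(2.0))     # pure Newton from w = 2 moves away from the minimum at 0
w_hybrid, updates = hybrid(2.0)
print(w_hybrid, updates)
```

Starting from $w=2$, three gradient descent steps with $\alpha=0.5$ bring the iterate inside $|w|<1$, after which the Newton updates contract rapidly toward the minimizer $w=0$.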

A comparison of Newton's method and gradient descent. Gradient descent converges from every starting point, but only after more than 100 iterations. When it converges (Figure 1), Newton's method is much faster (8 iterations), but it can also diverge (Figure 2). Figure 3 shows the hybrid approach of taking 6 gradient descent steps and then switching to Newton's method; it still converges in only 10 updates.