# 3Calculus, Optimization, Analysis

I assume no refresher is needed for most single-variable calculus. But it is useful to have some facts about multi-variable calculus and a few results from mathematical analysis at hand when designing and analyzing numerical methods, and the reader may be forgiven for not remembering all of this in the notation that I find most comfortable.

## 3.1 Multivariate differentiation

A function $$f : \mathbb{R}^n \rightarrow \mathbb{R}^m$$ as a directional derivative (sometimes called a Gateaux derivative) at $$x$$ in the direction $$u$$ if $$g(s) = f(x+su)$$ is differentiable at $$s = 0$$. In this case, we define the directional derivative $\frac{\partial f}{\partial u}(x) = \left. \frac{d}{ds} \right|_{s=0} f(x+su).$ A function is Frechet differentiable if there is a linear function $$f'(x) : \mathbb{R}^n \rightarrow \mathbb{R}^m$$, called the Frechet derivative or the Jacobian, such that for any direction $$u$$ $f(x+su) = f(x) + s f'(x) u + r(s),$ where the remainder term $$r(s)$$ is $$o(s)$$, i.e. $$r(s)/s \rightarrow 0$$ as $$s \rightarrow 0$$. A Frechet differentiable function clearly also has Gateaux derivatives in every direction, with $\frac{\partial f}{\partial u}(x) = f'(x) u.$

If we assume an inner product, the gradient $$\nabla f(x)$$ is the unique vector such that $\langle u, \nabla f(x) \rangle = f'(x) u.$ For $$\mathbb{C}^n$$ with the standard inner product, the gradient is just the conjugate transpose of the derivative, but other inner products give other notions of gradients. The negative gradient vector gives the direction of steepest descent with respect to the norm associated with the inner product. In some cases, it is also interesting to consider the direction of steepest descent the respect to other norms than Euclidean norms – e.g. the 1-norm.

The Frechet derivative of a function $$f$$ may itself be Frechet differentiable. That is, we may have a linear function $$f''(x) : \mathbb{R}^n \rightarrow \mathbb{R}^{n \times m}$$ such that $$f'(x+su) = f'(x) + s f'(x) u + o(s)$$. Rather than thinking of this as a linear map that produces linear maps, we generally think of $$f''(x)$$ as a multilinear map that takes two vectors in $$\mathbb{R}^n$$ as input and yields a vector in $$\mathbb{R}^m$$ as output. If $$u$$ and $$v$$ are the input arguments, we can write the components of the output of this map as $(f''(x)[u, v])_i = \sum_{j,k} f_{i,jk} u_j v_k,$ where we use the compact “indicial notation” $f_{i,jk} \equiv \frac{\partial^2 f_i}{\partial x_j \partial x_k} (x).$ When the partials are continuous near $$x$$, they commute, i.e. $$f_{i,jk} = f_{i,kj}$$. It is sometimes convenient to adopt the Einstein summation convention where we assume that repeated indices in a product are meant to be summed, and drop the explicit symbol $$\sum_{i,j}$$. For a function $$\phi : \mathbb{R}^n \rightarrow \mathbb{R}$$, the Hessian is the matrix $$H_{\phi}$$ of second partial derivatives $$[H_\phi]_{ij} = \phi_{,ij}$$. The Hessian matrix is symmetric (or Hermitian, in the complex case) and naturally represents a quadratic form.

A nice notational convention, sometimes called variational notation (as in “calculus of variations”) is to write $\delta f = f'(x) \delta u,$ where $$\delta$$ should be interpreted as “first order change in,” so that a symbol like $$\delta u$$ is interpreted as a single object rather than a product of a scalar $$\delta$$ and the direction $$u$$. In introductory calculus classes, this sometimes is called a total derivative or total differential, though there one usually uses $$d$$ rather than $$\delta$$. There is a good reason for using $$\delta$$ in the calculus of variations, though, so that’s typically what I do.

### Chain rule

The chain rule tells us we can interchange linearization and composition of functions. If $$f : \mathbb{R}^n \rightarrow \mathbb{R}^m$$ and $$g : \mathbb{R}^p \rightarrow \mathbb{R}^n$$, then near a given $$y = g(x)$$ and $$z = f(y)$$ we have \begin{align} f(g(x+su)) &= f(y + s g'(x) u + o(s)) \\ &= z + s f'(y) g'(x) u + o(s). \end{align} Using variational notation, $\delta z = f'(y) \delta y, \quad \delta y = g'(x) \delta x.$ or, putting things together, $\delta z = f'(y) g'(x) \delta x.$ When we evaluate the composite function, the dependency between them means we generally first compute $$x$$, then $$y$$, then $$z$$. But associativity makes it easy to also reorder the expression as $\delta z = \left( f'(y) g'(x) \right) \delta x,$ i.e. we can compute the matrix $$f'(y) g'(x)$$ first, and then multiply by $$\delta x$$. Because this association proceeds “backwards” from the outputs to the inputs, it is sometimes called “backpropagation.”

Another way of writing the same equation is to think of computing the gradient (in the standard inner product): $\delta z = \langle \delta x, \nabla (f \circ g) \rangle$ where $\nabla_x (f \circ g) = (f'(y) g'(x))^T = g'(x)^T f'(y)^T = \nabla g(x) \nabla f(y).$

### Implicit functions

Suppose $$F : \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^n$$ is continuously differentiable, and write the Jacobian in block form as $F'(x,y) = \begin{bmatrix} \frac{\partial F}{\partial x} & \frac{\partial F}{\partial y} \end{bmatrix}.$ The implicit function theorem tells us that if $$F(x_0, y_0) = 0$$ and $$\partial F/\partial u$$ is nonsingular at $$(x_0, y_0)$$, then in a neighborhood $$\Omega$$ containing $$y_0$$ we can locally define a continuously differentiable function $$g : \mathbb{R}^m \rightarrow \mathbb{R}^n$$ such that $x_0 = g(y_0) \mbox{ and } F(g(y), y) = 0.$ As long as it is defined, we can differentiate such a $$g$$ using the chain rule. For conciseness, write $$u = (g(y), y)$$; then $\frac{\partial F}{\partial x}(u) g'(y) + \frac{\partial F}{\partial y}(u) = 0,$ and so $g'(y) = -\left( \frac{\partial F}{\partial x}(u) \right)^{-1} \left( \frac{\partial F}{\partial y}(u) \right).$ In variational notation, we would usually say that if $$x = g(y)$$, we have the variational relation $\frac{\partial F}{\partial x} \delta x + \frac{\partial F}{\partial y} \delta y = 0.$ This variational notation often simplifies life, particularly if the arguments to $$F$$ are really matrices.

As an example of using variational notation to represent differentiation of an implicit function, consider the problem of differentiating $$A^{-1}$$ with respect to every element of $$A$$. I would compute this by thinking of the relation between a first-order change to $$A^{-1}$$ (written $$\delta [A^{-1}]$$) and a corresponding first-order change to $$A$$ (written $$\delta A$$). Using the product rule and differentiating the relation $$I = A^{-1} A$$, we have $0 = \delta [A^{-1} A] = \delta [A^{-1}] A + A^{-1} \delta A.$ Rearranging a bit gives $\delta [A^{-1}] = -A^{-1} [\delta A] A^{-1}.$ One can do this computation element by element, but it’s harder to do it without the computation becoming horrible.

## 3.2 Taylor approximation

### Single variable

If $$f : \mathbb{R}\rightarrow \mathbb{R}$$ is differentiable, then by the fundamental theorem of calculus, we have $f(x+z) = f(x) + \int_{0}^{z} f'(s) \, ds.$ With two derivatives, we can integrate again to get \begin{align} f(x+z) &= f(x) + \int_{0}^z \left( f'(x) + \int_0^s f''(x+t) \, dt \right) \, ds \\ &= f(x) + f'(x) z + \int_0^z \int_0^{s_1} f''(x+s_2) \, ds_2 \, ds_1 \end{align} Continuing in this manner, for a $$k+1$$-times differentiable function, we have $f(x+z) = \sum_{j=0}^k \frac{1}{j!} f^{(j)} z^j + r(z)$ where $r(z) = \int_0^z \int_0^{s_1} \ldots \int_0^{s_k} f^{(k+1)}(x+s_{k+1}) \, ds_{k+1} \, \ldots \, ds_2 \, ds_1.$ If $$f$$ has $$k+1$$ continuous derivatives (i.e. $$f \in C^{k+1}$$), then we can apply the mean value theorem to write the remainder as $r(z) = \frac{1}{(k+1)!} f^{(k+1)}(x+\xi) z^{k+1}$ for some $$\xi \in [x, x+z]$$; this is the Lagrange form of the remainder. Often, we do not care about writing an exact formula for the remainder, and we will simply write $r(z) = o(z^k)$ if we are not assuming continuity of the $$k+1$$ derivative, or $r(z) = O(z^{k+1})$ if we do assume continuity. Here we use little $$o$$ and big $$O$$ in the order notation sense:

• $$r(z) = o(g(z))$$ if for any $$C > 0$$ there is an $$\epsilon > 0$$ such that for all $$|z| < \epsilon$$, $$|r(z)| \leq C g(z)$$. Put differently, $$\lim_{|z| \rightarrow 0} r(z)/g(z) = 0$$.
• $$r(z) = O(g(z))$$ if there is an $$\epsilon > 0$$ and a constant $$C > 0$$ such that for all $$|z| < \epsilon$$, $$|r(z)| \leq C g(z)$$.

We most frequently work with simple linear approximations, i.e. $f(x+z) = f(x) + f'(x) z + O(z^2),$ though sometimes we will work with the quadratic approximation $f(x+z) = f(x) + f'(x) z + \frac{1}{2} f''(x) z^2 + O(z^3).$

### Multivariable case

In more than one space dimension, the basic picture of Taylor’s theorem remains the same. If $$f : \mathbb{R}^n \rightarrow \mathbb{R}^m$$, then $f(x+z) = f(x) + f'(x) z + O(\|z\|^2)$ where $$f'(x) \in \mathbb{R}^{m \times n}$$ is the Jacobian matrix at $$x$$.

If $$\phi : \mathbb{R}^n \rightarrow \mathbb{R}$$, then $\phi(x+z) = \phi(z) + \phi'(x) z + \frac{1}{2} z^T \phi''(z) z + O(\|z\|^3).$ The row vector $$\phi'(x) \in \mathbb{R}^{1 \times n}$$ is the derivative of $$\phi$$. A point at which the derivative is zero is a stationary point. The Hessian matrix $$\phi''(z)$$ is the matrix of second partial derivatives of $$\phi$$. The Hessian represents a quadratic form, and the inertia of the form (the number of positive, negative, and zero eigenvalues) can sometimes be used to tell us if a stationary point represents a local minimum or maximum (the so-called second derivative test).

Low-order Taylor expansions of multivariate functions are notationally nice, and we will rarely need to go beyond them. In the case that we do need to go further, we will use indicial notation with the summation convention, e.g. $f_i(x+u) = f_i(x) + f_{i,j}(x) u_j + \frac{1}{2} f_{i,jk}(x) u_j u_k + \frac{1}{6} f_{i,jkl}(x) u_j u_k u_l + \ldots.$

### Finite differencing

Suppose $$f : \mathbb{R}^n \rightarrow \mathbb{R}^m$$ is twice continuously differentiable. Then taking the Taylor expansion with remainder $f'(x + hu) = f'(x) + h f'(x) u + \frac{h^2}{2} f''(x+\xi u)[u, u],$ we have that $\frac{f(x+hu)-f(x)}{h} = f'(x) u + \frac{h}{2} f''(x+\xi u)[u,u] = f'(x) + O(h).$ Therefore, we can approximate $$f'(x) u$$ by a finite difference. We can also use a centered finite difference for higher order accuracy (assuming continuous third derivatives): $\frac{f(x+hu)-f(x-hu)}{2h} = f'(x) + \frac{h^2}{6} f'''(x+\xi u)[u,u,u] = f'(x) + O(h^2).$ Among other things, finite difference approximations are extremely useful when we want to sanity check an analytical formula for a derivative.

### Matrix series

Let $$f : \mathbb{C} \rightarrow \mathbb{C}$$ be represented near 0 by a power series $f(z) = \sum_{j=0}^\infty c_j z^j,$ and suppose that the series converges absolutely for $$|z| < \rho_f$$. Let $$A \in \mathbb{C}^{n \times n}$$ be a matrix such that for some consistent norm, $$\|A\| < \rho_f$$. By consistency of norms, for all $$j \geq 0$$, $\|A^j\| \leq \|A\|^j,$ and together with the triangle inequality, we have $\left\| \sum_{j=m}^n c_j A^j \right\| \leq \sum_{j=m}^n |c_j| \|A\|^j.$ Therefore, the partial sums of $$\sum_{j=0}^\infty c_j A^j$$ form a Cauchy sequence in the matrix space, and must converge to some limit $$f(A)$$, with the bound $\|f(A)\| \leq \sum_{j=0}^\infty |c_j| \|A\|^j.$ More broadly, if $$f$$ converges in an open set containing all the eigenvalues, then $$f(A)$$ converges.

### Neumann series

We don’t need to remember a library of Taylor expansions, but it is useful to remember that for real $$|\alpha| < 1$$, we have the geometric series $\sum_{j=0}^\infty \alpha^j = (1-\alpha)^{-1}.$ One of the most important matrix series is the Neumann series, which is the matrix generalization of the geometric series. If $$\|A\| < 1$$, then $$I-A$$ is invertible and has the convergent Neumann series $(I-A)^{-1} = \sum_{j=0}^\infty A^j.$ Using the norm bounds described above, along with convergence of the Neumann series, we have $\|(I-A)^{-1}\| \leq \sum_{j=0}^\infty \|A\|^j = \left( 1-\|A\| \right)^{-1}.$

## 3.3 Optimization

### Derivative tests

Suppose $$\phi : \mathbb{R}^n \rightarrow \mathbb{R}$$ is differentiable at $$x$$. Then $\phi(x+s \nabla \phi(x)) = \phi(x) + s \|\phi'(x)\|^2 + r(s), \quad r(s) = o(s).$ If $$\phi'(x)$$ is nonzero, then there is some $$\epsilon > 0$$ so that for all $$0 < s < \epsilon$$, $\phi(x-s \nabla \phi(x)) < \phi(x) < \phi(x+s \nabla \phi(x)).$ Hence, if $$\phi'(x)$$ is nonzero, then $$x$$ cannot be either a local minimizer or a local maximizer of $$\phi$$. Taking the contrapositive: if $$\phi$$ is Frechet differentiable on all of $$\mathbb{R}^n$$, any local minimizer or maximizer must occur at a stationary point, i.e. a point where the derivative is zero.

Suppose $$\phi$$ is twice differentiable near $$x$$, and that $$x$$ is a stationary point (so $$\phi(x) = 0$$). Then $\phi(x+u) = \phi(x) + \frac{1}{2} u^T \phi''(x) u + o(s^2).$ Hence, near $$x$$ the dominant term in the Taylor expansion is the quadratic form associated with the Hessian $$\phi''(x)$$. Analyzing the Hessian gives us the second derivative test:

• When the Hessian is positive definite, the quadratic term is positive for $$u \neq 0$$, and so the stationary point is a strong local minimum.
• When the Hessian is negative definite, the quadratic term is negative for $$u \neq 0$$, and so the stationary point is a strong local minimum.
• When the Hessian has both positive and negative eigenvalues (directions of positive and negative curvature), the stationary point is neither a maximum nor a minimum, but a saddle point.
• When the Hessian is positive semidefinite or negative semidefinite, it could be a local minimum or a local maximum – but one needs to check higher order derivatives to be sure.

### Equality constraints

Now suppose we want to minimize $$\phi : \mathbb{R}^n \rightarrow \mathbb{R}$$ over $$\Omega \subset \mathbb{R}^n$$ defined by equality constraints: $\Omega = \{x \in \mathbb{R}^n : \forall i \in [m], c_i(x) = 0 \}.$ for some $$c : \mathbb{R}^n \rightarrow \mathbb{R}^m$$. We assume both $$\phi$$ and $$c$$ are continuously differentiable.

If $$c'(x)$$ is full (row) rank, then there is an $$(n-m)$$-dimensional space of tangent directions to $$\Omega$$ at $$x$$, given by $$\{ \delta x : c'(x) \delta x = 0 \}.$$ The point $$x$$ cannot be a minimum or maximum if it obviously increases or decreases in one of these tangent directions; that is, a first-order necessary condition for $$x$$ to be an extremum is that $c'(x) \delta x = 0 \implies \phi'(x) \delta x = 0.$ The row vectors of $$c'(x)$$ form a basis for all row vectors that satisfy this condition. That is, in order for $$\phi'(x)$$ to satisfy the condition, we must be able to uniquely write $$\phi'(x)$$ as a linear combination of the row vectors of $$c'(x)$$. Put differently, we require the (unique) coefficients $$\lambda$$ so that $\phi'(x) + \sum_{i=1}^m \lambda_i c_i'(x) = \phi'(x) + \lambda^T c'(x) = 0.$ The coefficients $$\lambda_i$$ are known as Lagrange multipliers, and we can interpret this linear combination as the gradient of the Lagrangian function $L(x, \lambda) = \phi(x) + \lambda^T c(x).$

The stationarity of the Lagrangian gives us the analogue of the first derivative test in the uncase. The analogue of the second derivative test looks at the quadratic form associated with the Hessian $$\phi''$$ in directions that are consistent with the constraints. That is, suppose $$x_*$$ is a stationary point for the Lagrangian with full rank $$c'(x_*)$$, and let $$U$$ be a basis for the null space of $$c'(x_*)$$. Then

• We have a strong local minimum if $$U^* \phi''(x_*) U$$ is positive definite.
• We have a strong local maximum if $$U^* \phi''(x_*) U$$ is negative definite.
• We have a saddle if $$U^* \phi''(x_*) U$$ is indefinite.

As before, we need to consider higher derivatives if we want to diagnose the case where $$U^* \phi''(x_*) U$$ is positive or negative semidefinite.

Working with an explicit null space basis is often inconvenient, particularly for high-dimensional problems with a small number of constraints. In this case, an alternate form of the second derivative test involves looking at the bordered matrix which has the block form $H = \begin{bmatrix} \phi''(x_*) & c'(x_*)^T \\ c'(x) & 0 \end{bmatrix}.$ In this setting, the matrix $$H$$ is always indefinite, but we can write a version of the second derivative test in terms of the inertia:

• We have a strong local minimum if $$H$$ has $$n-m$$ positive eigenvalues.
• We have a strong local maximum if $$H$$ has $$n-m$$ negative eigenvalues.
• We have a saddle point if $$H$$ has at fewer than $$n-m$$ positive eigenvalues and fewer than $$n-m$$ negative eigenvalues.

### KKT conditions

Now suppose that we seek to optimize $$\phi : \Omega \rightarrow \mathbb{R}$$ where $$\Omega$$ is defined by equality and inequality constraints: $\Omega = \{ x \in \mathbb{R}^n : c_i(x) = 0, i \in \mathcal{E}\mbox{ and } c_i(x) \leq 0, i \in \mathcal{I}\},$ where $$\mathcal{E}$$ and $$\mathcal{I}$$ are index sets associated with equality and inequality constraints, respectively. Now we define the augmented Lagrangian $L(x, \lambda, \mu) = \phi(x) + \sum_{i \in \mathcal{E}} \lambda_i c_i(x) + \sum_{i \in \mathcal{I}} \mu_i c_i(x).$ The Karush-Kuhn-Tucker (KKT) conditions are first-order conditions for $$x_*$$ to be a constrained minimizer or maximizer: \begin{align*} \nabla_x L(x_*) &= 0 \\ c_i(x_*) &= 0, \quad i \in \mathcal{E} & \mbox{equality constraints}\\ c_i(x_*) & \leq 0, \quad i \in \mathcal{I} & \mbox{inequality constraints}\\ \mu_i & \geq 0, \quad i \in \mathcal{I} & \mbox{non-negativity of multipliers}\\ c_i(x_*) \mu_i &= 0, \quad i \in \mathcal{I} & \mbox{complementary slackness} \end{align*} The satisfaction of the equality and inequality constraints is also called primal feasibility, while the satisfaction of the non-negativity of the multipliers is called dual feasibility. We say an inequality constraint is active when the associated inequality is actually zero. As with the equality-constrained case, we need a condition on the constraints to avoid degeneracy, sometimes called a constraint qualification condition. The most frequently used constraint qualification condition is that the gradient of the active constraint terms should be linearly independent (sometimes known as LICQ: Linearly Independent Constraint Qualification).

The second derivative test in the inequality constrained case is basically the same as the test in the equality constrained case.

### Physical interpretation

A physical picture is often a useful device for remembering the stationarity conditions for optimization problems. In the unconstrained case, we can think about solving a minimization problem by rolling a tiny ball down hill until it came to rest. If we wanted to solve a constrained minimization problem, we could build a wall between the feasible and the infeasible region. A ball rolling into the wall can roll freely in directions tangent to the wall (or away from the wall) if those directions were downhill; at a constrained miminizer, the force pulling the ball downhill is perfectly balanced against an opposing force pushing into the feasible region in the direction of the normal to the wall. If the feasible region is $$\{x : c(x) \leq 0\}$$, the normal direction pointing inward at a boundary point $$x_*$$ s.t.~$$c(x_*) = 0$$ is proportional to $$-\nabla c(x_*)$$. Hence, if $$x_*$$ is a constrained minimum, we expect the sum of the rolling downhill’’ force ($$-\nabla \phi$$) and something proportional to $$-\nabla c(x_*)$$ to be zero: $-\nabla \phi(x_*) - \mu \nabla c(x_*) = 0.$ The Lagrange multiplier $$\mu$$ in this picture represents the magnitude of the restoring force from the wall balancing the tendency to roll downhill.

### Convexity

A function $$\phi : \mathbb{R}^n \rightarrow \mathbb{R}$$ is convex if for any distinct $$x, y \in \mathbb{R}^n$$ and for all $$\alpha \in (0, 1)$$ $\phi\left(\alpha x + (1-\alpha)y\right) \geq \alpha \phi(x) + (1-\alpha) \phi(y).$ We say $$\phi$$ is strictly convex if the inequality is strict.

A subset $$\Omega \subset \mathbb{R}^n$$ (or, more generally, a subset of a vector space) is said to be convex if for any $$x, y \in \Omega$$ and all $$\alpha \in (0,1)$$, the points $$\alpha x + (1-\alpha) y$$ lie in $$\Omega$$. We say $$\Omega$$ is strictly convex if the points $$\alpha x + (1-\alpha) y$$ lie in the interior of $$\Omega$$. Convex sets are closed under intersection and direct sum.

The definition of a convex set is arguably fundamental than the notion of a convex function, as we often express arguments about the latter in terms of the former via the epigraph of the function.
The epigraph (or supergraph of a function $$\phi : \mathbb{R}^n \rightarrow \mathbb{R}$$ is the set on or above the graph of $$phi$$, i.e. $\operatorname{epi}(\phi) = \{ (x,y) \in \mathbb{R}^n \times \mathbb{R}: y \geq f(x) \}.$ A function $$\phi$$ is convex iff the epigraph is a convex set.

If $$\Omega$$ is a convex set and $$x \in \Omega$$ is a boundary point, then there is a supporting hyperplane at $$x$$ defined by a functional $$w_*$$ such that $$y \in \Omega \implies w_* (y-x) \geq 0$$. In the case of a strictly convex set, the equality can only hold when $$y = x$$. For functions $$\phi$$, this means that there is some dual vector $$w$$ such that for all $$z$$, $\phi(x+z) \geq \phi(x) + w^* z,$ and for a strictly convex function, equality only holds when $$z = 0$$. When $$\phi$$ is differentiable, $$w^* = \phi'(x)$$. At points where $$\phi$$ is not differentiable, there will often be several possible choices for $$w$$. The collection of such choices makes up the subgradient at $$x$$. For convex functions, we generalize the notion of a stationary point to mean “point $$x$$ at which the subgradient contains zero, i.e. $$\phi(x+z) \geq \phi(x)$$.”

Convex functions are particularly nice for optimization. A convex function does not need to have a minimum or a maximum on a convex set (for example the exponential function on the real line has neither). But if $$\phi$$ is convex and we consider optimization over a conves set $$\Omega$$, then

• an interior point is a minimizer iff it is a stationary point;
• all minimizers are global minimizers;
• the set of all minimizers is convex;

If $$\phi$$ is strictly convex, then there is at most one global minimizer.

For functions that are twice differentiable, convexity just means that the Hessian matrix is positive semidefinite everywhere. But many important nonsmooth functions are also convex. For example, norms must be convex, but cannot be differentiable.

## 3.4 Metric spaces

A metric space is a set $$\Omega$$ together with a distance function (or metric) $$d$$ satisfying for all $$x, y, z \in \Omega$$

• Symmetry: $$d(x,y) = d(y,x)$$
• Positive definiteness: $$d(x,y) \geq 0$$ with equality iff $$x = y$$
• Triangle inequality: $$d(x,y) \leq d(x,z) + d(z,y)$$

Any normed vector space is also a metric space with the norm $$d(x,y) = \|x-y\|$$. Also, any subset of a metric space is a metric space. The topology associated with a metric space is analogous to that of the reals: a subset $$\mathcal{U}\subset \Omega$$ is open if for any $$x \in \mathcal{U}$$ there is an $$\epsilon > 0$$ such that the ball $$B_{\epsilon}(u) = \{ v \in \Omega : d(v,u) < \epsilon \}$$ is contained within $$\mathcal{U}$$.

A Cauchy sequence in a metric space is a sequence $$x_1, x_2, \ldots$$ such that for any $$\epsilon > 0$$, points far enough out in the sequence are all within $$\epsilon$$ of each other (i.e. $$\forall \epsilon >0, \exists N : \forall j, k \geq N, d(x_j, x_k) < \epsilon$$). A metric space is complete if all Cauchy sequences converge. Any closed subset of a complete metric space is itself complete.

A complete normed vector space is called a Banach space. Any finite-dimensional normed vector space over a complete field like $$\mathbb{R}$$ or $$\mathbb{C}$$ is a Banach space. nfinite-dimensional normed vector spaces over $$\mathbb{R}$$ or $$\mathbb{C}$$ do not always need to be complete, but most infinite-dimensional normed vector spaces we use regularly are complete.

The metric space $$\Omega$$ is compact if any open cover of $$\Omega$$ has a finite subcover. In the case of finite-dimensional normed vector spaces, any closed and bounded subset is compact (this is not true in infinite-dimensional normed vector spaces). One reason we care about compactness is that any continuous function on a compact set achieves a minimum an maximum value.

## 3.5 Lipschitz constants

Suppose $$f : \Omega \subset \mathcal{U}\rightarrow \mathcal{V}$$ is a map between metric spaces. We say $$f$$ is Lipschitz with constant $$M$$ if for all $$x, y \in \Omega$$, $d(f(x), f(y)) \leq M d(x,y).$ The concept of Lipschitz continuity is broadly useful in analysis.

If $$\mathcal{U}$$ and $$\mathcal{V}$$ are normed vector spaces and $$f$$ is continuously differentiable on $$\Omega$$, any bound on $$\|f'(x)\|$$ over $$\Omega$$ is a Lipschitz constant (and the tightest Lipschitz constant is $$\sup_{x \in \Omega} \|f'(x)\|$$). But a function can easily be Lipschitz even if it is not differentiable; for example, the absolute value function on $$\mathbb{R}$$ is Lipschitz with constant 1.

Having a Lipschitz constant is not as nice as having a derivative. However, we get some of the same nice properties For example, if $$f$$ and $$g$$ are Lipschitz functions with constants $$M$$ and $$N$$ and the composition $$f \circ g$$ makes sense, then $$f \circ g$$ is Lipschitz with constant $$M N$$. If $$f + g$$ makes sense, then it is Lipschitz with constant $$M+N$$. If $$f$$ and $$g$$ are Lipschits and bounded and the product $$\langle f, g \rangle$$ makes sense, then $$\langle f, g \rangle$$ is Lipschitz with constant $$M \max \|g\| + N \max \|f\|$$. And if $$f$$ is $$k$$-times continuously differentiable and the $$k$$th derivative has Lipschitz constant $$M$$, then we have that the residual error in Taylor approximation through the $$k$$th degree term is bounded by $$Mr^{k+1}/(k+1)!$$, where $$r$$ is the distance from the center of the Taylor series.

## 3.6 Contraction mappings

A contraction mapping $$G : \Omega \rightarrow \Omega$$ is a Lipschitz function on a set $$\Omega$$ with constant $$\alpha < 1$$. We sat the map $$G$$ is locally contractive near $$x$$ if it is Lipschitz with constant $$\alpha < 1$$ in some local neighborhood of $$x$$. Contraction mappings are a useful tool both for showing the existence and uniqueness of solutions to systems of equations (or optimization problems) and for constructing algorithms to find such solutions.

### Banach fixed point theorem

Assuming $$\Omega$$ is a closed subset of a Banach space1, then $$G$$ has a unique fixed point $$x_* \in \Omega$$, i.e. a unique point such that $$G(x_*) = x_*$$. This fact is variously called the contraction mapping theorem and the Banach fixed point theorem. The proof is interesting because it is a construction that can be carried out numerically. Let $$x_0 \in \Omega$$ be an arbitrary starting point, and consider the fixed point iteration $$x_{k+1} = G(x_k)$$. By contractivity, $\|x_{k+1} - x_k\| = \|G(x_k) - G(x_{k-1})\| \leq \alpha \|x_k-x_{k-1}\|,$ and by induction on this fact, $\|x_{k+1} - x_k\| \leq \alpha^k \|x_1-x_0\|.$ For any $$l > k$$, we have \begin{align} \|x_l - x_k\| & = \left\| \sum_{j=k}^{l-1} (x_{j+1}-x_j) \right\| \\ & \leq \sum_{j=k}^{l-1} \|x_{j+1}-x_j\| \\ & \leq \sum_{j=k}^{l-1} \alpha^j \|x_1-x_0\| \\ & \leq \alpha^k \frac{\|x_1-x_0\|}{1-\alpha}. \end{align} Therefore, we have a Cauchy sequence that converges to a limit point $$x_*$$, which is the fixed point. Uniqueness comes from the fact that if $$x_*$$ and $$x'_*$$ are both fixed points in $$\Omega$$, then $\|x_*-x'_*\| = \|G(x_*)-G(x'_*)\| \leq \alpha \|x_*-x'_*\|,$ which implies that $$\|x_*-x'_*\| = 0$$, so $$x_* = x'_*$$. Moreover, at any given step $$k$$, we have the error bound $\|x_k-x_*\| \leq \frac{\|x_{k+1}-x_k\|}{1-\alpha}.$

### Local convergence

Now suppose that $$G$$ has a fixed point $$x_*$$, and $$\|G'(x)\| \leq \alpha < 1$$ over some closed ball $$\bar{B}_{\rho}(x_*) = \{x : \|x-x_*\| \leq \rho\}$$. Then $\forall x \in \bar{B}_\rho(x_*), \|G(x)-x_*\| = \|G(x)-G(x_*)\| \leq \alpha \|x-x_*\| < \|x-x_*\|$ and so $$G$$ maps the ball into itself. Therefore, $$G$$ is a contraction mapping on $$\bar{B}_\rho(x_*)$$, and fixed point iteration from any starting point in that ball will converge to the unique fixed point $$x_*$$ within the ball.

### Preventing escape

The contraction mapping theorem is useful both for telling us that a fixed point exists and is unique, and for giving us an iteration that converges to that fixed point. But sometimes it is difficult to get global contractivity. If we know a fixed point exists, we have just shown that a “local” notion of contractivity around that fixed point is enough. But what if we do not have global contractivity and also are not sure that a fixed point exists? Fortunately, a condition preventing “escape” from a local region of contractivity is sometimes good enough.

For example, suppose $$\|G'(x_0)\| \leq \alpha$$ and $$G'$$ is Lipschitz with constant $$M$$. Consider the fixed point iteration $$x_{k+1} = G(x_k)$$ starting from $$x_0$$, and let $$d_1 = \|x_1-x_0\|$$. Then if $$\alpha' = \alpha + Md_1 < 1$$, we can show

• $$G$$ is Lipschitz with constant $$\alpha'$$ on a ball of radius $$d_1/(1-\alpha')$$ about $$x_0$$.
• By an induction: The iterates satisfy $$\|x_k-x_0\| \leq \frac{1-(\alpha')^k}{1-\alpha'} d_1 < d_1/(1-\alpha'),$$ i.e. the iterates stay in the ball; and therefore they continue to satisfy $$\|x_{k+1}-x_k\| \leq (\alpha')^k d_1$$.

Therefore in this situation as well, the iterates converge to a fixed point that is at most $$d_1/(1-\alpha')$$ away from the starting point. As with the contractive mapping theorem, this is enough for us to show that we can show by induction that the iteration remains within a ball of radius $$d_1/(1-\alpha')$$ around $$x_0$$, that it converges to some $$x_*$$ in that ball, and that the convergence is geometric with rate constant $$\alpha$$.

1. The Banach fixed point theorem applies to any complete metric space. But all the examples in this class will be closed subsets of Banach spaces, so we will stick to that setting.↩︎