We assume the reader has taken standard introductory courses in single-variable and multivariable calculus. It will also be helpful to have seen some elements of analysis, including the \(\epsilon-\delta\) definition of continuity. Beyond that, it will be helpful to be familiar with the Julia programming language (Chapter 2) and with the contents of the previous chapter (Chapter 4).
5.1 Formulas and foundations
The modern conception of “calculus” grew out of the 17th century work of Newton and Leibniz. Many elements of calculus preceded Newton and Leibniz; indeed, some arguments with a flavor of calculus were known to Archimedes and other Greek geometers. But Newton and Leibniz created something new by producing “the Calculus, a general symbolic and systematic method of analytic operations, to be performed by strictly formal rules, independent of geometric meaning” (p. 84, Rosenthal 1951). The mechanical nature of calculus means that students and computers can routinely manipulate derivatives and integrals without constantly (or ever) having to think deeply about the geometry of the problem at hand. Indeed, the system of calculus has been generalized in many ways to situations that bear little resemblance to geometric problems at all, but still follow the formal patterns that make the machinery work.
Formal calculations are power tools: easily dispatching intractable problems in the right hands, but dangerous when used incautiously. This problem is not unique to calculus: the road to “proofs” that \(1 =
0\) is lined with use of the identity \(xy/x = y\) without checking if \(x = 0\). Such cautionary examples aside, many formal calculations produce valid results even when they have no right to do so, leaving mathematicians the challenge of figuring out what the “right” set of hypotheses really are.
The development of the calculus was a product of the seventeenth century, but the early formal system was somewhat unsound: the rules were missing important hypotheses, and could potentially lead to incorrect conclusions. Though concerns about the solidity of the foundations of calculus go back to the time of Newton and Leibniz, many mathematicians were perfectly happy to restrict their attention to functions “nice enough” so that the rules of the calculus worked and get on with it1. The project to develop definitions and hypotheses to make the formal rules of calculus sound did not really get fully underway until the nineteenth century, with the work of Cauchy and, even more, of Weierstrass, the “father of modern analysis” and originator of the \(\epsilon-\delta\) definitions and arguments familiar to modern students of calculus and analysis.
The \(\epsilon-\delta\) world of Weierstrass has an asymptotic flavor: for a given \(\epsilon > 0\), we eventually get within \(\epsilon\) of a target, for small enough \(\delta\) or for large enough \(N\). Such arguments and definitions are often very convenient for establishing that a solution to some problem exists, or that it has certain properties. For numerical computation, though, we often want something stronger. Saying a sequence converges to a limit in the long run is all well and good — but fast as computers are, in the long run we are all dead2.
In our review of calculus and analysis, we will give a great deal of attention to the formal machinery of the calculus. But we also want some reassurance that our computations make sense, and so we will take some care to give a proper account of what regularity hypotheses are needed for our calculations to work. We will frequently be somewhat more conservative with these hypotheses than is strictly necessary. This is partly for convenience, but there is more to it. Numerical methods are inherently approximate and finite things, and so we would like strong enough hypotheses to ensure that our calculations become “right enough, fast enough” – preferably with quantifiable meanings for both “right enough” and “fast enough.”
5.2 Metric spaces
We will mostly be concerned with normed vector spaces, but sometimes we will want something a little more general. For the present section, we consider metric spaces and their structure, as this is the minimal structure that we really need for Weierstrass-style \(\epsilon-\delta\) arguments to make sense. A metric space is simply a set \(X\) with a distance function (a metric) \(d : X \times X \rightarrow \mathbb{R}\) with three properties:
Positivity: \(d(x,y) \geq 0\), with \(d(x,y) = 0\) if and only if \(x = y\).
Symmetry: \(d(x,y) = d(y,x)\).
Triangle inequality: \(d(x,z) \leq d(x,y) + d(y,z)\).
A normed vector space is a metric space where the distance is just the norm of the difference, i.e. \(d(u,v) = \|u-v\|\). But there are many other examples that are relevant, including the shortest distance between points in a network or on a manifold, and we will see these in our discussions of nonlinear dimensionality reduction and network analysis.
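As a small illustration (our own example, not from the text), the sketch below spot-checks the three metric properties for the Euclidean distance on \(\mathbb{R}^3\) at a few random points.

let
    # A minimal sketch: the Euclidean metric on R^3, with a numerical spot-check
    # of positivity, symmetry, and the triangle inequality at random points.
    d(u, v) = sqrt(sum(abs2, u-v))
    x, y, z = rand(3), rand(3), rand(3)
    d(x,y) >= 0 && d(x,y) == d(y,x) && d(x,z) <= d(x,y) + d(y,z)
end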
5.2.1 Metric topology
A metric space automatically comes with a metric topology3 where the open sets can be generated from unions of open balls \[
B_{\delta}(x) = \{y \in X : d(x,y) < \delta\}.
\] A set containing such a ball is sometimes called a “neighborhood” of \(x\). For any subset \(\Omega \subset X\), we say \(x \in \Omega\) is an interior point if \(\Omega\) contains a neighborhood of \(x\); otherwise, it is a boundary point. The complements of open sets are closed sets. A closed set contains all its limit points; that is, if \(C \subset X\) is closed, and for every \(\delta > 0\) there is a \(y
\in C\) with \(d(x,y) < \delta\), then \(x\) must also be in \(C\).
The idea of continuity of a function makes sense for any topological spaces, using a definition in terms of open sets: a function \(f : X
\rightarrow Y\) is continuous if for any open \(U \subset Y\), the preimage \(f^{-1}(U) = \{ x \in X : f(x) \in U \}\) is also open. For metric spaces, this coincides with the more usual notion of continuity due to Weierstrass: if \(f(x) = y\), then for every \(\epsilon > 0\) there is some \(\delta > 0\) such that the neighborhood \(B_\delta(x)\) lies in the preimage \(f^{-1}(B_\epsilon(y))\).
5.2.2 Convergence
When we want to numerically find the minimum or maximum of a function, or solve some system of equations, we generally have to deal with two questions:
Does the thing we want even exist?
Can we find it (at least approximately), ideally in a reasonable amount of time?
One standard way to tackle both these questions is to construct a sequence of approximate solutions \(x_1, x_2, \ldots \in X\) in an appropriate metric space \(X\). The sequence converges to a limit \(x_*\) if for every neighborhood \(U\) of \(x_*\), the sequence eventually enters and stays in \(U\). Taking balls of radius \(\epsilon\) as neighborhoods gives us the usual Weierstrass-style definition of convergence: \[
\forall \epsilon > 0, \exists N : \forall n \geq N, d(x_n, x_*) < \epsilon.
\] If a sequence of approximate solutions \(x_j\) converges to a limit \(x_*\) and we have some property that guarantees that a limit of approximate solutions is an exact solution for the problem in hand, then this is a constructive way to get a handle on the thing we want.
A convergent sequence of approximations gives us a way to compute arbitrarily good approximations to the limit \(x_*\) — eventually. But the word “eventually” is rarely a satisfactory answer to the question “when will we get there?” When we are focused on computation (as opposed to existence proofs), we typically seek a more quantitative answer, e.g. \(d(x_n,x_*) \leq \gamma_n\) where \(\gamma_n\) is some sequence with well-defined convergence properties. Alternately, we might give some function \(\nu(\epsilon)\) such that \(d(x_n, x_*) \leq \epsilon\) for \(n \geq \nu(\epsilon)\).
For example, we say \(x_n\) converges geometrically (or sometimes linearly) to \(x_*\) if \(d(x_n,x_*) \leq \alpha^n d(x_0, x_*)\) for some \(0 \leq \alpha < 1\) (often called the rate constant). If \(x_n\) is geometrically convergent with rate \(\alpha > 0\), then we can guarantee \(d(x_n,x_*) < \epsilon\) for \(n > \nu(\epsilon) = \left( \log(\epsilon) - \log(d(x_0,x_*)) \right) / \log(\alpha)\).
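For instance, the following sketch (our own toy example, not from the text) checks the formula for \(\nu(\epsilon)\) on the geometrically convergent iteration \(x \mapsto x/2 + 1\), which converges to \(x_* = 2\) with rate constant \(\alpha = 1/2\).

let
    # The iteration x -> x/2 + 1 satisfies d(x_n, x_*) = α^n d(x_0, x_*) with α = 1/2
    α, xstar = 0.5, 2.0
    x, ϵ = 0.0, 1e-6
    d0 = abs(x - xstar)
    ν = (log(ϵ) - log(d0))/log(α)     # Guaranteed iteration count
    for n = 1:ceil(Int, ν)
        x = x/2 + 1
    end
    abs(x - xstar) < ϵ
end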
5.2.3 Completeness
A Cauchy sequence \(x_1, x_2, \ldots\) in a metric space \(X\) is a sequence such that for any \(\epsilon > 0\) there is an \(N\) such that for all \(i, j \geq N\), \(d(x_i,x_j) < \epsilon\). We say Cauchy sequences \(x_1, x_2, \ldots\) and \(y_1, y_2, \ldots\) are equivalent if \(d(x_i,y_i) \rightarrow 0\). We might believe that if there is any justice in the world, Cauchy sequences should converge, and equivalent Cauchy sequences should converge to the same thing. Unsurprisingly, though, sometimes we need to create our own justice.
A complete metric space is one in which Cauchy sequences converge. Alas, not all metric spaces are complete. However, for any metric space \(X\) we can define a new metric space \(\bar{X}\) whose elements are equivalence classes of Cauchy sequences in \(X\), with the metric given by \(d(\{x_i\}, \{y_i\}) = \lim_{i \rightarrow \infty}
d(x_i,y_i)\). There is also a canonical embedding of \(X\) into \(\bar{X}\) which preserves the metric, given by taking \(x \in X\) to the equivalence class associated with the constant sequence \(x, x,
\ldots\). The space \(\bar{X}\) is the completion of \(X\) — and, by construction, it is complete, and every element of \(\bar{X}\) can be expressed as a limit point for the embedding of \(X\) into \(\bar{X}\).
Every finite-dimensional vector space over \(\mathbb{R}\) or \(\mathbb{C}\) is complete. There are also many complete infinite-dimensional vector spaces: for example, the space \(C([a,b], \mathbb{R})\) of continuous real-valued functions on a closed interval \([a,b]\) is complete under the sup norm. However, there are also many examples of function spaces that are not complete, and in order to make analysis more tractable, we usually work with the completions of these spaces. A complete normed vector space is called a Banach space, and a complete inner product space is called a Hilbert space.
5.2.4 Compactness
In \(\mathbb{R}^n\) (or any finite-dimensional normed vector space), there is a special role for sets that are closed and bounded. We generalize this to the idea of compact sets. The standard definition of a set \(K\) being compact is that any open cover \(C\) of \(K\) (a collection of open sets such that every point in \(K\) belongs to some \(U \in C\)) has a finite subcover, i.e. \(K \subset \bigcup_{i=1}^n U_i\) where each \(U_i\) belongs to the original cover \(C\). One sometimes considers covers consisting of open balls, in which case compactness of \(K\) is sometimes colloquially rendered as “\(K\) can be guarded by a finite number of arbitrarily nearsighted watchmen.”
Compact metric spaces have a number of useful properties. For example, the continuous image of a compact set is compact; this means, for example, that if \(f : K \rightarrow \mathbb{R}\) is continuous and \(K\) is compact, then \(f(K)\) is also compact — so closed and bounded (for \(\mathbb{R}\)). Therefore, a continuous function on a compact set \(K\) always has a finite maximum and minimum that are achieved somewhere in \(K\). Compact metric spaces are also sequentially compact4, which means that if \(K\) is compact and \(x_1, x_2, \ldots\) is a sequence in \(K\), then there is some subsequence \(x_{j_1}, x_{j_2},
\ldots\) that converges to a point in \(K\).
In a finite dimensional normed vector space, the closed unit ball \(\{v \in \mathcal{V}: \|v\| \leq 1\}\) is always compact (since it is a continuous map of a closed and bounded set in \(\mathbb{R}^n\) via a basis). One can use the fact that norms are continuous to show that this implies norm equivalence among finite-dimensional spaces, i.e. for any norms \(\|\cdot\|\) and \(\|\cdot\|'\) on a finite dimensional \(\mathcal{V}\), there exist constants \(0 < m \leq M < \infty\) such that \[
\forall v \in \mathcal{V}, m \|v\|' \leq \|v\| \leq M \|v\|'.
\] The constants may be very big or small in general, but they exist! In contrast, for infinite dimensional spaces, the closed unit ball is generally not compact, and different norms give different topologies.
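The sketch below (our own example; it assumes LinearAlgebra is loaded, as in the other code in this chapter) illustrates norm equivalence in \(\mathbb{R}^{10}\): the ratio of the 1-norm to the \(\infty\)-norm always lies between \(m = 1\) and \(M = 10\), the dimension.

using LinearAlgebra
let
    # Ratio of the 1-norm to the ∞-norm for many random vectors in R^10;
    # norm equivalence says the ratio stays between 1 and 10.
    ratios = [norm(v,1)/norm(v,Inf) for v in eachcol(rand(10,1000))]
    1.0 <= minimum(ratios) <= maximum(ratios) <= 10.0
end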
5.2.5 Contractions
One of the more frequently-used theorems in numerical analysis is the Banach fixed point theorem, also known as the contraction mapping theorem.
Theorem 5.1 (Banach fixed point theorem) Suppose \(X\) is a (non-empty) complete metric space and \(f : X
\rightarrow X\) is a contraction, i.e. there is some \(0 \leq \alpha <
1\) such that \(d(f(x), f(y)) \leq \alpha d(x,y)\). Then there exists a unique point \(x_* \in X\) such that \(f(x_*) = x_*\).
Proof. Choose any \(x_0 \in X\); there must be at least one since \(X\) is non-empty. Using \(x_0\) as a starting point, define the sequence \(x_0,
x_1, x_2, \ldots\) by the iteration \(x_{k+1} = f(x_k)\). By contractivity, \(d(x_k, x_{k+1}) \leq \alpha^k d(x_0, x_1)\). By the triangle inequality and a geometric series bound, for any \(j > i\), \[
d(x_i, x_j)
\leq \sum_{k=i}^{j-1} d(x_k, x_{k+1})
\leq \sum_{k=i}^{j-1} \alpha^k d(x_0,x_1)
\leq \alpha^i \frac{d(x_0,x_1)}{1-\alpha}.
\] Therefore, for any \(\epsilon > 0\) we can choose an \(N\) such that for all \(i,j > N\), \(d(x_i,x_j) \leq \alpha^N d(x_0,x_1)/(1-\alpha) <
\epsilon\), i.e. the iterates \(x_j\) form a Cauchy sequence, which must converge to some \(x_* \in X\) by completeness.
Now suppose \(x_*' \in X\) satisfies \(x_*' = f(x_*')\). Then by contractivity and the fixed point condition, \[
d(x_*, x_*') = d(f(x_*), f(x_*')) \leq \alpha d(x_*, x_*'),
\] which, since \(\alpha < 1\), can only happen if \(d(x_*, x_*') = 0\). Therefore \(x_*\) and \(x_*'\) must be the same point.
The Banach fixed point theorem is useful because it gives us not only the existence of a fixed point, but also a method to compute arbitrarily good approximations to that fixed point with a rate of convergence. Consequently, this theorem is a mainstay in the convergence analysis of many iterative numerical methods.
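As a concrete illustration (our own example), the map \(f(x) = \cos(x)\) is a contraction on \([0,1]\) (it maps \([0,1]\) into itself and \(|f'(x)| \leq \sin(1) < 1\)), so fixed point iteration converges to the unique solution of \(\cos(x) = x\).

let
    # Fixed point iteration for the contraction f(x) = cos(x) on [0,1]
    f(x) = cos(x)
    x = 0.5
    for k = 1:100
        x = f(x)
    end
    abs(f(x) - x) < 1e-12    # The iterate is (nearly) a fixed point
end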
5.3 Continuity and beyond
The study of real analysis often devolves into a study of counterexamples: we start with something that seems like it should be true about a function, give an example where it is not, and then come up with a hypothesis under which it is true. While we will frequently resort to assuming that everything is “as nice as possible” (several times continuously differentiable, at least), it is worth a page or two to talk about more modest hypotheses and what they give us. In particular:
Continuity gives us the intermediate value theorem.
Uniform continuity gives us a set of functions that is closed under (uniform) limits.
Absolute continuity gives us the fundamental theorem of calculus.
Bounded variation gives us a dual space to continuous functions.
For stronger control that can be used to prove approximation bounds, we might turn to Lipschitz continuity (or a more general modulus of continuity).
We will focus on functions between normed vector spaces. But almost everything in this section, and indeed in this chapter, generalizes beyond this setting.
5.3.1 Continuity
When we compute, we almost always work with functions that are continuous almost everywhere. Floating point cannot exactly represent every real number, and so when we evaluate a function from \(\mathbb{R}\) to \(\mathbb{R}\) on the computer, it helps if small changes to the input (and the output) can be forgiven.
Let \(\mathcal{V}\) and \(\mathcal{W}\) be normed vector spaces, and suppose \(\Omega
\subset \mathcal{V}\). We say \(f : \Omega \subset \mathcal{V}\rightarrow \mathcal{W}\) is continuous at \(x \in \Omega\) if \[
\forall \epsilon > 0, \exists \delta > 0 : \forall y \in \Omega,
\|x-y\| \leq \delta \implies \|f(y)-f(x)\| \leq \epsilon
\] In words: for any tolerance \(\epsilon\) there is a corresponding tolerance \(\delta\) so that getting \(y\) within \(\delta\) of \(x\) means we will get \(f(y)\) within \(\epsilon\) of \(f(x)\). Composition of continuous functions is continuous: if \(g\) is continuous at \(f(x)\) and \(f\) is continuous at \(x\), then \(g \circ f\) is continuous at \(x\).
A function is continuous on \(\Omega\) if it is continuous at each point in \(\Omega\). The set of such continuous functions from \(\Omega
\subset \mathcal{V}\) to \(\mathcal{W}\) is called \(C(\Omega, \mathcal{W})\). The continuous functions \(C(\Omega, \mathcal{W})\) form a vector space: sums of continuous functions are continuous, and so are scalar multiples of continuous functions.
For continuous functions from \([a,b] \subset \mathbb{R}\) to \(\mathbb{R}\), the intermediate value theorem is a fundamental result: if \(x,y \in [a,b]\), then for every target value \(f_*\) between \(f(x)\) and \(f(y)\), there is a \(z\) between \(x\) and \(y\) such that \(f(z) = f_*\). The intermediate value theorem is fundamentally one-dimensional, but still tells us useful things about functions between vector spaces through composition: if \(f : \Omega \subset \mathcal{V}\rightarrow \mathcal{W}\) is continuous, the segment from \(x\) to \(y\) lies in \(\Omega\), and \(w^* \in \mathcal{W}^*\) is any (bounded) linear functional, then the function \(s \in [0,1] \mapsto w^* f((1-s)x + sy)\) is a continuous real-valued function on the interval \([0,1]\) to which the intermediate value theorem applies.
5.3.2 Uniform continuity
A function is uniformly continuous if the same \(\epsilon\)-\(\delta\) relation works for every \(x\) in \(\Omega\). In a finite-dimensional space, any continuous function on a closed and bounded \(\Omega\) is automatically also uniformly continuous. A sequence of functions is uniformly equicontinuous if the same \(\epsilon\)-\(\delta\) relation works for every \(x \in \Omega\) and for every function in the sequence.
We say that \(f_k \rightarrow f\) uniformly if for all \(\epsilon > 0\) there is an \(N\) such that for \(k > N\) we must have \(\|f_k-f\|_{\infty} \leq \epsilon\), where \(\|f_k-f\|_{\infty} = \sup_{x \in \Omega} \|f_k(x)-f(x)\|\). This matters because while a pointwise limit of continuous functions does not have to be continuous, a uniform limit of uniformly continuous functions is uniformly continuous.
If \(f : [a,b] \subset \mathbb{R}\rightarrow \mathbb{R}\) is continuous (and therefore uniformly continuous), the Weierstrass approximation theorem says that \(f\) can be written as the uniform limit of a sequence of polynomials.
5.3.3 Absolute continuity
We say \(f : \Omega \subset \mathbb{R}\rightarrow \mathbb{R}\) is absolutely continuous on an interval if \[\begin{aligned}
\forall \epsilon > 0, \exists \delta > 0 : &
\forall \{(x_i, y_i) \subset \Omega\}_{i=1}^n \mbox{ disjoint}, \\
&\sum_{i=1}^n |x_i-y_i| < \delta \implies
\sum_{i=1}^n |f(x_i)-f(y_i)| < \epsilon.
\end{aligned}\] Absolutely continuous functions are those that are nice enough that the (Lebesgue) fundamental theorem of calculus applies; that is, the function \(f\) has derivatives almost everywhere, and \[
f(b) = f(a) + \int_a^b f'(x) \, dx.
\] Most (perhaps all) continuous functions that we will encounter in this class will be absolutely continuous.
5.3.4 Bounded variation
The total variation of a function \(f : [a,b] \rightarrow \mathbb{R}\) is a supremum over partitions \(a = x_0 < x_1 < \ldots < x_{n_{\mathcal{P}}} = b\) of how much the function value jumps: \[
V_a^b = \sup_{\mathcal{P}} \sum_{i=0}^{n_{\mathcal{P}}-1} |f(x_{i+1})-f(x_i)|.
\] For continuously differentiable functions, the total variation is the integral of the absolute value of the derivative. The notion generalizes to functions on more general vector spaces, though it is somewhat more technical and involves a little measure theory.
Bounded variation functions (those with finite total variation) are those that are nice enough that they can be integrated against continuous functions (in the Riemann sense). Indeed, bounded variation functions correspond to the dual space to \(C[a,b]\): every linear functional \(w^* \in (C[a,b])^*\) can be written as \(w^* f = \int_a^b f(t) \, dv(t)\) where \(v\) is a bounded variation function and \(dv(t)\) is its (distributional) derivative.
5.3.5 Lipschitz continuity
A function \(f : \Omega \subset \mathcal{V}\rightarrow \mathcal{W}\) is Lipschitz continuous with constant \(C\) if \(\forall x, y \in \Omega\), \[
\|f(x)-f(y)\| \leq C\|x-y\|.
\] Unlike other notions of continuity we have considered, Lipschitz continuity involves no explicit \(\epsilon-\delta\) relationship. Also unlike other notions of continuity, Lipschitz continuity gives us control on the relationship between function values even at points that may be far from each other.
5.3.6 Modulus of continuity
We have already mentioned that the classic theorem of Weierstrass says that a continuous real-valued function \(f\) on a finite closed interval \([a,b]\) can be approximated arbitrarily well by polynomials in the \(L^\infty\) sense, i.e. \[
\forall \epsilon > 0, \exists d \mbox{ and } p \in \mathcal{P}_d : \|f-p\|_\infty \leq \epsilon,
\] where \(\|f-p\|_\infty = \max_{x \in [a,b]} |f(x)-p(x)|\). But while this is a useful theorem, it lacks the level of detail we would like if we are trying to compute. What degree polynomial is needed for a particular \(\epsilon\)? How do we construct such a polynomial? We will return to some of these points later in the book, but for now we simply want to make the point that we usually want more information about our functions than simply that they are continuous. In particular, we would like to be able to say something more specific about a relationship between \(\epsilon\) and \(\delta\), either close to a particular point or more globally, e.g. making a statement of the form \[
\|f(x)-f(y)\| \leq g(x-y)
for appropriate \(x\) and close enough \(y\), where \(g\) is some well-understood real-valued gauge function. One example of this is Lipschitz continuity, but it is useful to generalize to other gauge functions.
The modulus of continuity for a function \(f\) on some domain \(\Omega\) is a non-decreasing function \(\omega(\delta)\) for \(\delta > 0\) given by \[
\omega(\delta) = \sup_{x, y \in \Omega : \|x-y\| \leq \delta} \|f(x)-f(y)\|.
\] Put slightly differently, the modulus of continuity is the minimal function such that \[
f(x+d) = f(x) + r(x,d), \quad \|r(x,d)\| \leq \omega(\|d\|)
\] for any \(x\) and \(x+d\) in \(\Omega\). A function \(f\) is uniformly continuous iff \(\omega(\delta) \rightarrow 0\) as \(\delta \rightarrow 0\).
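As a sanity check (our own example), we can estimate the modulus of continuity of \(\sqrt{x}\) on \([0,1]\) by brute force over a grid; the result should match \(\omega(\delta) = \sqrt{\delta}\), consistent with the \(1/2\)-Hölder continuity of the square root discussed below.

let
    # Brute-force estimate of the modulus of continuity of sqrt on a grid over [0,1]
    f(x) = sqrt(x)
    xs = range(0, 1, length=401)
    ω(δ) = maximum(abs(f(x)-f(y)) for x in xs, y in xs if abs(x-y) <= δ)
    all(ω(δ) ≈ sqrt(δ) for δ in [0.25, 0.09, 0.04])
end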
Several standard rules for differentiation have analogous inequalities involving the modulus of continuity. For example, if \(f : \Omega
\rightarrow \mathcal{V}\) and \(g : \Omega \rightarrow \mathcal{V}\) are functions with moduli of continuity \(\omega_f\) and \(\omega_g\), then
\(f+g\) has modulus of continuity \(\omega_{f+g}(\delta) \leq \omega_f(\delta) + \omega_g(\delta)\).
\(\alpha f\) has modulus of continuity \(\omega_{\alpha f}(\delta) \leq
|\alpha| \omega_f(\delta)\)
If \(w^* \in \mathcal{V}^*\), then \(\omega_{w^* f}(\delta) \leq \|w^*\|
\omega_f(\delta)\)
If \(\mathcal{V}\) is an inner product space, \(x \mapsto \langle f(x), g(x)
\rangle\) has modulus of continuity \(\omega_{fg}(\delta) \leq
\omega_f(\delta) \|g\|_\infty + \|f\|_\infty \omega_g(\delta)\) where \(\|g\|_\infty = \sup_{x \in \Omega} \|g(x)\|\) and similarly with \(\|f\|_\infty\).
If \(f : \Omega \rightarrow \mathbb{R}\), and \(\beta = \inf_{x \in \Omega}
|f(x)|\), we also have \(\omega_{1/f}(\delta) \leq \omega_f(\delta)/\beta^2\). Finally, if \(f : \Omega \rightarrow \mathcal{V}\) and \(g : f(\Omega)
\rightarrow \mathcal{W}\), we have \[
\omega_{g \circ f}(\delta) \leq \omega_g(\omega_f(\delta)).
\] The proofs of these facts are left as an exercise for the reader.
A function \(f\) is Lipschitz continuous on \(\Omega\) if the modulus of continuity on \(\Omega\) satisfies \(\omega_f(\delta) \leq C\delta\) for some \(C\). Put differently, a Lipschitz function satisfies \[
\forall x, y \in \Omega, \|f(x)-f(y)\| \leq C \|x-y\|.
\] The constant \(C\) is a Lipschitz constant for \(f\).
A more general notion than Lipschitz continuity is \(\alpha\)-Hölder continuity, which is the condition that \(\omega_f(\delta) \leq
C\delta^\alpha\). This condition is interesting for \(0 < \alpha \leq 1\), with \(\alpha = 1\) corresponding to Lipschitz continuity. The only \(\alpha\)-Hölder continuous functions for \(\alpha > 1\) are the constant functions.
Returning to our discussion of approximating continuous real-valued functions on \([a,b]\) by polynomials, Jackson’s theorem provides a quantitative bound on the optimal error in terms of the modulus of continuity. If \(f\) is a continuous function on \([a,b]\), Jackson’s theorem gives that there exists a polynomial \(p\) of degree \(n\) with \[
\|f-p\|_\infty \leq 6 \omega \left( \frac{b-a}{n} \right).
\] We can directly substitute the bounds from the definitions of Lipschitz or \(\alpha\)-Hölder continuous functions to get bounds for those cases.
5.3.7 Order notation
The modulus of continuity provides global control over functions on some domain. However, sometimes we are satisfied with much more local notions of control. To express these, it is useful to again turn to order notation, which we visited briefly in Section 3.2. The same notation can be used to compare functions \(f(n)\) and \(g(n)\) in a limit as \(n \rightarrow \infty\) or to compare functions \(f(\epsilon)\) and \(g(\epsilon)\) in the limit as \(\epsilon \rightarrow 0\). We are typically interested in upper bounds when dealing with calculus; that is, we care about
\(f(\epsilon) = O(g(\epsilon))\), i.e. \(\exists C > 0, \rho > 0\) s.t. \(\forall
\epsilon < \rho\), \(|f(\epsilon)| < Cg(\epsilon)\).
\(f(\epsilon) = o(g(\epsilon))\), i.e. \(\forall C > 0, \exists \rho >
0\) s.t. \(\forall \epsilon < \rho\), \(|f(\epsilon)| < Cg(\epsilon)\).
From a less notationally messy perspective, we have \[
\lim \sup_{\epsilon \rightarrow 0} |f(\epsilon)|/g(\epsilon) =
\begin{cases}
C \geq 0, & \mbox{ for } f(\epsilon) = O(g(\epsilon)) \\
0, & \mbox{ for } f(\epsilon) = o(g(\epsilon)),
\end{cases}
\] and when \(\lim_{\epsilon \rightarrow 0} f(\epsilon)/g(\epsilon) = C \neq 0\), we say \(f(\epsilon) = \Theta(g(\epsilon))\). When \(f : \mathbb{R}\rightarrow \mathcal{V}\) for some normed space \(\mathcal{V}\), we abuse order notation slightly to write \[
\lim \sup_{\epsilon \rightarrow 0} \|f(\epsilon)\|/g(\epsilon) =
\begin{cases}
C \geq 0, & \mbox{ for } f(\epsilon) = O(g(\epsilon)) \\
0, & \mbox{ for } f(\epsilon) = o(g(\epsilon)),
\end{cases}
\]
We are frequently interested in the case of functions that are \(O(\epsilon^r)\) for some integer or rational \(r\). If \(f(\epsilon) = \Theta(\epsilon^r)\) is positive, then near zero we expect \(\log(f(\epsilon)) \approx \log(C\epsilon^r) = r
\log(\epsilon) + \log(C)\). We can check this condition for small \(\epsilon\) graphically with a log-log plot:
let
    # Should be O(ϵ^{1/2})
    f(ϵ) = maximum(roots(Polynomial([-ϵ, 0, 1, 1])))
    ϵs = 10.0.^(-10:-1)
    fs = f.(ϵs)
    println("Approximate slope: ",
            (log(fs[2])-log(fs[1]))/(log(ϵs[2])-log(ϵs[1])))
    plot(ϵs, fs, xscale=:log10, yscale=:log10)
end
Approximate slope: 0.4999953048691424
This type of plot is, of course, somewhat heuristic: the asymptotic behavior as \(\epsilon \rightarrow 0\) should manifest for small enough \(\epsilon\), but nothing says that \(10^{-10}\) (for example) is “small enough.” We can be more rigorous by including additional information, as we will discuss in later chapters.
5.4 Derivatives
We assume the reader is thoroughly familiar with differentiation in one dimension. Our focus in the rest of this section is on differentiation of functions between vector spaces.
5.4.1 Gateaux and Frechet
5.4.1.1 Functions from \(\mathbb{R}\) to \(\mathbb{R}\)
The standard definition for the derivative of \(f : \mathbb{R}\rightarrow \mathbb{R}\) is \[
f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h)-f(x)}{h}.
\] Alternately, we can write \(f\) as \[
f(x+h) = f(x) + f'(x) h + o(|h|).
\] Of course, not all functions are differentiable. When \(f\) is differentiable at \(x\), we will sometimes also write \(Df(x)\) to denote the derivative.
When the derivative exists and is continuous on some interval \((a,b)\) containing \(x\), we say \(f\) is \(C^1\) on \((a,b)\). In this case, the fundamental theorem of calculus gives that for any \(x+h \in (a,b)\), \[\begin{aligned}
f(x+h)
&= f(x) + \int_0^h f'(x+\xi) \, d\xi \\
&= f(x) + f'(x)h + r(h) \\
r(h) &= \int_0^h \left( f'(x+\xi)-f'(x) \right) \, d\xi.
\end{aligned}\] Here \(r(h)\) is the remainder term for the first-order approximation \(f(x) + f'(x)h\). If \(\omega_{f'}\) is the modulus of continuity for \(f'\) on \((a,b)\), then \[
\|f'(x+\xi)-f'(x)\| \leq \omega_{f'}(|\xi|),
\] and therefore \[
|r(h)| \leq \int_0^{|h|} \omega_{f'}(\xi) \, d\xi.
\] When \(f'\) is Lipschitz with constant \(C\), we have \[
|r(h)| \leq \int_0^{|h|} \omega_{f'}(|\xi|) \, d\xi
\leq \int_0^{|h|} C\xi \, d\xi = \frac{1}{2} Ch^2.
\] Hence, in regions where \(f'\) is not only continuous but Lipschitz, the error \(r(h)\) in the linearized approximation is not only \(o(|h|)\), but it is even \(O(h^2)\).
When a differentiable function \(f\) is only available by evaluation, we can approximate the derivative at \(x\) by finite differences, e.g. taking \[
f'(x) \approx \frac{f(x+h)-f(x)}{h}
\] where \(h\) is “small enough.” More precisely, if \(f\) is \(C^1\) on an interval \((a,b)\) containing \(x\) and \(x+h\), we have \[
\frac{f(x)+f'(x)h + r(h) - f(x)}{h} = f'(x) + \frac{r(h)}{h},
\] where \(r\) is the remainder for the linear approximation about \(x\). When \(f'\) is Lipschitz with constant \(C\), the error in this finite difference approximation is therefore \[
\frac{|r(h)|}{|h|} \leq \frac{1}{2} C|h| = O(|h|).
\] We further analyze how the size of \(h\) affects accuracy, as well as other methods for approximating derivatives, in Chapter 12. For this chapter, though, it will still be useful to sanity check derivatives via finite differences:
finite_diff(f, x; h=1e-8) = (f(x+h)-f(x))/h
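As a quick usage check (our own example, not from the text), the derivative of \(x^3\) at \(x = 2\) should be \(12\):

isapprox(finite_diff(x -> x^3, 2.0), 12.0, rtol=1e-6)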
5.4.1.2 Functions from \(\mathbb{R}\) to \(\mathcal{W}\)
Suppose \(g : \mathbb{R}\rightarrow \mathcal{W}\) for some normed linear space \(\mathcal{W}\). We can define the derivative of \(g\) almost identically to the derivative for a function from \(\mathbb{R}\) to \(\mathbb{R}\), i.e. \[
g(x+h) = g(x) + g'(x) h + r(h), \quad \|r(h)\| = o(|h|).
\] The only thing that differs is that we are controlling the norm of \(r(h)\) rather than the absolute value. Also similarly to the real case, if \(g'\) is Lipschitz with constant \(C\), then \[
\|r(h)\| \leq \frac{1}{2} Ch^2,
\] and we can bound the error in simple finite difference estimates of the derivative as we do in the case of functions from \(\mathbb{R}\) to \(\mathbb{R}\).
There are a handful of theorems that are specific to real-valued functions, like the mean value theorem (Section 5.4.2). These theorems usually still tell us interesting things about vector valued functions by considering functions \(x \mapsto w^* g(x)\) where \(w^* \in \mathcal{W}^*\) is some dual vector.
5.4.1.3 Functions from \(\mathcal{V}\) to \(\mathbb{R}\)
Suppose \(f : \mathcal{V}\rightarrow \mathbb{R}\) for some normed linear space \(\mathcal{V}\). Then for any point \(x \in \mathcal{V}\) and nonzero vector \(u \in \mathcal{V}\), we can define a function \(f \circ c\) from \(\mathbb{R}\) to \(\mathbb{R}\) that evaluates \(f\) along a ray \(c_{x,u}(s) = x+su\). Assuming it exists, the derivative \((f \circ c_{x,u})'\) is the Gateaux derivative (directional derivative) of \(f\) at a point \(x\) and in a direction \(u\): \[
D[f(x); u] = (f \circ c_{x,u})'(0).
\] Alternately, the Gateaux derivative is the number such that \[
f(x+su) = f(x) + s D[f(x); u] + o(s).
\] We can estimate the Gateaux derivative by finite differencing in the given direction:
finite_diff(f, x, u; h=1e-8) = (f(x+h*u)-f(x))/h
When the Gateaux derivative of \(f\) at \(x\) is defined for all \(u \in \mathcal{V}\), we say \(f\) is Gateaux differentiable at \(x\).
When all the Gateaux derivatives are continuously defined in a neighborhood of \(x\), we say \(f\) is \(C^1\) on some open set containing \(x\). In this case, Gateaux derivatives are related by the Frechet derivative, a functional \(f'(x) \in \mathcal{V}^*\) (also written \(Df(x)\)) such that \[
D[f(x); u] = f'(x) u
\] or, equivalently, for any direction \(u\), \[
f(x+su) = f(x) + sf'(x) u + o(s).
\] When the Frechet derivative is Lipschitz with some constant \(C\) in a consistent norm, we can bound the remainder term by \(Cs^2/2\), as in the one-dimensional case. Frechet differentiability is a stronger condition than Gateaux differentiability. When we say a function is “differentiable at \(x\)” without qualifications, we mean it is Frechet differentiable.
The Frechet derivative (or just “the derivative”) \(f'(x)\) is an element in \(\mathcal{V}^*\). In a (real) inner product space, the gradient \(\nabla f(x) \in \mathcal{V}\) is the dual of \(f'(x)\) given by the Riesz map, i.e. \[
f'(x) u = \langle \nabla f(x), u \rangle.
\] In \(\mathbb{R}^n\) with the standard inner product, the gradient (a column vector) is the transpose of the derivative (a row vector).
5.4.1.4 A polynomial example
Following our pattern from the previous chapter, we consider an example in a polynomial space. Let \(f : \mathcal{P}_d \rightarrow \mathbb{R}\) be given by \(f(p) = \frac{1}{2} \int_{-1}^1 p(x)^2 \, dx\). The Frechet derivative of \(f\) (which is a function of the polynomial \(p \in \mathcal{P}_d\), not the indeterminate \(x\)) is then the functional \(f'(p)\) such that the action on a vector \(q \in \mathcal{P}_d\) is \[
f'(p)q = \int_{-1}^1 p(x) q(x) \, dx.
\] With respect to the power basis, we can write \(f'(p)\) in terms of the row vector \[
d^T = \begin{bmatrix} f'(p) x^0 & f'(p) x^1 & \ldots & f'(p) x^d \end{bmatrix}.
\] The gradient with respect to the \(L^2([-1, 1])\) inner product is then \[
\nabla f(p) = X_{0:d} M^{-1} d
\] where \(M\) is the Gram matrix for the power basis.
We can compute with \(f'(p)\) by calling the integrate command or through the power basis. We can also compute with \(f'(p)\) by writing it in terms of a basis of \(\mathcal{P}_d^*\) in terms of point evaluation functionals, i.e. \[
f'(p) = c^T W^*, \quad W^* = \begin{bmatrix} \delta_{x_0}^* \\ \vdots \\ \delta_{x_d}^* \end{bmatrix}.
\] We can also compute the \(c\) coefficients by solving \[
c^T A = d^T
\] where \(A\) is the Vandermonde matrix \[
A = W^* X_{0:d} =
\begin{bmatrix}
1 & x_0^1 & \ldots & x_0^d \\
1 & x_1^1 & \ldots & x_1^d \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_d^1 & \ldots & x_d^d
\end{bmatrix}.
\]
We can sanity check this against finite differences for a random pair of polynomials:
let
    # Functions for f and the Frechet derivative f'(p)
    f(p) = integrate(0.5*p*p, -1, 1)
    Df(p) = q -> integrate(p*q, -1, 1)

    # Express f'(p) in the power basis
    d(p) = [Df(p).(Polynomial(ej)) for ej in eachcol(Matrix(I,4,4))]'

    # Gram matrix for the L2 inner product
    m(k) = k % 2 == 1 ? 0.0 : 2/(k+1)
    M = [m(i+j) for i=0:3, j=0:3]

    # Compute the gradient
    ∇f(p) = Polynomial(M\d(p)')

    # Vandermonde matrix A for four equispaced points
    x = range(-1, 1, length=4)
    A = [xi^j for xi=x, j=0:3]

    # Compute c in terms of point evaluations
    c(p) = d(p)/A

    # Test everything
    ptest = Polynomial(rand(4))
    qtest = Polynomial(rand(4))
    Dfpq_fd = finite_diff(f, ptest, qtest)     # Finite diff
    Dfpq1 = Df(ptest)(qtest)                   # Analytic form
    Dfpq2 = c(ptest)*qtest.(x)                 # Via evaluation fnls
    Dfpq3 = integrate(∇f(ptest)*qtest, -1, 1)  # Via gradient

    # Compare -- everything should be equal (some approx for fd)
    isapprox(Dfpq1, Dfpq_fd, rtol=1e-6) && Dfpq1 ≈ Dfpq2 && Dfpq1 ≈ Dfpq3
end
true
5.4.1.5 General mappings
Now consider \(f : \mathcal{V}\rightarrow \mathcal{W}\) where \(\mathcal{V}\) and \(\mathcal{W}\) are normed linear spaces. In this case, the Gateaux derivative in a direction \(u \in \mathcal{V}\) is a vector in \(\mathcal{W}\), still given as \[
D[f(v); u] = (f \circ c_{v,u})'(0) \mbox{ for }
c_{v,u}(s) = v+su
\] and when there is a Frechet derivative \(Df(v) = f'(v) \in L(\mathcal{V}, \mathcal{W})\) we have \[
f(v+u) = f(v) + f'(v) u + o(\|u\|),
\] and the Gateaux derivatives satisfy \[
D[f(v); u] = f'(v) u.
\] When the Frechet derivative is Lipschitz with constant \(C\) (with respect to the induced norm), then the remainder for the linear approximation is bounded by \(C\|u\|^2/2\). The representation of the Frechet derivative with respect to particular bases \(V\) and \(W\) for \(\mathcal{V}\) and \(\mathcal{W}\) is the Jacobian matrix \(J = W^{-1} f'(v) V\).
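The following sketch (our own example) forms the Jacobian of a map from \(\mathbb{R}^2\) to \(\mathbb{R}^2\) column-by-column from directional finite differences, using the finite_diff helper above, and compares against the analytic Jacobian.

let
    # A made-up map from R^2 to R^2 and its analytic Jacobian
    f(x) = [x[1]*x[2], sin(x[1])]
    J(x) = [x[2]       x[1];
            cos(x[1])  0.0 ]

    # Estimate each Jacobian column by differencing in a coordinate direction
    x = [0.3, 1.7]
    e1, e2 = [1.0, 0.0], [0.0, 1.0]
    J_fd = [finite_diff(f, x, e1) finite_diff(f, x, e2)]
    isapprox(J(x), J_fd, rtol=1e-6)
end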
5.4.2 Mean values
For one-dimensional continuously differentiable functions \(f : \Omega \subset \mathbb{R}\rightarrow \mathbb{R}\) for some open connected \(\Omega\), the fundamental theorem of calculus tells us that for \(a, b
\in \Omega\) we have \[
f(b)-f(a)
= \int_0^1 f'(zb + (1-z)a)(b-a) \, dz.
\] The mean value theorem tells us that for some intermediate \(c \in [a,b]\) (\(c = \xi b + (1-\xi) a\) for \(\xi \in [0,1]\)), we have \[\begin{aligned}
f'(c)(b-a) = \int_0^1 f'(zb + (1-z)a)(b-a) \, dz.
\end{aligned}\] The fundamental theorem of calculus holds even when \(f\) has discontinuities in the derivative at some points, but the second equation relies on continuity of the derivative. This useful theorem generalizes to the vector space case in various ways.
For continuously differentiable functions \(f : \Omega \subset \mathcal{V}
\rightarrow \mathbb{R}\) where \(\Omega\) is open and convex (i.e. the line segment connecting \(a\) and \(b\) lies within \(\Omega\)), we have \[\begin{aligned}
f(b)-f(a)
&= \int_0^1 f'(zb + (1-z)a) (b-a) \, dz \\
&= f'(\xi b + (1-\xi) a) (b-a).
\end{aligned}\] for some \(\xi \in [0,1]\). And for continuously differentiable functions \(g : \Omega \subset \mathcal{V}\rightarrow \mathcal{W}\) where \(\Omega\) is open and convex, we have that for any \(w^* \in
\mathcal{W}^*\) we can apply the mean value theorem to the real-valued \(f(v) =
w^* g(v)\); that is, there is a \(\xi \in [0,1]\) such that \[\begin{aligned}
w^* (g(b)-g(a))
&= \int_0^1 w^* g'(zb + (1-z)a) (b-a) \, dz \\
&= w^* g'(\xi b + (1-\xi) a) (b-a).
\end{aligned}\] However, the choice of \(\xi\) depends on \(w^*\), so it is not true that we can fully generalize the mean value theorem to maps between vector spaces. On the other hand, we can get something that is almost as good, at least for the purpose of proving bounds used in numerical computing, by choosing \(w^*\) to be a vector such that \(\|w^*\|=1\) (using the dual norm to whatever norm we are using for \(\mathcal{W}\)) and \(w^* (g(b)-g(a)) = \|g(b)-g(a)\|\). With this choice, and consistent choices of norms, we have \[\begin{aligned}
\|g(b)-g(a)\|
&= \int_0^1 w^* g'(zb + (1-z)a) (b-a) \, dz \\
&\leq \int_0^1 \|w^*\| \|g'(zb+(1-z)a)\| \|b-a\| \, dz \\
&\leq \left( \int_0^1 \|g'(zb+(1-z)a)\| \, dz \right) \|b-a\|.
\end{aligned}\] We can also bound the average norm of the derivative on the segment from \(a\) to \(b\) by \[
\int_0^1 \|g'(zb+(1-z)a)\| \, dz
\leq
\max_{z \in [0,1]} \|g'(zb+(1-z)a)\|.
\] When \(\|g'(x)\| \leq C\) for all \(x \in \Omega\), the function \(g\) is Lipschitz with constant \(C\) on \(\Omega\).
5.4.3 Chain rule
Suppose \(f : \mathcal{V}\rightarrow \mathcal{W}\) and \(g : \mathcal{U}\rightarrow
\mathcal{V}\) are affine functions, i.e. \[\begin{aligned}
g(x) &= g_0 + J_g x \\
f(y) &= f_0 + J_f y.
\end{aligned}\] Then \(h = f \circ g\) is another affine map \[
h(x) = f(g(x)) = f(g(0)) + J_f J_g x.
\] The chain rule is just a generalization from affine maps to functions well approximated by affine maps (i.e. differentiable functions).
Now let \(f : \mathcal{V}\rightarrow \mathcal{W}\) and \(g : \mathcal{U}\rightarrow \mathcal{V}\) be functions where \(g\) is differentiable at \(u\) and \(f\) is differentiable at \(v = g(u)\). Then \[\begin{aligned}
g(u+x) &= g(u) + g'(u) x + r_g(x), & r_g(x) &= o(\|x\|), \\
f(v+y) &= f(v) + f'(v) y + r_f(y), & r_f(y) &= o(\|y\|).
\end{aligned}\] Let \(h = f \circ g\); then setting \(y = g'(u) x + r_g(x) = O(\|x\|)\), we have \[\begin{aligned}
h(u+x) &=
f(v) + f'(v) y + r_f(y) \\
&= f(v) + f'(v) g'(u) x + r_h(x), \\
r_h(x) &= f'(v) r_g(x) + r_f(g'(u) x + r_g(x)) = o(\|x\|).
\end{aligned}\] That is, \(h\) is differentiable at \(u\) with \(h'(u) = f'(v) g'(u)\).
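As a quick numerical check (our own example, using the finite_diff helper above), we can verify \(h'(u) = f'(v) g'(u)\) for a small composition of maps between \(\mathbb{R}\), \(\mathbb{R}^2\), and \(\mathbb{R}\).

let
    # g : R -> R^2 and f : R^2 -> R, composed into h = f ∘ g
    g(u) = [u^2, sin(u)]
    Dg(u) = [2*u, cos(u)]           # g'(u), a column (maps scalars to R^2)
    f(v) = v[1]*v[2]
    Df(v) = [v[2] v[1]]             # f'(v), a row (maps R^2 to scalars)
    h = f ∘ g

    u = 0.7
    Dh_chain = (Df(g(u))*Dg(u))[1]  # The chain rule product f'(v) g'(u)
    isapprox(Dh_chain, finite_diff(h, u), rtol=1e-6)
end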
5.4.3.1 A polynomial example
Now consider \[
h(p) =
\frac{1}{2} \int_{-1}^1 \left( \frac{dp}{dx} - \phi(x) \right)^2 \, dx,
\] which we can see as the composition of the functional \[
f(r) = \frac{1}{2} \int_{-1}^1 r(x)^2 \, dx
\] which we analyzed in the previous section, together with \[
r(p) = \frac{dp}{dx} - \phi(x).
\] The function \(r(p)\) is affine in \(p\), and the derivative \(r'(p)\) is simply the differentiation operator \(\frac{d}{dx}\). We already saw that \[
f'(r) q = \int_{-1}^1 r(x) q(x) \, dx.
\] Therefore, by the chain rule we have \[
h'(p) q = f'(r) r'(p) q = \int_{-1}^1 r(x) \frac{dq}{dx}(x) \, dx.
\] As is often the case, working code is a useful check to make sure that we understand what is happening with the mathematics:
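A minimal sketch of such a check, with \(\phi\) taken to be an arbitrary fixed polynomial for illustration, using the Polynomials.jl integrate and derivative functions together with finite_diff from above:

let
    # An arbitrary fixed ϕ for illustration, plus h and its derivative functional
    ϕ = Polynomial([1.0, 2.0, 3.0])
    r(p) = derivative(p) - ϕ
    h(p) = integrate(0.5*r(p)*r(p), -1, 1)
    Dh(p) = q -> integrate(r(p)*derivative(q), -1, 1)

    # Compare the chain rule formula against a finite difference
    ptest = Polynomial(rand(4))
    qtest = Polynomial(rand(4))
    isapprox(Dh(ptest)(qtest), finite_diff(h, ptest, qtest), rtol=1e-6)
end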
5.4.4 Partial derivatives
Consider a Frechet-differentiable function \(f : \mathcal{V}\oplus \mathcal{U}\rightarrow \mathcal{W}\). For \(x \in \mathcal{V}\oplus \mathcal{U}\), we can write the Frechet derivative as a quasimatrix with respect to the two components of the domain space: \[
Df(x) = \begin{bmatrix} (Df(x))_1 & (Df(x))_2 \end{bmatrix},
\] where \((Df(x))_1 \in L(\mathcal{V}, \mathcal{W})\) and \((Df(x))_2 \in L(\mathcal{U}, \mathcal{W})\). However, it is often more convenient to “unpack” the input to the function into different components, i.e. thinking of \(f : \mathcal{V}\times \mathcal{U}\rightarrow \mathcal{W}\) evaluated at a point \((x_1, x_2)\). We similarly unpack a direction \(y\) into a \(\mathcal{V}\) component \(y_1\) and a \(\mathcal{U}\) component \(y_2\). Then we write \[
Df(x_1,x_2)(y_1,y_2) = D_1f(x_1,x_2) y_1 + D_2f(x_1,x_2) y_2,
\] where the maps \(D_1 f(x_1,x_2) \in L(\mathcal{V}, \mathcal{W})\) and \(D_2 f(x_1,x_2) \in L(\mathcal{U}, \mathcal{W})\) are the same as \((Df(x))_1\) and \((Df(x))_2\) above. We refer to \(D_1 f\) and \(D_2 f\) as the partial derivatives of \(f\) with respect to the first and second argument.
More generally, if \(f\) is a Frechet-differentiable function with \(m\) arguments, we write the quasimatrix expression \[\begin{aligned}
&Df(x_1, \ldots, x_m) (y_1, \ldots, y_m) \\
&=
\begin{bmatrix}
D_1 f(x_1, \ldots, x_m) &
\ldots &
D_m f(x_1, \ldots, x_m)
\end{bmatrix}
\begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} \\
&=
\sum_j D_j f(x_1, \ldots, x_m) y_j.
\end{aligned}\] Though this may look like a regular matrix expression, we are deliberately not assuming that the arguments have the same type. Indeed, the arguments may belong to wildly different spaces: the first component might belong to \(\mathcal{P}_d\), the second to \(\mathbb{R}^n\), and so on. Nonetheless, we can formally put a (normed) vector space structure on the whole list taken as a direct sum of the argument spaces, and then treat the partial derivatives \(D_j f\) as pieces of the overall derivative \(Df\).
A common source of confusion in partial differentiation comes from implicit function composition. For example, consider \(f : \mathcal{V}
\rightarrow \mathbb{R}\) where \(\mathcal{V}\) is a two dimensional space with bases \(V = \begin{bmatrix} v_1 & v_2 \end{bmatrix}\) and \(U = \begin{bmatrix} u_1 & u_2 \end{bmatrix}\). Then we can write \[\begin{aligned}
h_V(\alpha, \beta) &= (f \circ g_V)(\alpha, \beta), &
g_V(\alpha, \beta) &= v_1 \alpha + v_2 \beta, \\
h_U(a, b) &= (f \circ g_U)(a, b), &
g_U(a, b) &= u_1 a + u_2 b.
\end{aligned}\] In this notation, the partials are clearly defined: \[\begin{aligned}
D_k h_V(\alpha, \beta) &= f'(g_V(\alpha, \beta)) v_k, \\
D_k h_U(a, b) &= f'(g_U(a, b)) u_k.
\end{aligned}\] In contrast, classical notation often conflates \(f\), \(f \circ g_V\), and \(f \circ g_U\):
Classical notation                  Our notation
\(\partial f / \partial \alpha\)    \(D_1 (f \circ g_V)\)
\(\partial f / \partial \beta\)     \(D_2 (f \circ g_V)\)
\(\partial f / \partial a\)         \(D_1 (f \circ g_U)\)
\(\partial f / \partial b\)         \(D_2 (f \circ g_U)\)
We will sometimes use symbolic labels rather than indices to refer to arguments to a function. For example, we might write \(D_u f(u,v)\) instead of \(D_1 f(u,v)\). This approach is only sensible when the symbolic label unambiguously refers to an argument of \(f\), and we must treat it carefully. When there is a danger of ambiguity, we will bend our notation to make clear that we are labeling a slot. For example, we can write \(D_u f(u=a, v=b)\) to refer to \(D_1 f(a,b)\).
5.4.5 Implicit functions
Now suppose \(\mathcal{V}\) and \(\mathcal{W}\) are spaces of equal dimension, and consider a function \[
f : \mathcal{V}\times \mathcal{U}\rightarrow \mathcal{W}
\] that is continuously differentiable on some open set including a point \((v,u)\) for which \(f(v,u) = 0\). We will write the Frechet derivative of \(f\) in quasimatrix form as \[
f'(v,u) = \begin{bmatrix} D_1 f(v,u) & D_2 f(v,u) \end{bmatrix},
\] where \(D_1 f\) is the piece of the map associated with \(v\) and \(D_2 f\) is the piece associated with \(u\). If \(D_1 f(v,u)\) is invertible, the implicit function theorem says that there exists a function \[
g : \Omega \subset \mathcal{U}\rightarrow \mathcal{V}
\] defined on an open set containing \(u\) and such that \(g(u) = v\) and \(f(g(z), z) = 0\) for \(z\) in an open neighborhood about \(u\). By the chain rule, we must satisfy \[
D_1 f(g(z), z) g'(z) + D_2 f(g(z),z) = 0.
\] Therefore, \(g'(z)\) can be computed as \[
g'(z) = -\left( D_1 f(g,z) \right)^{-1} D_2 f(g,z),
\] where we have suppressed \(z\) as an argument to \(g\) to simplify notation.
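As a sanity check (a toy example of our own), consider the scalar equation \(f(v,u) = v^2 + u^2 - 1 = 0\), which implicitly defines \(v = g(u) = \sqrt{1-u^2}\) near \((v,u) = (0.6, 0.8)\); the formula for \(g'(u)\) agrees with a finite difference on the explicit solution.

let
    # f(v,u) = v^2 + u^2 - 1 = 0 defines v = g(u) (the branch with v > 0)
    f(v, u) = v^2 + u^2 - 1
    g(u) = sqrt(1-u^2)
    u0 = 0.8
    v0 = g(u0)

    # Implicit function theorem: g'(u) = -(D_1 f)^{-1} D_2 f = -(2v)^{-1} (2u)
    dg_ift = -(2*v0)\(2*u0)
    isapprox(dg_ift, finite_diff(g, u0), rtol=1e-6)
end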
Often we are interested not in \(g(z)\), but some \(h(g(z))\) that might lie in a much lower dimensional space. If everything is differentiable, then by the chain rule we have \[
(h \circ g)'(z)
= -h'(g(z)) \left[ \left( D_1 f(g,z) \right)^{-1} D_2 f(g,z) \right].
\] We can use associativity to rewrite \((h \circ g)'(z)\) as \[\begin{aligned}
\bar{g}^* &= -h'(g(z)) \left( D_1 f(g,z) \right)^{-1} \\
(h \circ g)'(z) &= \bar{g}^* D_2 f(g,z),
\end{aligned}\] where \(\bar{g}^* \in \mathcal{W}^*\) is a dual variable. Computing \(\bar{g}^*\) is sometimes called an adjoint solve, since in an inner product space this is equivalent to solving a linear system with the adjoint of the derivative operator: \[
\bar{g} = \left( D_1 f(g,z) \right)^{-*} \nabla h(g(z)).
\]
5.4.6 Variational notation
We will often use a concise notation for differential calculus sometimes called variational notation (as in “calculus of variations”). If \(f\) is differentiable at \(x\) and \[
y = f(x)
\] then in variational notation we would write \[
\delta y = f'(x) \, \delta x.
\] Here \(\delta x\) and \(\delta y\) are read as “variation in \(x\)” and “variation in \(y\),” and we describe the process of differentiating as “taking variations of \(y = f(x)\).” If we think of differentiable functions \(\tilde{x}(s)\) and \(\tilde{y}(s)\) with \(x = \tilde{x}(0)\) and \(y = \tilde{y}(0)\), then \(\delta x = \tilde{x}'(0)\) and \(\delta y = \tilde{y}'(0)\). Put differently, \[\begin{aligned}
\tilde{x}(\epsilon) &= x + \epsilon \, \delta x + o(\epsilon), \\
\tilde{y}(\epsilon) &= y + \epsilon \, \delta y + o(\epsilon),
\end{aligned}\] and similarly for other related quantities. If \(x\) and \(y\) have types that can be added, scaled, and multiplied together, we also have the standard rules \[\begin{aligned}
\delta (x+y) &= \delta x + \delta y \\
\delta (\alpha x) &= \alpha \, \delta x \\
\delta (xy) &= \delta x \, y + x \, \delta y.
\end{aligned}\] This last is derived in the way that we usually do: \[
(x + \epsilon \, \delta x + o(\epsilon))
(y + \epsilon \, \delta y + o(\epsilon)) =
xy + \epsilon(\delta x \, y + x \, \delta y) + o(\epsilon).
\] We note that this is true even for things like linear maps of the right types, which can be multiplied but do not commute.
Variational notation is sometimes tidier than other ways of writing derivative relationships. For example, suppose \(A \in L(\mathcal{V}, \mathcal{V})\) is an invertible linear map and we are interested in the variation in \(B = A^{-1}\) with respect to the variation in \(A\). We can see \(B\) as a differentiable function of \(A\) by the implicit function theorem on the equation \[
AB-I = 0.
\] Taking variations of this relationship, we have \[
(\delta A) B + A (\delta B) = 0,
\] which we can rearrange to \[
\delta B = -A^{-1} (\delta A) A^{-1}.
\] The formula for taking variations of \(A^{-1}\) generalizes the 1D formula \((x^{-1})' = -x^{-2}\), which is often used as a first example of implicit differentiation in introductory calculus courses. A short computation verifies that our formula agrees with a finite difference estimate for a small example.
let
    A  = [ 94.0 29.0 26.0 ;
           65.0 25.0 66.0 ;
           85.0 92.0 72.0 ]
    δA = [ 57.0 15.0 46.0 ;
           56.0 44.0 79.0 ;
           42.0 45.0 48.0 ]
    δB = -A\δA/A
    δB_fd = finite_diff(inv, A, δA)
    isapprox(δB, δB_fd, rtol=1e-6)
end
true
We could work out the same formula with the notation introduced in the previous sections, but the version involving variational notation is arguably easier to read and certainly shorter to write.
5.4.7 Adjoints
Consider a map \(\mu\) taking a vector input \(x\) to a scalar output \(y\) via vector intermediates \(u\) and \(v\), written in terms of the relations \[\begin{aligned}
u &= f(x) \\
v &= g(x, u) \\
y &= h(x, u, v)
\end{aligned}\] Then we have the derivative relationships \[\begin{aligned}
\delta u &= D_1 f \, \delta x \\
\delta v &= D_1 g \, \delta x + D_2 g \, \delta u \\
\delta y &= D_1 h \, \delta x + D_2 h \, \delta u + D_3 h \, \delta v
\end{aligned}\] We can rearrange this into the (quasi)matrix form \[
\begin{bmatrix}
1 & 0 & 0 & 0 \\
-D_1 f & 1 & 0 & 0 \\
-D_1 g & -D_2 g & 1 & 0 \\
-D_1 h & -D_2 h & -D_3 h & 1
\end{bmatrix}
\begin{bmatrix} \delta x \\ \delta u \\ \delta v \\ \delta y \end{bmatrix} =
\begin{bmatrix} I \\ 0 \\ 0 \\ 0 \end{bmatrix} \delta x.
\] That is, we can view the computation of \(\delta u\), \(\delta v\), and then \(\delta y\) from \(\delta x\) as an example of forward substitution for a 4-by-4 linear system.
If we truly only care about the relationship between \(x\) and \(y\), there is no reason why we should explicitly compute the intermediates \(\delta u\) and \(\delta v\). An equally good way to solve the problem is to solve the adjoint equation discussed in Section 5.4.5: \[
\begin{bmatrix} \bar{x}^* & \bar{u}^* & \bar{v}^* & \bar{y}^* \end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
-D_1 f & 1 & 0 & 0 \\
-D_1 g & -D_2 g & 1 & 0 \\
-D_1 h & -D_2 h & -D_3 h & 1
\end{bmatrix} =
\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}.
\] via backward substitution on the 4-by-4 linear system: \[\begin{aligned}
\bar{y}^* &= 1 \\
\bar{v}^* &= \bar{y}^* D_3 h \\
\bar{u}^* &= \bar{v}^* D_2 g + \bar{y}^* D_2 h \\
\bar{x}^* &= \bar{u}^* D_1 f + \bar{v}^* D_1 g + \bar{y}^* D_1 h.
\end{aligned}\] We observe that \[
\delta y = \bar{x}^* \, \delta x;
\] that is, \(\bar{x}^*\) is the derivative of the map from \(x\) to \(y\).
While the variations \(\delta u\), \(\delta v\), and \(\delta y\) lie in the same spaces as \(u\), \(v\), and \(y\), the dual variables \(\bar{u}^*, \bar{v}^*, \bar{y}^*\) (also called adjoint variables) live in the associated dual spaces. The dual variables can also be interpreted as the sensitivity of the output \(y\) to the associated primary variable, assuming earlier variables were held constant.
The idea of using dual variables to compute derivatives of a function with many intermediates (rather using variations as intermediate quantities) is the basis of adjoint mode or reverse mode automatic differentiation, which we will discuss further in Chapter 11.
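The sketch below (our own toy example, with linear maps so that all the partials are constant matrices) runs both the forward (variational) sweep and the adjoint sweep for the chain above and checks that they give the same directional derivative.

let
    # A linear instance of u = f(x), v = g(x,u), y = h(x,u,v)
    A, B, C = rand(2,2), rand(2,2), rand(2,2)
    a, b, c = rand(2), rand(2), rand(2)
    f(x) = A*x
    g(x, u) = B*x + C*u
    h(x, u, v) = a'*x + b'*u + c'*v
    μ(x) = h(x, f(x), g(x, f(x)))

    # Forward (variational) sweep for a particular x and δx
    x, δx = rand(2), rand(2)
    δu = A*δx
    δv = B*δx + C*δu
    δy = a'*δx + b'*δu + c'*δv

    # Adjoint (reverse) sweep: backward substitution for the dual variables
    ybar = 1.0
    vbar = ybar*c'                       # v̄* = ȳ* D_3 h
    ubar = vbar*C + ybar*b'              # ū* = v̄* D_2 g + ȳ* D_2 h
    xbar = ubar*A + vbar*B + ybar*a'     # x̄* = ū* D_1 f + v̄* D_1 g + ȳ* D_1 h

    # Both sweeps give the same directional derivative (checked against finite_diff)
    isapprox(δy, xbar*δx) && isapprox(δy, finite_diff(μ, x, δx), rtol=1e-6)
end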
5.4.8 Higher derivatives
5.4.8.1 Maps from \(\mathbb{R}\) to \(\mathbb{R}\)
A main reason — if not the main reason — why we care about higher derivatives is because of their role in approximation.
As a particular starting point, consider \(g : \Omega \subset \mathbb{R}
\rightarrow \mathbb{R}\) on an interval \(\Omega\). Assuming \(g\) has at least \(k+1\) continuous derivatives (i.e. \(g \in C^{k+1}\)), then Taylor’s theorem with integral remainder gives \[
g(t) = \sum_{j=0}^{k} \frac{t^j}{j!} g^{(j)}(0) + r_{k+1}(t),
\] where \[
r_{k+1}(t) = \int_0^t \frac{1}{k!} (t-s)^{k} g^{(k+1)}(s) \, ds.
\] This formula can be verified with integration by parts. By the mean value theorem, there is some \(\xi \in (0,t)\) such that \[
r_{k+1}(t) = \frac{t^{k+1}}{(k+1)!} g^{(k+1)}(\xi).
\] This is the mean value form of the remainder.
We will sometimes work with slightly less regular functions, e.g. \(g : \mathbb{R}\rightarrow \mathbb{R}\) with \(k\) continuous derivatives and \(g^{(k)}\) Lipschitz with constant \(C\) (which implies absolute continuity of \(g^{(k)}\)). In this case, we do not have the mean value form of the remainder, but can still use the integral form to get the bound \[
|r_{k+1}(t)| \leq \frac{C|t|^{k+1}}{(k+1)!}.
\] Even for \(C^{k+1}\) functions, these types of bounds are sometimes easier to work with than the mean value form of the remainder.
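For example (our own sketch), the degree-2 Taylor remainder of \(e^t\) about \(0\) should be \(O(t^3)\), which we can check with a log-log slope estimate in the spirit of the plot in Section 5.3.7.

let
    # Remainder of the degree-2 Taylor expansion of exp about 0
    r(t) = exp(t) - (1 + t + t^2/2)
    ts = 10.0.^(-3:-1)
    slope = (log(abs(r(ts[2]))) - log(abs(r(ts[1]))))/(log(ts[2]) - log(ts[1]))
    println("Approximate slope (should be near 3): ", slope)
end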
5.4.8.2 Maps from \(\mathcal{V}\) to \(\mathbb{R}\)
If \(f : \mathcal{V}\rightarrow \mathbb{R}\) is differentiable in an open set \(\Omega\) containing \(x\), the Frechet derivative is a function \[
f' : \mathcal{V}\rightarrow \mathcal{V}^* = L(\mathcal{V}, \mathbb{R})
\] where \(f'(x) \in \mathcal{V}^*\) maps directions to directional derivatives. If \(f'\) is differentiable, then \[
f'' : \mathcal{V}\rightarrow L(\mathcal{V}, \mathcal{V}^*).
The map \(f''(x)\) from \(\mathcal{V}\) to \(\mathcal{V}^*\) can also be thought of as a bilinear form taking two vectors in \(\mathcal{V}\) to \(\mathbb{R}\), i.e. \((u, v) \mapsto (f''(x) u) v\). Hence, we write \[
f'' : \mathcal{V}\rightarrow L(\mathcal{V}\otimes \mathcal{V}, \mathbb{R}).
\] When \(f''\) is continuously defined in some neighborhood of \(x\), the bilinear form is also guaranteed to be symmetric, i.e. \[
f''(x) \, (u \otimes v) = f''(x) (v \otimes u).
\] As discussed in Chapter 4, there is a 1-1 correspondence between symmetric bilinear forms and quadratic forms, and we will mostly be interested in \(f''(x)\) as representing the quadratic \(u \mapsto f''(x) \, (u \otimes u)\).
The matrix representation of \(f''(x)\) with respect to a particular basis \(V\) is called the Hessian matrix, and is sometimes written \(H_f(x)\). The entries of this matrix are \((H_f(x))_{ij} = f''(x) (v_i \otimes v_j)\), so that we write evaluation of the Hessian on a pair of vectors as \[
f''(x) (Vc \otimes Vd) = d^* H_f(x) c.
\] In most cases, we are interested in evaluating the quadratic form associated with \(f''(x)\), i.e. \[
f''(x) (Vc \otimes Vc) = c^* H_f(x) c.
\] Hessian matrices will play a central role in our discussion of numerical optimization methods.
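The sketch below (our own example) checks the quadratic form \(u^* H_f(x) u\) for a simple function on \(\mathbb{R}^2\) against a second difference of \(g(s) = f(x+su)\).

let
    # A made-up function on R^2 and its analytic Hessian
    f(x) = x[1]^2*x[2] + sin(x[1])
    Hf(x) = [2*x[2]-sin(x[1])  2*x[1];
             2*x[1]            0.0   ]

    # Second difference along direction u approximates f''(x)(u ⊗ u)
    x, u = [0.3, 1.7], [1.0, 2.0]
    h = 1e-4
    quad_fd = (f(x+h*u) - 2*f(x) + f(x-h*u))/h^2
    isapprox(u'*Hf(x)*u, quad_fd, rtol=1e-6)
end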
In thinking about first derivatives, we started by considering differentiation along a straight ray. We will do the same thing when thinking about using second derivatives. Suppose \(c_{x,u}(s) = x+su\) and \(c_{x,u}([0,t]) \subset \Omega\), and let \(g = f \circ c_{x,u}\). Suppose \(f \in C^2(\Omega)\) where \(f''\) is Lipschitz with constant \(M\) (with respect to an induced norm); that is: \[\begin{aligned}
\|f''(x)\| &= \max_{\|w\|=1} |f''(x) (w \otimes w)| \\
\|f''(x)-f''(y)\| &\leq M \|x-y\|.
\end{aligned}\] Then Taylor’s theorem with remainder gives \[
g(t) = g(0) + g'(0) t + \frac{1}{2} g''(0) t^2 + r(t),
\quad |r(t)| \leq \frac{M \|u\|^3}{6} t^3.
\] where \[\begin{aligned}
g'(0) &= f'(x) u \\
g''(0) &= f''(x) \, (u \otimes u).
\end{aligned}\] We usually skip the intermediate function \(g\), and write that when \(f''\) is Lipschitz and \(\Omega\) is convex (so that the line segment connecting \(x\) and \(x+u\) lies entirely within \(\Omega\)) then \[
f(x+u) = f(x) + f'(x) u + \frac{1}{2} f''(x) (u \otimes u) + O(\|u\|^3),
\] where we are only dealing with the asymptotics of the third-order term. When dealing with a concrete space \(\mathbb{R}^n\) (or when working in terms of a basis for \(\mathcal{V}\)), we usually write \[
f(x+u) = f(x) + f'(x) u + \frac{1}{2} u^* H_f(x) u + O(\|u\|^3).
\]
One can continue to take higher derivatives in a similar manner — e.g., if \(f''\) is differentiable, then \(f'''(x)\) is a trilinear form in \(L(\mathcal{V}\otimes \mathcal{V}\otimes \mathcal{V}, \mathbb{R})\), and we represent it with respect to a basis \(V\) as a concrete tensor with entries that we usually write as \(f_{,ijk} = f'''(x) \, (v_i \otimes v_j \otimes v_k)\). In the rare cases when we use more than second derivatives, we often revert to indicial notation for computing with functions on concrete spaces; with the summation convention in effect, the Taylor series becomes \[
f(x + u) = f(x) + f_{,i}(x) u_i + \frac{1}{2} f_{,ij}(x) u_i u_j +
\frac{1}{6} f_{,ijk}(x) u_i u_j u_k + \ldots.
\] The size and notational complexity goes up significantly as we move beyond second derivatives, enough so that we avoid higher derivatives if we can.
5.4.8.3 General maps
We now consider the case of \(f : \mathcal{V}\rightarrow \mathcal{W}\). If \(f\) is \(C^2\) on \(\Omega\), then the second derivative is \[
f'' : \mathcal{V}\rightarrow
L(\mathcal{V}, L(\mathcal{V}, \mathcal{W})) \equiv L(\mathcal{V}\otimes \mathcal{V}, \mathcal{W}),
\] and the bilinear map \(f''(x)\) is symmetric in the arguments, i.e. \[
f''(x) \, (u \otimes v) = f''(x) \, (v \otimes u).
\] Expanding to second order with remainder again looks like \[
f(x+u) = f(x) + f'(x) u + \frac{1}{2} f''(x) (u \otimes u) + O(\|u\|^3).
\] But when we want to compute with concrete quantities, we need three indices even to keep track of the second derivative term: with respect to bases \(V\) and \(W\), we write the concrete tensor \[
f_{i,jk} = \left[ W^{-1} f''(x) (v_j \otimes v_k) \right]_i
\] or, equivalently (using indices and the summation convention), \[
f''(x) (Vc \otimes Vd) = \mathbf{w}_i f_{i,jk} c_j d_k.
\]
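To make this concrete, here is a hedged sketch (assuming ForwardDiff.jl, and working in the standard bases so that \(W = I\)); the map F and the tensor name T are illustrative choices only. It assembles the 3-index tensor \(f_{i,jk}\) by differentiating the Jacobian a second time.

```julia
# Sketch: the second derivative of a map F : R^n -> R^m as a 3-index
# tensor T[i,j,k] = f_{i,jk} (standard bases, so W = I).  Assumes ForwardDiff.jl.
using ForwardDiff

F(x) = [x[1]^2 * x[2], sin(x[1]) + x[2]^3]   # F : R^2 -> R^2

x = [0.5, 1.5]
m, n = 2, 2

# Differentiate the (flattened) Jacobian once more to get all second derivatives.
J2 = ForwardDiff.jacobian(y -> vec(ForwardDiff.jacobian(F, y)), x)   # (m*n) x n
T  = reshape(J2, m, n, n)          # T[i, j, k] = ∂²F_i / ∂x_j ∂x_k

# Evaluate the bilinear map F''(x)(c ⊗ d) componentwise.
c, d = [1.0, 0.0], [0.0, 1.0]
bilinear = [sum(T[i, j, k] * c[j] * d[k] for j in 1:n, k in 1:n) for i in 1:m]
@show bilinear                     # equals [2*x[1], 0.0] for this F
```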
5.4.9 Analyticity
So far, we have considered functions on real vector spaces. But for some of what we have to say about matrix calculus later, it will be useful to also consider functions on the complex plane.
We begin with an algebra fact. Suppose that \(z = x+iy\) and \(c = a+ib\) are complex numbers written in rectangular form; then \[
cz = \begin{bmatrix} 1 & i \end{bmatrix}
\begin{bmatrix} a & -b \\ b & a \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}.
\] That is, the matrix mapping the real and imaginary components of \(z\) to the real and imaginary components of \(cz\) has the form \(aI + bJ\) where \(I\) is the identity and \(J\) is the 2-by-2 skew matrix \[
J = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}.
\] Hence, we can interpret multiplication by a complex number as a 2-by-2 matrix acting on the rectangular (real and imaginary) coordinates of \(z\) — but it is a matrix of a very specific type.
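A quick Julia sketch (standard library only, with arbitrarily chosen complex numbers) confirms the \(aI + bJ\) representation:

```julia
# Sketch: complex multiplication c*z as the 2x2 matrix a*I + b*J acting on
# the rectangular coordinates (x, y) of z.  Standard library only.
using LinearAlgebra

c = 2.0 + 3.0im
z = -1.0 + 0.5im

a, b = real(c), imag(c)
J = [0.0 -1.0; 1.0 0.0]
M = a*I + b*J                     # 2x2 matrix representing multiplication by c

xy = [real(z), imag(z)]
w  = M * xy                       # rectangular coordinates of c*z
@show w[1] + w[2]*im ≈ c*z        # true
```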
Now let \(f : \mathbb{C}\rightarrow \mathbb{C}\). We would like to say that \(f\) is differentiable at \(z\) if \[
f(z+w) = f(z) + f'(z) w + o(|w|).
\] Phrased in terms of rectangular coordinates (\(f = g+ih\), \(z = x+iy\), \(w = u+iv\)), we have that \[
\begin{bmatrix}
D_1 g(x,y) & D_2 g(x,y) \\
D_1 h(x,y) & D_2 h(x,y)
\end{bmatrix} =
\begin{bmatrix}
a(x,y) & -b(x,y) \\
b(x,y) & a(x,y)
\end{bmatrix}
\] where \(f'(z) = a(x,y) + i b(x,y)\); i.e. the partial derivatives of the real and imaginary components of \(f\) must satisfy the Cauchy-Riemann equations \[
D_1 g = D_2 h, \quad D_2 g = -D_1 h.
\] A function that is complex differentiable on an open set \(\Omega\) is called holomorphic (or (complex) analytic) on that set.
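A small numerical sketch (plain Julia, using central finite differences on the arbitrarily chosen holomorphic function \(\exp(z)\)) illustrates the Cauchy-Riemann equations:

```julia
# Sketch: numerically check the Cauchy-Riemann equations for the
# holomorphic function f(z) = exp(z) via central differences.
f(z) = exp(z)
g(x, y) = real(f(x + y*im))       # real part of f
h(x, y) = imag(f(x + y*im))       # imaginary part of f

x0, y0, δ = 0.3, -0.7, 1e-6
D1(u) = (u(x0 + δ, y0) - u(x0 - δ, y0)) / (2δ)   # ∂/∂x
D2(u) = (u(x0, y0 + δ) - u(x0, y0 - δ)) / (2δ)   # ∂/∂y

@show D1(g) ≈ D2(h)    # true (up to finite-difference error)
@show D2(g) ≈ -D1(h)   # true
```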
Holomorphic functions are about as nice as functions can be. Among other nice properties:
They are automatically infinitely (complex) differentiable,
They have convergent Taylor series about any point in the domain \(\Omega\),
The real and imaginary components are harmonic functions (i.e. \(D_1^2
g + D_2^2 g = 0\), and similarly for \(h\)), and
Contour integration around any loop \(\Gamma\) whose interior is wholly within \(\Omega\) gives \(\int_\Gamma f(z) \, dz = 0\).
We will heavily rely on this last fact (and related facts) when dealing with the contour integral view of functions of matrices in Section 5.7.
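As a quick numerical illustration of vanishing contour integrals, the following sketch (standard library only) approximates a contour integral around the unit circle with the trapezoid rule; the entire function \(\exp(z)\) integrates to (nearly) zero, while \(1/z\), which has a pole inside the contour, does not.

```julia
# Sketch: trapezoid-rule approximation of a contour integral around the
# unit circle.  For f holomorphic inside and on the contour the result is ~0;
# for f(z) = 1/z (pole at 0) it is 2πi.
function contour_integral(f; N = 400)
    θ = 2π .* (0:N-1) ./ N
    z = cis.(θ)                    # points on the unit circle
    dz = im .* z .* (2π / N)       # dz = i e^{iθ} dθ
    sum(f.(z) .* dz)
end

@show contour_integral(z -> exp(z))       # ≈ 0
@show contour_integral(z -> 1/z)          # ≈ 2πi
```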
5.5 Series
A series is an infinite sequence of terms \(v_1, v_2, \ldots\) that are (formally) added together. When students first learn about series in introductory calculus courses, the terms \(v_j\) are typically real or complex numbers. We will consider the more general case where the \(v_j\) are drawn from some normed vector space \(\mathcal{V}\), which we will often assume is complete (a Banach space). The partial sums are \[
S_n = \sum_{j=1}^n v_j;
\] and when \(\mathcal{V}\) is a Banach space and the partial sums form a Cauchy sequence, we say the series converges to \[
S = \sum_{j=1}^\infty v_j = \lim_{n \rightarrow \infty} S_n.
\] Series that do not converge (divergent series) are still useful in many situations; we simply need to understand that they are to be treated formally, and we might not be able to make sense of the limiting value.
5.5.1 Convergence tests
Many of the standard convergence tests for series in \(\mathbb{R}\) or \(\mathbb{C}\) generalize to series in a Banach space. The comparison test is of particular use:
Theorem 5.2 (Comparison test) Suppose \(\{v_j\}_{j=1}^\infty\) is a sequence in a Banach space \(\mathcal{V}\), where the terms are bounded by \(\|v_j\| \leq a_j\). Suppose \(\sum_{j=1}^\infty a_j\) converges. Then \(\sum_{j=1}^\infty v_j\) also converges absolutely; moreover, \[
\left\| \sum_{j=1}^\infty v_j \right\| \leq \sum_{j=1}^\infty a_j.
\]
Proof. Let \(A_n\) and \(A\) denote the \(n\)th partial sum and the limit for the series \(\sum_{j=1}^\infty a_j\), and let \(S_n\) denote the \(n\)th partial sum for the series \(\sum_{j=1}^\infty v_j\). By the triangle inequality, for any \(j \geq i\), we have \[
\|S_j-S_i\| = \left\| \sum_{k=i+1}^j v_k \right\| \leq
\sum_{k=i+1}^j \|v_k\| \leq A_j-A_i.
\] The fact that the partial sums \(A_n\) form a Cauchy sequence means that for any \(\epsilon > 0\), there exists \(N\) such that if \(j \geq i \geq N\), \[
|A_j-A_i| < \epsilon;
\] and because \(\|S_j-S_i\| \leq |A_j-A_i|\), the partial sums \(S_n\) also form a Cauchy sequence, and therefore converge to a limit \(S\).
The final bound comes from applying the triangle inequality to show \[
\|S_n\|
= \left\| \sum_{j=1}^n v_j \right\|
\leq \sum_{j=1}^n \|v_j\|
\leq A_n,
\] and from there arguing that the limit \(\|S\|\) is bounded by the limit \(A\).
Many other convergence tests similarly generalize to series in Banach spaces, often using the comparison test as a building block. An example is the ratio test:
Theorem 5.3 (Ratio test) Suppose \(\{v_j\}_{j=1}^\infty\) is a sequence in a Banach space \(\mathcal{V}\), where \(\lim_{n\rightarrow \infty} \|v_{n+1}\|/\|v_n\| = r < 1\). Then \(\sum_{j=1}^\infty v_j\) converges absolutely.
Proof. Let \(r < r' < 1\) (e.g. choose \(r' = (r+1)/2\)). Then by the hypothesis, there exists some \(N\) such that for all \(n \geq N\), \(\|v_{n+1}\|/\|v_n\| \leq r'\). Hence, for \(n \geq N\) we have \(\|v_n\| \leq (r')^{n-N} \|v_N\|\), and by the comparison test and convergence of the geometric series, the series \(\sum_j v_j\) converges, and moreover we have the bound \[\begin{aligned}
\left\| \sum_{n=1}^\infty v_n \right\|
&\leq \left\| \sum_{n=1}^{N-1} v_n \right\| + \sum_{n=N}^\infty
(r')^{n-N} \|v_N\| \\
&= \left\| \sum_{n=1}^{N-1} v_n \right\| + \frac{\|v_N\|}{1-r'}.
\end{aligned}\]
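As an example of these tests at work in a Banach space other than \(\mathbb{R}\) or \(\mathbb{C}\), consider the matrix series \(\sum_{n \geq 0} A^n/n!\): in any submultiplicative norm the terms satisfy \(\|v_{n+1}\| \leq (\|A\|/(n+1)) \, \|v_n\|\), so the series converges by the same comparison-with-a-geometric-series argument used in the ratio test. The following is a minimal sketch in Julia (standard library only) comparing the partial sums against the built-in matrix exponential.

```julia
# Sketch: the series sum_n A^n / n! in the Banach space of 2x2 matrices.
# The terms shrink factorially (||v_{n+1}|| <= ||A||/(n+1) * ||v_n||),
# so the series converges; its limit is exp(A).
using LinearAlgebra

A = [0.0 1.0; -1.0 0.0]

# Partial sum S_N = sum_{n=0}^{N} A^n / n!
function expm_series(A, N)
    S = Matrix{Float64}(I, size(A)...)
    v = Matrix{Float64}(I, size(A)...)
    for n in 1:N
        v = v * A / n             # v_n = A^n / n!
        S = S + v
    end
    return S
end

@show norm(expm_series(A, 20) - exp(A))   # small: partial sums approach exp(A)
```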
5.5.2 Function series
Frequently, we are interested in series where the terms are functions of some variable, e.g. \(v_j : \Omega \subset \mathbb{C}\rightarrow \mathcal{V}\) for some Banach space \(\mathcal{V}\). Often the functional dependence of the terms \(v_j\) on \(z\) is fairly simple, e.g. \(v_j(z) = \bar{v}_j z^{p_j}\) where the exponents \(p_j\) may be non-negative integers (a power series), positive and negative integers (a Laurent series), or fractions with a fixed denominator (a Puiseux series). We also often see terms of the form \(v_n(z) = \bar{v}_n \exp(inz)\) (a Fourier series).
When the terms themselves belong to some Banach space, we can analyze the convergence or divergence of the series as a whole just as we would analyze the convergence or divergence of any other series over a Banach space. For example, if \(\Omega\) is compact, then the set of continuous functions \(C(\Omega, \mathcal{V})\) (which are automatically uniformly continuous) is a Banach space. Convergence in this Banach space corresponds to uniform convergence of the functions, and uniform convergence of uniformly continuous functions gives a uniformly continuous limit, as noted before.
More generally, though, the terms may not belong to a Banach space of functions; or, if they do, the series may diverge in that space. In this case, we care about the convergence or divergence of the series \(\sum_{j=1}^\infty v_j(z)\) for specific values of \(z\); and, if we have pointwise convergence over some subset \(\Omega' \subset \Omega\), we care what properties the limiting function \(S : \Omega' \rightarrow \mathcal{V}\) might inherit from the terms. This issue can be subtle: for example, a Fourier series can easily converge at almost all points in some interval, yet converge to a discontinuous function even though all the terms are infinitely differentiable.
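The classic example is the Fourier series of a square wave, \((4/\pi) \sum_{k \text{ odd}} \sin(kx)/k\), whose terms are all smooth but whose pointwise limit is the step function \(\operatorname{sign}(\sin x)\). A minimal sketch in Julia:

```julia
# Sketch: partial sums of the Fourier series of a square wave,
#   (4/π) * sum_{k odd} sin(k x)/k,
# whose terms are smooth but whose pointwise limit is discontinuous.
square_partial(x, N) = (4/π) * sum(sin(k*x)/k for k in 1:2:N)

for x in (0.5, 1.5, -1.0)
    println("x = $x: partial sum = $(square_partial(x, 2001)),  sign(sin(x)) = $(sign(sin(x)))")
end
```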
Function series can also be extremely useful even where they diverge, both for formal manipulation and for actual calculations where the approximation properties of finite sums are more important than the limiting behavior of the series.
5.5.3 Formal series
A formal power series is a series of the form \[
S(z) = \sum_{j=0}^\infty a_j z^j
\] where the coefficients \(a_j\) belong to some appropriate vector space \(\mathcal{V}\). In some circumstances, it also makes sense to include negative powers, in which case we sometimes refer to this as a formal Laurent series. Formal power series appear in a variety of applications. We will mostly see them in signal processing, where such a formal power series is also called the \(z\)-transform of the sequence of coefficients \(a_j\). In that setting, it is particularly useful not only to add and scale formal power series, but also to multiply them together using the rule \[
\left( \sum_{j=0}^\infty a_j z^j \right)
\left( \sum_{j=0}^\infty b_j z^j \right) =
\sum_{k=0}^\infty \left( \sum_{i+j=k} a_i b_j \right) z^k;
\] that is, the coefficients in the product are the convolution of the coefficient sequences of the multiplicand series. Hence, the power series can be seen as a convenient way of dealing with the book-keeping of tracking sequences and their convolutions, independent of whether things converge. Formal power series (along with other formal function series) also play a key role in combinatorics and statistics, often under the name of “generating functions” (see Wilf (2006)).
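A minimal sketch of this book-keeping in Julia, using truncated coefficient vectors (with the convention that a[j+1] stores the coefficient of \(z^j\)):

```julia
# Sketch: multiplying two (truncated) formal power series by convolving
# their coefficient sequences.  a[j+1] stores the coefficient of z^j.
function convolve(a, b)
    c = zeros(length(a) + length(b) - 1)
    for i in eachindex(a), j in eachindex(b)
        c[i+j-1] += a[i] * b[j]
    end
    return c
end

a = [1.0, 1.0]            # 1 + z
b = [1.0, 2.0, 1.0]       # 1 + 2z + z^2
@show convolve(a, b)      # [1.0, 3.0, 3.0, 1.0], i.e. 1 + 3z + 3z^2 + z^3
```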
If \(\mathcal{V}\) is a Banach space and \(\log \|a_n\| = O(n)\), the ordinary formal power series associated with the coefficient sequence \(a_j\) will converge for values of \(z \in \mathbb{C}\) close enough to zero. Under such assumptions (which often hold), we can treat the formal power series as defining a function, and can bring the machinery of analytic function theory to bear. Nonetheless, it is helpful to distinguish the formal power series — essentially a trick for indexing elements of a series — from the associated function to which it converges. Only the former is needed for tracking convolutions and the like.
5.5.4 Asymptotic series
An asymptotic series for a function \(f\) is a formal function series with terms \(v_n(z)\) such that \[
f(z) - \sum_{n=0}^{N-1} v_n(z) = o(\|v_{N-1}(z)\|)
\] for some limiting behavior in \(z\) (usually \(z \rightarrow 0\) or \(z
\rightarrow \infty\)). Asymptotic series often do not converge, but are nonetheless useful for approximation.
A standard example of an asymptotic series is Stirling’s approximation for the gamma function (generalized factorial function): \[
\Gamma(z+1) \sim
\sqrt{2\pi z} \left( \frac{z}{e} \right)^z \left( 1 + \frac{1}{12}
z^{-1} + \frac{1}{288} z^{-2} - \frac{139}{51840} z^{-3} - O(z^{-4}) \right).
\] Here the symbol \(\sim\) is used to denote that the series is asymptotic to the function, but does not necessarily converge to the function in the limit of an infinite number of terms. Another standard example is the asymptotic expansion for the tail of the standard normal cdf (where \(\phi\) denotes the standard normal pdf): \[
1 - \Phi(z) = \phi(z) z^{-1} \left( 1 - z^{-2} + 3z^{-4} + O(z^{-6}) \right).
\] Both of these expansions can be derived via Laplace's method, which is also very useful for approximating integrals arising in Bayesian inference. These types of asymptotic approximations are typically complementary to numerical methods, providing very accurate estimates precisely in the regions where more conventional numerical approaches have the most difficulty.
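The following hedged sketch (assuming the SpecialFunctions.jl package is available for the gamma function) shows how accurate a fixed truncation of the Stirling series already is, and how the accuracy improves as \(z\) grows.

```julia
# Sketch: relative error of the truncated Stirling series for Γ(z+1).
# Assumes the SpecialFunctions.jl package is installed (for `gamma`).
using SpecialFunctions

stirling(z) = sqrt(2π*z) * (z/ℯ)^z *
              (1 + 1/(12z) + 1/(288z^2) - 139/(51840z^3))

for z in (5.0, 10.0, 20.0)
    relerr = abs(stirling(z) - gamma(z + 1)) / gamma(z + 1)
    println("z = $z: relative error = $relerr")
end
# For a fixed number of terms, the error improves as z grows.
```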
5.5.5 Analytic functions
While formal series and asymptotic approximation are nice, even nicer things can happen for power series that converge. When a series \[
f(z) = \sum_{j=0}^\infty c_j (z-z_0)^j
\] converges in a neighborhood of \(z_0\) (in \(\mathbb{R}\) or \(\mathbb{C}\)), we say the function \(f\) is analytic at \(z_0\). Such a function is automatically infinitely differentiable in a neighborhood of \(z_0\). However, there are (useful) examples of non-analytic functions on \(\mathbb{R}\) that are infinitely differentiable, e.g. \[
g(x) =
\begin{cases}
\exp(-1/x), & x > 0 \\
0 & \mbox{otherwise}
\end{cases}
\]
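All derivatives of \(g\) at zero vanish, so its Taylor series at zero is identically zero and cannot converge to \(g\) on any neighborhood of zero. A minimal sketch showing that \(g\) vanishes faster than any power of \(x\) near zero:

```julia
# Sketch: the smooth-but-not-analytic function g(x) = exp(-1/x) (x > 0), 0 otherwise,
# vanishes faster than any power of x at 0, so its Taylor series at 0 is identically zero.
g(x) = x > 0 ? exp(-1/x) : 0.0

for k in (1, 5, 10)
    println("k = $k: g(x)/x^k at x = 0.01 is $(g(0.01)/0.01^k)")
end
```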
Complex analyticity around a point \(z_0 \in \mathbb{C}\) (existence of a convergent Taylor series) and the property of being holomorphic around a point (having complex derivatives that satisfy the Cauchy-Riemann equations) are equivalent to each other. Functions that are real analytic in the neighborhood of a point in \(\mathbb{R}\) are also complex analytic in a neighborhood. However, a function that is real analytic over all of \(\mathbb{R}\) may not extend to a function that is complex analytic over all of \(\mathbb{C}\).
The property of analyticity at a point \(z_0\) is closed under addition, scaling, multiplication of functions, integration, differentiation, and composition. It is also closed under division, assuming that the denominator function is nonzero at \(z_0\).
“I turn away with fright and horror from this lamentable evil of functions which do not have derivatives.” – Hermite (according to Kline 1990, 973)↩︎
“But this long run is a misleading guide to current affairs. In the long run we are all dead. Economists set themselves too easy, too useless a task if in tempestuous seasons they can only tell us that when the storm is long past the ocean will be flat again.” — John Maynard Keynes, (from p.80, Keynes 1929)↩︎
A topological space is a set \(X\) and a collection of sets (called the open sets) that include the empty set and all of \(X\) and are closed under unions and finite intersections.↩︎
In general topological spaces, compactness and sequential compactness are not the same thing. Fortunately, the two notions coincide in metric spaces (assuming the axiom of choice).↩︎