6  Optimization theory

6.1 Optimality conditions

We will frequently phrase model fitting problems in terms of optimization, which means we need to remember how optimality conditions work. This includes not only remembering first-order and second-order optimality conditions for derivatives, but also knowing a little about optimality conditions for non-differentiable functions. The key concept behind optimality conditions is to locally approximate a function in terms of a local polynomial approximation, where it usually suffices to go out to first or second order.

6.1.1 Minima and maxima

Let \(f : \Omega \subset \mathcal{V}\rightarrow \mathbb{R}\) be a continuous function. The set \(\Omega\) is the feasible set. We say the problem of minimizing or maximizing \(f\) is an unconstrained optimization problem when \(\Omega = \mathcal{V}\) (all points in the space are feasible). Otherwise, it is a constrained optimization problem.

We say \(x \in \Omega\) is a local minimizer if any \(y \in \Omega\) close enough to \(x\) satisfies \(f(y) \geq f(x)\) (more precisely: there is some \(\epsilon > 0\) so that for all \(y\) within \(\epsilon\) of \(x\), \(f(y) \geq f(x)\)). The value \(f(x)\) is the local minimum value. For a strong local minimizer, the inequality is strict – any \(y \neq x\) in a neighborhood of \(x\) satisfies \(f(y) > f(x)\). We say \(x\) is a global minimizer if \(f(x) \leq f(y)\) for all \(y \in \Omega\), and a strong global minimizer if \(f(x) < f(y)\) for all \(y \in \Omega\) other than \(x\).

We can define a local maximizer or a strong local maximizer analogously, but we usually focus on the case of minimization rather than maximization (and in any event, maximizing \(f\) is the same as minimizing \(-f\)).

When the feasible set \(\Omega\) is compact (for finite-dimensional vector spaces, this means closed and bounded), we are guaranteed that there is a global minimizer in \(\Omega\). In other cases, \(f\) may not be bounded below, or it may be bounded below but with no point where the greatest lower bound (the infimum) is achieved.

6.1.2 Derivative and gradient

If \(f : \mathcal{V}\rightarrow \mathbb{R}\) is differentiable, a stationary point is a point \(x\) such that \(f'(x) = 0\). At a non-stationary point, there is guaranteed to be some direction \(u\) such that \(f'(x) u > 0\) (and \(f'(x) (-u) < 0\)), so that \(f\) can neither attain an (unconstrained) minimum nor maximum at that point. Stationary points can be minima, maxima, or saddles; we usually classify them as such by the second derivative test.

In an inner product space, the gradient \(\nabla f(x)\) (the Riesz map of \(f'(x)\)) gives us the “direction of steepest ascent,” and the negative gradient gives us the direction of steepest descent. It is important to realize that this direction depends on the inner product used! Moreover, the concept of steepest ascent or descent generalizes to other normed linear spaces where we do not assume an inner product: all we need is the notion of the change in the function value relative to the distance from a starting point. For example, if \(\mathcal{V}\) is \(\mathbb{R}^n\) with the \(\ell^\infty\) norm, then the direction of steepest descent (or a direction, if \(f'(x)\) has some zero components) is given by a vector of \(s\) where \(s_j = \operatorname{sign}(f'(x)_j)\); here \(s\) is the vector of unit \(l^\infty\) norm such that \(f'(x) s\) is maximal.

6.1.3 Second derivative test

When \(f: \mathcal{V}\rightarrow \mathbb{R}\) is twice differentiable, we can characterize the function near a stationary point \(x\) by the second derivative: \[ f(x+u) = f(x) \frac{1}{2} f''(x) (u \otimes u) + o(\|u\|^2). \] The point \(x\) is a local minimizer if the quadratic form given by \(f''(x)\) is positive definite and a local maximizer if \(f''(x)\) is negative definite. If the Hessian is semidefinite, we need to look at higher derivatives to classify \(x\). If \(f''(x)\) is strongly indefinite (i.e. the inertia has nonzero positive and negative components), then \(x\) is a saddle point.

For computation, of course, we usually express \(f''(x)\) with respect to a basis. In this case, we describe the second derivative test in terms of the inertia of the Hessian matrix \(H_f(x)\).

6.1.4 Constraints and cones

We have described the first and second derivative tests in the context of unconstrained optimization, but the general approach is equally sensible for constrained optimization problems. We generally define the feasible domain \(\Omega\) in terms of a set of constraints, equations and inequalities involving functions \(c_i : \mathcal{V}\rightarrow \mathbb{R}\) that (for our purposes) are at least continuously differentiable: \[\begin{aligned} c_i(x) &= 0, & i &\in \mathcal{E}, \\ c_i(x) & \leq 0, & i &\in \mathcal{I}. \end{aligned}\] We say the \(i\)th inequality constraint is active at \(x\) if \(c_i(x) = 0\). We write the active set at \(x\) as \[ \mathcal{A}(x) = \{ i \in \mathcal{I} : c_i(x) = 0 \}. \] We do not worry about trying to classify the equality constraints, though in a sense they are always active.

6.1.4.1 Tangent cone

The first derivative test tells us that if \(f\) is locally minimized at \(x\), then a local linear model should not predict a nearby point with smaller function values. That is, there should not be \(f'(x) u < 0\) and a (differentiable) feasible path \(\gamma : [0,\tau) \rightarrow \mathcal{V}\) with \(\gamma(0) = x\) and \(\gamma'(0) = u\), or we will have \[ f(\gamma(\epsilon)) = f(x) + \epsilon f'(x) u + o(\epsilon) < f(x). \] for all sufficiently small \(\epsilon > 0\). The set of directions \(u\) that are tangent to some feasible path at \(x\) is known as the tangent cone at \(x\). A function \(f\) satisfies a (first order) geometric optimality condition at \(x\) if \[ \forall u \in T(x), f'(x) u \geq 0. \] where \(T(x)\) denotes the tangent cone. Equivalently, \[ -f'(x) \in T(x)^o, \] where \(T(x)^o\) is the polar cone of \(T(x)\), i.e. \[ T(x)^o = \{ w^* \in \mathcal{V}^* : \forall u \in T(x), w^* u \leq 0 \}. \] A local minimizer must satisfy this condition, though not all points that satisfy this condition need be local minimizers.

6.1.4.2 Linearized cone

The trouble with the geometric optimality condition is that we somehow need to characterize the tangent cone at each feasible point. Because we are characterizing the feasible set in terms of differentiable constraint functions, it is tempting to try to characterize feasible directions via the derivatives of the constraints by looking at the linearized cone \[\begin{aligned} L(x) = \left\{ u \in \mathcal{V}: \right. \quad & \left( \forall i \in \mathcal{E}, c_i'(x) u = 0 \right) \wedge \\ & \left. \left( \forall i \in \mathcal{A}(x), c_i'(x) u \leq 0 \right) \right\}. \end{aligned}\] It is true that \(T(x) \subseteq L(x)\), and the polars satisfy \(L(x)^o \subseteq T(x)^o\). Unfortunately, the linearized cone can be strictly bigger than the tangent cone. This is true even in 1D. For example, consider the domain \(x \geq 0\) written as \(c(x) = -x^3 \leq 0\). The tangent cone at \(x = 0\) is \(T(0) = \{ u \in \mathbb{R}: u \geq 0 \}\), but the linearized cone is \(L(0) = \mathbb{R}\). Hence, if we seek to minimize \(f(x) = x\) subject to \(-x^3 \leq 0\), the point \(x = 0\) satisfies the geometric optimality condition, but the condition is not satisfied if we replace the tangent cone with the linearized cone.

6.1.4.3 Constraint qualification

A constraint qualification condition is a hypothesis on the constraints that guarantees that \(L^o(x) = T^o(x)\), so that we can use the linearized cone in lieu of the tangent cone in the geometric optimality condition. Two common examples are the linearly independent constraint qualification (LICQ), which holds at \(x\) if \[ \{ c_i'(x) : i \in \mathcal{A}(x) \cup \mathcal{E} \} \mbox{ is linearly independent.} \] and the Mangasarian-Fromovitz constraint qualification (MFCQ), which holds at \(x\) if \[\begin{aligned} & \{ c_i'(x) : i \in \mathcal{E} \} \mbox{ is linearly independent and} \\ & \exists u \in \mathcal{V} : c_i'(x) u < 0 \mbox{ for } i \in \mathcal{A}(x) \mbox{ and } c_i'(x) u = 0 \mbox{ for } i \in \mathcal{E}. \end{aligned}\] Note that our example of the constraint \(-x^3 \leq 0\) satisfies neither LICQ nor MFCQ at \(0\).

6.1.5 Multipliers and KKT

We have already seen one theorem of alternatives in our discussion of linear algebra — the Fredholm alternative theorem, which deals with solvability of linear equations. There are many other theorems of alternatives for dealing with inequalities, associated with mathematicians like Motzkin, Kuhn, and Farkas. One such theorem is Farkas’ lemma: if \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^m\), then exactly one of the two statements is true:

  • There exists \(x \in \mathbb{R}^n\) such that \(Ax = b\) and \(x \geq 0\)
  • There exists \(y \in \mathbb{R}^m\) such that \(y^* A \geq 0\) and \(y^* b < 0\)

Here the inequalities are taken elementwise.

Using Farkas’s lemma, one can rewrite the polar of the linearized cone as \[ L(x)^o = \left\{ w^* \in \mathcal{V}^* : w^* = \sum_{i \in \mathcal{E}} \lambda_i c_i'(x) + \sum_{i \in \mathcal{A}(x)} \mu_i c_i'(x), \mu_i \geq 0 \right\}. \] It is usually convenient to define \(\mu_i = 0\) for inequality constraints that are inactive; in this case, we can rewrite \[\begin{aligned} L(x)^o = \{ w^* \in \mathcal{V}^* : \quad & w^* = \lambda^* c_{\mathcal{E}}'(x) + \mu^* c_{\mathcal{I}}'(x), \\ & \mu^* c_{\mathcal{I}}(x) = 0, \mu^* \geq 0 \}, \end{aligned}\] where \(c_{\mathcal{E}}(x)\) and \(c_{\mathcal{I}}(x)\) are (column) vectors of equality and inequality constraint functions and \(\lambda^*\) and \(\mu^*\) are concrete row vectors. The statement that \(\mu^* c_{\mathcal{I}}(x) = 0\), typically called a complementary slackness condition is equivalent to saying that inactive inequality constraints (where \(c_i(x) < 0\)) must be associated with zero multipliers.

With this machinery in place, we can now rewrite the condition \(-f'(x) \in L(x)^o\) in the form that we usually use. We define the Lagrangian function \[ \mathcal{L}(x, \mu^*, \lambda^*) = f(x) + \lambda^* c_{\mathcal{E}}(x) + \mu^* c_{\mathcal{I}}(x) \] The variables \(\lambda^*\) and \(\mu^*\) that multiply the constraints are known as Lagrange multipliers. The Karush-Kuhn-Tucker (KKT) conditions at \(x_*\), equivalent to \(-f'(x_*) \in L(x^*)^o\) are \[\begin{aligned} D_1 \mathcal{L}(x_*, \lambda, \mu) &= 0 & \mbox{constrained stationarity}\\ c_{\mathcal{E}}(x_*) &= 0, & \mbox{primal feasibility (equality)}\\ c_{\mathcal{I}}(x_*) & \leq 0, & \mbox{primal feasibility (inequality)}\\ \mu & \geq 0, & \mbox{dual feasibility}\\ \mu^* c_{\mathcal{I}}(x_*) &= 0, & \mbox{complementary slackness.} \end{aligned}\] When a constraint qualification condition holds, the KKT conditions are necessary first-order conditions for optimality at \(x_*\).

6.1.6 Multipliers and adjoints

Now suppose that we wanted to find a minimum (or maximum or other stationary point) of \(y\) subject to the equations we saw in Section 5.4.7 \[\begin{aligned} u-f(x) &= 0 \\ v-g(x,u) &= 0 \\ y-h(x,u,v) &= 0 \end{aligned}\] The Lagrangian for this system is \[ \mathcal{L} = y - \bar{u}^* (u-f(x)) - \bar{v}^* (v-g(x,u)) - \bar{y}^* (y-h(x,u,v)) \] where \(x, u, v, y\) are the primal variables and \(\bar{u}^*, \bar{v}^*, \bar{y}^*\) are multipliers1. Stationarity gives \[ \begin{bmatrix} \bar{u}^* & \bar{v}^* & \bar{y}^* \end{bmatrix} \begin{bmatrix} -D_1 f & 1 & 0 & 0 \\ -D_1 g & -D_2 g & 1 & 0 \\ -D_1 h & -D_2 h & -D_3 h & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}. \] We can write this equivalently as \(\bar{x}^* = 0\) where \[ \begin{bmatrix} \bar{x}^* & \bar{u}^* & \bar{v}^* & \bar{y}^* \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ -D_1 f & 1 & 0 & 0 \\ -D_1 g & -D_2 g & 1 & 0 \\ -D_1 h & -D_2 h & -D_3 h & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}. \] This is precisely the set of equations that we saw in Section 5.4.7, except that where we said “dual variables” in Section 5.4.7, here we are saying (negative) “Lagrange multipliers.”

6.1.7 Mechanical analogies

6.1.8 Constrained second derivatives

The second-order geometric optimality condition for a twice-differentiable \(f\) at \(x\) is a straightforward extension of the first-order geometric optimality condition. If \(x\) satisfies the first-order condition \[ \forall u \in T(x), f'(x) u \geq 0, \] and we satisfy that \[ \forall 0 \neq u \in T(x) \mbox{ s.t. } f'(x) u = 0, f''(x) (u \otimes u) > 0, \] then \(x\) is a constrained local minimum. If there are nonzero directions \(u \in T(x)\) where \(f'(x) u = 0\) and \(f'(x) (u \otimes u) < 0\), then \(x\) is not a constrained local minimum. Otherwise, we cannot determine minimality without looking at higher derivatives.

As with the geometric first-order condition, the geometric second-order condition is complicated by the need to get a handle on the tangent cone. Assuming that we satisfy the linearly independent constraint qualification condition at \(x\), we have a more straightforward version of the second derivative test. Let \[ \mathcal{J} = \mathcal{E} \cup \{ i \in \mathcal{I} : \mu_i > 0 \}. \] Then a sufficient condition for \(x\) to be a strong constrained local minimizer is if \[ \forall \mbox{ nonzero } u \in \mathcal{N}(c'_{\mathcal{J}}(x)), \quad f''(x) (u \otimes u) > 0. \] That is, the Hessian should be positive definite in all directions that are not already “uphill” at first order. Such a condition is known as conditional positive definiteness, since \(f''(x)\) is positive definite conditioned on some constraints on the directions considered. If there are such directions for which \(f''(x) (u \otimes u) < 0\), then we do not have a local minimizer; otherwise, we may need to consider higher derivatives to make a diagnosis.

6.2 Vector optimization

6.3 Convexity

A set \(\Omega \subset \mathcal{V}\) is convex if every line segment between points in \(\Omega\) also lies in \(\Omega\), i.e. \[ \forall x, y \in \Omega, \forall s \in [0,1], (1-s)x + s y \in \Omega. \] A function \(f : \Omega \subset \mathcal{V}\rightarrow \mathbb{R}\) is convex if the graph of \(f\) on a line segment in \(\Omega\) always lies below the secant connecting the endpoint values: \[ \forall x, y \in \Omega, \forall s \in [0,1], f((1-s) x + sy) \leq (1-s) f(x) + s f(y). \] We say \(f\) is strictly convex if the inequality is strict on the interior of the segment connecting \(x\) and \(y\): \[ \forall x \neq y \in \Omega, \forall s \in (0,1), f((1-s) x + sy) < (1-s) f(x) + s f(y). \] If \(f\) is twice differentiable, convexity corresponds to positive semi-definiteness of the Hessian, and strict convexity corresponds to positive definiteness of the Hessian. However, functions can be convex even if they are not twice differentiable.

A convex function on a closed, bounded set in finite dimensions is guaranteed to take on its minimum value \(f_{\min}\) at some point \(x \in \Omega\). There are no local minimizers, only global minimizers. The set of global minimizers \(\{x \in \Omega : f(x) = f_{\min}\}\) is itself a convex set. If \(f\) is strictly convex, then the global minimizer is unique.

6.3.1 Subderivatives

A subderivative of a convex \(f\) at \(x \in \Omega\) is the set of functionals \(w^* \in \mathcal{V}\) such that \(f(x+u) \geq f(x) + w^* u\) for all \(x + u \in \Omega\); that is, the subderivative corresponds to the set of “supporting hyperplanes” that agree with \(f(x)\) and lie below \(f\) elsewhere on \(\Omega\). When \(\mathcal{V}\) is an inner product space, we say \(w \in \mathcal{V}\) is in the subgradient at \(x \in \Omega\) if \(f(x+u) \geq f(x) + \langle u, w \rangle\) whenever \(x + u \in \Omega\).

When \(f\) is differentiable at \(x\), the only element of the subderivative is the derivative (and the only element of the subgradient is the gradient). However, the concept of a subderivative continues to make sense even for nonsmooth functions. For example, the absolute value function \(x \mapsto |x|\) on \(\mathbb{R}\) is not differentiable at 0, but has a subgradient \([-1,1]\).

The notion of a subderivative allows us to generalize the usual notion of stationary points for differentiable functions: a point \(x \in \Omega\) is a stationary point for a convex \(f\) if the subderivative of \(f\) at \(x\) contains the \(0 \in \mathcal{V}^*\), and a minimizer must be a stationary point.


  1. For equality constraints, the signs of the Lagrange multipliers are unimportant.↩︎