
Project 3: Adaptive Splines

Due: 2022-04-22

A polyharmonic spline is a type of function approximator for functions from $\mathbb{R}^d \to \mathbb{R}$ of the form

$$s(x) = \sum_{j=1}^m \phi(\|x - u_j\|)\, a_j + \gamma_{1:d}^T x + \gamma_{d+1}$$

where

$$\phi(\rho) = \begin{cases} \rho^k \log(\rho), & k \text{ even}, \\ \rho^k, & k \text{ odd}. \end{cases}$$

Common examples include cubic splines ($k=3$) and thin plate splines ($k=2$). We will consider cubic splines here.

We will touch on two optimization questions in this project. First, how do we use splines to help optimization? Second, how do we best choose the coefficients $a$ and $\gamma$ and the centers $\{u_j\}_{j=1}^m$?


Code setup

In the interest of making all our lives easier, we have a fair amount of starter code for this project. You are not responsible for all of it in detail, but I certainly recommend at least becoming familiar with the interfaces!


Basis functions and evaluation

As a point of comparison, we will want to look at the ordinary spline procedure (where there are as many centers as there are data points, and data is given at the centers). To start with, we need ϕ and its derivatives.

Provided: Hϕ (and related basis-function routines).
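
To make the basis concrete, here is a minimal sketch for the cubic case; the _cubic names are illustrative, and the provided routines may be organized differently.

```julia
using LinearAlgebra

# Minimal sketch of the cubic basis (k = 3) viewed as a function of the
# vector argument: ϕ(ρ) = ρ³ with ρ = ‖x‖, plus its gradient and Hessian.
# Illustrative only — the provided Hϕ and friends may differ in detail.
ϕ_cubic(x)  = norm(x)^3
∇ϕ_cubic(x) = 3*norm(x)*x                  # ∇ϕ = 3ρ x

function Hϕ_cubic(x)
    ρ = norm(x)
    ρ == 0 && return zeros(length(x), length(x))
    return 3*(x*x')/ρ + 3*ρ*I             # Hessian: 3xxᵀ/ρ + 3ρI
end
```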

When evaluating the spline, it will be convenient to refer to a and γ together in one vector, which we will call c.

Provided: spline_eval (two methods).
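
As a sketch of that convention (assuming centers are stored as columns of a d-by-m matrix U and c = [a; γ]; the provided spline_eval may organize its arguments differently):

```julia
using LinearAlgebra

# Sketch of spline evaluation with c = [a; γ]: centers are columns of U (d × m).
# Assumed layout — check the provided spline_eval for the actual interface.
function spline_eval_sketch(ϕ, U, c, x)
    d, m = size(U)
    a, γ = c[1:m], c[m+1:end]            # γ has length d+1
    s = sum(a[j]*ϕ(norm(x - U[:,j])) for j in 1:m)
    return s + dot(γ[1:d], x) + γ[d+1]
end
```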

We are frequently going to care about the residual on our test function (over many evaluation points), so we also provide small helpers for that computation and for plotting the spline and the residual.

Provided: plot_spline.

Fitting the spline

We consider two methods for fitting the spline. For both, we need to compute two matrices:

  • The kernel matrix $K_{XU}$ where $(K_{XU})_{ij} = \phi(\|x_i - u_j\|)$

  • The polynomial part $\Pi_X$ consisting of rows of the form $[x_i^T \; 1]$

Provided: spline_Π (and the kernel matrix construction).
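
As a sketch of these two constructions (points as columns of X, centers as columns of U; the _sketch names are illustrative, not the starter code's):

```julia
using LinearAlgebra

# Sketch: kernel matrix (K_XU)ᵢⱼ = ϕ(‖xᵢ - uⱼ‖) and polynomial part Π_X
# with rows [xᵢᵀ 1]. Points are columns of X (d × n), centers columns of U (d × m).
spline_K_sketch(ϕ, X, U) =
    [ϕ(norm(X[:,i] - U[:,j])) for i in 1:size(X,2), j in 1:size(U,2)]
spline_Π_sketch(X) = [X' ones(size(X,2))]
```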

Later, we will also have need for the Jacobian of $K_{XU}$ with respect to the components of $u$.

Provided: Dspline_K.

Standard spline fitting

In the standard scheme, we have one center at each data point. Fitting the spline involves solving the system

$$\begin{bmatrix} K + \sigma^2 I & \Pi \\ \Pi^T & 0 \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$

The first $n$ equations are interpolation conditions. The last $d+1$ equations are sometimes called the "discrete orthogonality" conditions. When the regularization parameter $\sigma^2$ is nonzero, we have a smoothing spline.

Provided: spline_fit (two methods).
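
As a rough illustration of what this solve involves (building on the sketch helpers above, not the provided spline_fit):

```julia
using LinearAlgebra

# Sketch of the standard fit: one center per data point, solve the
# bordered system [K + σ²I  Π; Πᵀ  0][a; γ] = [y; 0].
# Uses spline_K_sketch / spline_Π_sketch from the earlier sketch.
function spline_fit_sketch(ϕ, X, y; σ=0.0)
    K = spline_K_sketch(ϕ, X, X)
    Π = spline_Π_sketch(X)
    p = size(Π, 2)                        # p = d+1
    A = [(K + σ^2*I)  Π;
         Π'           zeros(p, p)]
    return A \ [y; zeros(p)]              # c = [a; γ]
end
```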

Least squares fitting

When the number of data points is large, we may want to use a smaller number of centers as the basis for our approximation. That is, we consider an approximator of the form

$$s(x) = \sum_{j=1}^m \phi(\|x - u_j\|)\, a_j + \gamma_{1:d}^T x + \gamma_{d+1}$$

where the locations $u_j$ are decoupled from the $x_i$. We determine the coefficients $a$ and $\gamma$ by a least squares problem

$$\text{minimize } \left\| \begin{bmatrix} K_{XU} & \Pi \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} - y \right\|^2 + \sigma^2 \|a\|^2$$

Provided: spline_fit (two more methods, for least squares fitting).
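
A bare-bones version of this regularized least squares fit might look like the sketch below (again reusing the earlier sketch helpers rather than the provided code):

```julia
using LinearAlgebra

# Sketch of the regularized least squares fit: stack the Tikhonov term σ‖a‖
# as extra rows and let backslash solve the resulting least squares problem.
function spline_fit_ls_sketch(ϕ, X, U, y; σ=0.0)
    m = size(U, 2)
    K = spline_K_sketch(ϕ, X, U)
    Π = spline_Π_sketch(X)
    A = [K                  Π;
         σ*Matrix(I, m, m)  zeros(m, size(Π, 2))]
    return A \ [y; zeros(m)]              # c = [a; γ]
end
```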

Sampling strategies

Splines are fit to sampled data, so we want a number of ways to draw samples. Our example is 2D, so we will stick to 2D samplers.

We start with samplers on a regular mesh of points $(x_i, y_j)$ (sometimes called a tensor product mesh). These get big very fast in high-dimensional spaces, but are fine in 2D.

Provided: meshgrid, meshgrid_uniform, meshgrid_cheb.
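
For concreteness, a uniform tensor-product sampler might look like this sketch (the provided meshgrid_* functions may use a different output convention):

```julia
# Sketch of a uniform tensor-product mesh on [a,b]², returned as a
# 2 × (nx*ny) matrix with one sample point per column.
function meshgrid_uniform_sketch(a, b, nx, ny=nx)
    xs = range(a, b, length=nx)
    ys = range(a, b, length=ny)
    return reduce(hcat, vec([[x, y] for x in xs, y in ys]))
end
```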

In higher-dimensional spaces, it is tempting to choose random samples. But taking independent uniform draws is not an especially effective way of covering a space – random numbers tend to clump up. For this reason, low discrepancy sequences are often a better basis for sampling than (pseudo)random draws. There are many such generators; we use a relatively simple one based on an additive recurrence with a multiplier derived from the "generalized golden ratio".

Provided: kronecker_quasirand.
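
Here is a sketch of the idea; the multiplier comes from the generalized golden ratio ϕ_d, the root of x^(d+1) = x + 1, and the provided kronecker_quasirand may differ in offsets and conventions.

```julia
# Sketch of a 2D additive-recurrence ("Kronecker") sequence using the
# generalized golden ratio. Returns a d × n matrix of points in [0,1)ᵈ.
function kronecker_quasirand_sketch(n; d=2)
    ϕd = 1.0
    for _ in 1:50
        ϕd = (1 + ϕd)^(1/(d+1))          # fixed point iteration for x^(d+1) = x+1
    end
    α = [mod(ϕd^(-k), 1.0) for k in 1:d]
    return [mod(0.5 + i*α[k], 1.0) for k in 1:d, i in 1:n]
end
```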

We will provide both random and quasi-random samplers.

Provided: qrand2d (and a plain random sampler).

Optimizers

Most of the time, if you are using standard optimizers like Levenberg-Marquardt or Newton, you should use someone else's code and save your ingenuity for problem formulation and initial guesses. In this spirit, we provide two solvers that we will see in class in a couple weeks: a Levenberg-Marquardt implementation and a Newton implementation.

We will see the Levenberg-Marquardt algorithm for nonlinear least squares as part of our regular lecture material, but we also include it here. This version uses logic for dynamically scaling the damping parameter (courtesy Nocedal and Wright).

Provided: levenberg_marquardt.

The trust-region Newton solver optimizes a quadratic model over some region in which it is deemed trustworthy. The model does not have to be convex!

Provided: solve_tr and tr_newton.

Incremental QR and forward selection

The forward selection algorithm is a factor selection process for least squares problems in which factors are added to a model in a greedy fashion in order to minimize the residual at each step.

For fast implementation, it will be convenient for us to be able to extend a QR factorization by adding additional columns to an existing factorization. This is easy enough to do in practice, but requires a little poking around in the guts of the Julia QR code.

Provided: windowed_qr (two methods).

We also will want to compute residuals for many of our algorithms.

Provided: resid_ls!, solve_ls!, and forward_selection.
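
For intuition, here is a naive sketch of forward selection that re-solves the least squares problem from scratch at every step; the provided forward_selection computes the same thing far more efficiently with incremental QR updates.

```julia
using LinearAlgebra

# Naive forward selection sketch: at each step, add the candidate column
# that most reduces the least squares residual, re-solving from scratch.
function forward_selection_naive(A, b, nsel)
    sel = Int[]
    for _ in 1:nsel
        best_j, best_r = 0, Inf
        for j in 1:size(A, 2)
            j in sel && continue
            cols = [sel; j]
            r = norm(b - A[:, cols]*(A[:, cols]\b))
            r < best_r && ((best_j, best_r) = (j, r))
        end
        push!(sel, best_j)
    end
    return sel
end
```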

Some sanity checks on a random test case (keep first 5, choose 5 more):

  • Selected indices: 1 2 3 4 5 11 12 20 17 6

  • Does A[:,I] = QR? Relerr 3.733265243763775e-16

  • Does r = b-A[:,I]*x? Relerr 5.105187043779148e-16

  • Does A[:,I]'*r = 0? Relerr 1.2869532583151618e-17


Test function

The Himmelblau function is a standard multi-modal test function used to explore the properties of optimization methods. We will take the log of the Himmelblau function (shifted up a little in order to avoid log-of-zero issues) as our running example.

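For reference, the test function and the log transform described here can be written as follows; the names are illustrative and may differ from the notebook's own definitions.

```julia
# Himmelblau's function and its log-transformed version (shifted by 10 to
# avoid log-of-zero issues, matching the transform g ↦ log(g + 10) used below).
himmelblau(x) = (x[1]^2 + x[2] - 11)^2 + (x[1] + x[2]^2 - 7)^2
g_himmelblau_log(x) = log(himmelblau(x) + 10)
```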

The Himmelblau function has a local maximum at $(-0.270845, -0.923039)$ and four global minimizers:

$$g(3.0, 2.0) = 0.0, \quad g(-2.805118, 3.131312) = 0.0, \quad g(-3.779310, -3.283186) = 0.0, \quad g(3.584428, -1.848126) = 0.0.$$

Tasks
  1. Express the Himmelblau function as a nonlinear least squares function and use the provided Levenberg-Marquardt solver to find the minimum at (3,2) from a starting guess at (2,2). Use the monitor function to produce a semilog convergence plot.

  2. Run ten steps of an (unguarded) Newton solver for the Himmelblau function. It may save you some algebra to remember the relation between the Hessian of a nonlinear least squares function and the Gauss-Newton approximation. What does it converge to from the starting guess (2,2)? Plot convergence (gradient norm vs iteration) again.

  3. Run the same experiment with the Newton solver, but this time using the trust region version that is provided.


Optimizing on a surrogate

Let's now fit a surrogate to the Himmelblau function using 50 samples drawn from our quasi-random number generator.


Does the spline have minima close to the minima of the true function? The answer depends on how "nice" the minimizers are and how good the approximation error is. Concretely, suppose $|s(x) - g(x)| < \delta$ over $x \in \Omega$ and $x_*$ is a minimizer of $g$. If $\min_{\|u\|=1} g(x_* + \rho u) - g(x_*) > 2\delta$ and the ball of radius $\rho$ around $x_*$ lies entirely inside $\Omega$, then $\min_{\|u\|=1} s(x_* + \rho u) - s(x_*) > 0$. Assuming continuity, this is enough to guarantee that $s$ has a minimizer in the ball of radius $\rho$ around $x_*$. If $g$ has a Lipschitz continuous Hessian with Lipschitz constant $M$ and $\rho < \lambda_{\min}(H_g(x_*))/M$, then a sufficient condition is that $\frac{1}{2}\rho^2(\lambda_{\min}(H_g(x_*)) - M\rho) > 2\delta$ (still assuming the ball of radius $\rho$ lies within the domain $\Omega$).

(We can get better bounds if we can control the error in the derivative approximation, but we will leave this alone in the current assignment.)


Putting aside theoretical considerations for a moment, let's try to find the minimizer (or at least a local minimizer) of the spline approximation s(x). We can make a first pass at this by sampling on a mesh and taking the point where we see the smallest function value.

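A sketch of that first pass, assuming the surrogate is wrapped as a function of a 2-vector and the mesh points are stored one per column:

```julia
# Sketch: evaluate the surrogate on a mesh (points as columns) and keep
# the point with the smallest value as a crude initial guess at a minimizer.
function best_on_mesh(s, pts)
    vals = [s(pts[:, k]) for k in 1:size(pts, 2)]
    return pts[:, argmin(vals)]
end
```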

Best point found at 3.473684210526316, -0.9473684210526315

Tasks
  1. Let $x_*$ be a strong local minimizer of $g$, where $g$ is twice continuously differentiable, the Hessian $H_g$ is Lipschitz with constant $M$ in the operator 2-norm, and $\rho < \lambda_{\min}(H_g(x_*))/M$. Using a Taylor expansion about $x_*$, show that $\frac{1}{2}\rho^2(\lambda_{\min}(H_g(x_*)) - M\rho) > 2\delta$ is sufficient to guarantee that $\min_{\|u\|=1} g(x_* + \rho u) - g(x_*) > 2\delta$.

  2. Compute the smallest singular value of the Hessian of the log-transformed Himmelblau function at $(3,2)$. Also compute the approximation error in the spline fit at $(3,2)$, use that as an approximate $\delta$, and plug it into the estimate

$$\tilde{\rho} = 2\sqrt{\frac{\delta}{\lambda_{\min}(H_g(x_*))}}$$

  3. Take ten steps of Newton to find the minimum of the spline close to the point found above. Plot the gradient norm on a semilog scale to verify quadratic convergence. What happens with a starting guess of $(3,2)$?

For the first task, you may use without proof that $\lambda_{\min}(A+E) \geq \lambda_{\min}(A) - \|E\|_2$ when $A$ and $E$ are symmetric. For the second question, remember that the log transform we used is $g \mapsto \log(g + 10)$. You should also not expect to get a tiny value for $\tilde{\rho}$!


Non-adaptive sampling strategies

If we want to use a spline as a surrogate for optimization, we need it to be sufficiently accurate, where what is "sufficient" is determined by the behavior of the Hessian near the desired minima. For the rest of this assignment, we focus on making more accurate splines.

We start by defining a highly refined "ground truth" mesh of $10^4$ points. Our figure of merit will be the root mean squared (RMS) approximation error. We also define a coarser "data mesh" that we will use for least squares spline fits.

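The RMS figure of merit is a one-liner of roughly this shape (assuming the surrogate and the ground truth are both callable on a 2-vector and mesh points are stored one per column):

```julia
# Sketch: root mean squared error of the surrogate s against the ground
# truth g over a mesh of evaluation points (one point per column).
rms_error_sketch(s, g, pts) =
    sqrt(sum((s(pts[:, k]) - g(pts[:, k]))^2 for k in 1:size(pts, 2)) /
         size(pts, 2))
```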

We would like to choose a sample set that is going to minimize the RMS error. We will talk about adapting the sample later; for now, let's consider different ways that we could lay out a set of about 50 sample points:

  • We could choose a 7-by-7 mesh (uniform or Chebyshev)

  • We could choose 50 points at random

  • We could choose 50 points from a low-discrepancy sequence (quasi-random)

Let's set up an experimental framework to compare the RMS error on the ground truth mesh for each of these approaches.

Provided: test_sample (two methods).
Tasks

Run experiments with the test_sample functions for both the standard spline fitting and least squares fitting against the 4000-point data set. Comment on anything you observe from your experiments. Feel free to play with the parameters (e.g. number of points).


Forward stepwise regression

Forward stepwise regression is a factor selection method in which we greedily choose the next factor from some set of candidates based on which candidate will most improve the solution. We have provided you with a reference implementation of the algorithm. Your task is to use this method to choose centers from a candidate set of points in order to improve your least squares fit.

Tasks

Complete the spline_forward_regression function below, then run the code with 500 points chosen from our quasirandom sequence. Run the test harness above to see how much the RMS error measure improves.

Starter code: spline_forward_regression (to be completed).

Levenberg-Marquardt refinement

In the lecture notes from 4/11, we discuss the idea of variable projection algorithms for least squares. We can phrase the problem of optimizing the centers in this way; that is, we let

$$r = (I - AA^\dagger)\bar{y}$$

where

$$A = \begin{bmatrix} K_{XU} & \Pi_X \\ \sigma I & 0 \end{bmatrix}, \qquad \bar{y} = \begin{bmatrix} y \\ 0 \end{bmatrix}$$

is treated as a function of the center locations. The regularization parameter σ plays a critical role in this particular problem, so we are careful to keep it throughout!

To save us all some trouble, I will provide the code to compute r and the Jacobian with respect to the center coordinates.

Provided: spline_rproj and spline_Jproj.
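
The residual itself is conceptually simple even if its Jacobian is not; here is a sketch of the computation (not the provided spline_rproj, and reusing the earlier sketch helpers):

```julia
using LinearAlgebra

# Sketch of the variable projection residual r = (I - A A⁺)ȳ: assemble the
# regularized least squares matrix for the current centers U, solve for the
# best coefficients, and return what is left over.
function vp_resid_sketch(ϕ, X, U, y, σ)
    m = size(U, 2)
    K = spline_K_sketch(ϕ, X, U)
    Π = spline_Π_sketch(X)
    A = [K                  Π;
         σ*Matrix(I, m, m)  zeros(m, size(Π, 2))]
    yb = [y; zeros(m)]
    return yb - A*(A\yb)
end
```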

As usual, we need a finite difference check to prevent programming errors.

  • Finite difference relerr in Jacobian computation: 1.6565625055732684e-7

Tasks

Use the provided Levenberg-Marquardt solver to refine the center locations picked by the forward selection algorithm. This is a tricky optimization, and you may need to fiddle with solver parameters to get something you are happy with. As usual, you should also provide a convergence plot.


Newton refinement

Levenberg-Marquardt is popular for nonlinear least squares problems because it only requires first derivatives and often still gives very robust convergence. But the convergence behavior for this problem is disappointing, and we have all the derivatives we need if we only have the fortitude to use them.

As in the Levenberg-Marquardt case, I will not require you to write the function evaluation and derivatives.

Provided: spline_ρproj.

Finite difference checks:

  • For the gradient: 7.256850871224047e-8

  • For the Hessian: 2.4552525270252068e-8

Tasks

Use the provided Newton solver to refine the center locations picked by one of the earlier algorithms (I recommend using the Levenberg-Marquardt output as a starting point). Give a convergence plot – do you see quadratic convergence?
