
Project 3: Adaptive Splines

Due: 2022-04-22

A polyharmonic spline is a type of function approximator for functions from $\mathbb{R}^d \to \mathbb{R}$ of the form

$$s(x) = \sum_{j=1}^m \phi(\|x - u_j\|)\, a_j + \gamma_{1:d}^T x + \gamma_{d+1}$$

where

$$\phi(\rho) = \begin{cases} \rho^k \log(\rho), & k \text{ even}, \\ \rho^k, & k \text{ odd}. \end{cases}$$

Common examples include cubic splines ($k=3$) and thin plate splines ($k=2$). We will consider cubic splines here.

We will touch on two optimization questions in this project. First, how do we use splines to help optimization? Second, how do we best choose the coefficients $a$ and $\gamma$ and the centers $\{u_j\}_{j=1}^m$?


Code setup

In the interest of making all our lives easier, we have a fair amount of starter code for this project. You are not responsible for all of it in detail, but I certainly recommend at least becoming familiar with the interfaces!


Basis functions and evaluation

As a point of comparison, we will want to look at the ordinary spline procedure (where there are as many centers as there are data points, and data is given at the centers). To start with, we need ϕ and its derivatives.

Provided: Hϕ (and related basis-function routines).
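
To make the basis concrete, here is a minimal sketch for the cubic case; the _cubic names are illustrative, and the provided routines may be organized differently.

```julia
using LinearAlgebra

# Minimal sketch of the cubic basis (k = 3) viewed as a function of the
# vector argument: ϕ(ρ) = ρ³ with ρ = ‖x‖, plus its gradient and Hessian.
# Illustrative only — the provided Hϕ and friends may differ in detail.
ϕ_cubic(x)  = norm(x)^3
∇ϕ_cubic(x) = 3*norm(x)*x                  # ∇ϕ = 3ρ x

function Hϕ_cubic(x)
    ρ = norm(x)
    ρ == 0 && return zeros(length(x), length(x))
    return 3*(x*x')/ρ + 3*ρ*I             # Hessian: 3xxᵀ/ρ + 3ρI
end
```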

When evaluating the spline, it will be convenient to refer to a and γ together in one vector, which we will call c.

Provided: spline_eval (two methods).
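
As a sketch of that convention (assuming centers are stored as columns of a d-by-m matrix U and c = [a; γ]; the provided spline_eval may organize its arguments differently):

```julia
using LinearAlgebra

# Sketch of spline evaluation with c = [a; γ]: centers are columns of U (d × m).
# Assumed layout — check the provided spline_eval for the actual interface.
function spline_eval_sketch(ϕ, U, c, x)
    d, m = size(U)
    a, γ = c[1:m], c[m+1:end]            # γ has length d+1
    s = sum(a[j]*ϕ(norm(x - U[:,j])) for j in 1:m)
    return s + dot(γ[1:d], x) + γ[d+1]
end
```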

We are frequently going to care about the residual on our test function (over many evaluation points), so we also provide small helpers for that computation and for plotting the spline and the residual.

Provided: plot_spline.

Fitting the spline

We consider two methods for fitting the spline. For both, we need to compute two matrices:

  • The kernel matrix $K_{XU}$ where $(K_{XU})_{ij} = \phi(\|x_i - u_j\|)$

  • The polynomial part $\Pi_X$ consisting of rows of the form $[x_i^T \; 1]$

Provided: spline_Π (and the kernel matrix construction).
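
As a sketch of these two constructions (points as columns of X, centers as columns of U; the _sketch names are illustrative, not the starter code's):

```julia
using LinearAlgebra

# Sketch: kernel matrix (K_XU)ᵢⱼ = ϕ(‖xᵢ - uⱼ‖) and polynomial part Π_X
# with rows [xᵢᵀ 1]. Points are columns of X (d × n), centers columns of U (d × m).
spline_K_sketch(ϕ, X, U) =
    [ϕ(norm(X[:,i] - U[:,j])) for i in 1:size(X,2), j in 1:size(U,2)]
spline_Π_sketch(X) = [X' ones(size(X,2))]
```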

Later, we will also have need for the Jacobian of $K_{XU}$ with respect to the components of $u$.

Provided: Dspline_K.

Standard spline fitting

In the standard scheme, we have one center at each data point. Fitting the spline involves solving the system

$$\begin{bmatrix} K + \sigma^2 I & \Pi \\ \Pi^T & 0 \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}.$$

The first $n$ equations are interpolation conditions. The last $d+1$ equations are sometimes called the "discrete orthogonality" conditions. When the regularization parameter $\sigma^2$ is nonzero, we have a smoothing spline.

Provided: spline_fit (two methods).
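
As a rough illustration of what this solve involves (building on the sketch helpers above, not the provided spline_fit):

```julia
using LinearAlgebra

# Sketch of the standard fit: one center per data point, solve the
# bordered system [K + σ²I  Π; Πᵀ  0][a; γ] = [y; 0].
# Uses spline_K_sketch / spline_Π_sketch from the earlier sketch.
function spline_fit_sketch(ϕ, X, y; σ=0.0)
    K = spline_K_sketch(ϕ, X, X)
    Π = spline_Π_sketch(X)
    p = size(Π, 2)                        # p = d+1
    A = [(K + σ^2*I)  Π;
         Π'           zeros(p, p)]
    return A \ [y; zeros(p)]              # c = [a; γ]
end
```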

Least squares fitting

When the number of data points is large, we may want to use a smaller number of centers as the basis for our approximation. That is, we consider an approximator of the form

$$s(x) = \sum_{j=1}^m \phi(\|x - u_j\|)\, a_j + \gamma_{1:d}^T x + \gamma_{d+1}$$

where the locations $u_j$ are decoupled from the $x_i$. We determine the coefficients $a$ and $\gamma$ by a least squares problem

$$\text{minimize } \left\| \begin{bmatrix} K_{XU} & \Pi \end{bmatrix} \begin{bmatrix} a \\ \gamma \end{bmatrix} - y \right\|^2 + \sigma^2 \|a\|^2$$

Provided: spline_fit (two more methods, for least squares fitting).
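
A bare-bones version of this regularized least squares fit might look like the sketch below (again reusing the earlier sketch helpers rather than the provided code):

```julia
using LinearAlgebra

# Sketch of the regularized least squares fit: stack the Tikhonov term σ‖a‖
# as extra rows and let backslash solve the resulting least squares problem.
function spline_fit_ls_sketch(ϕ, X, U, y; σ=0.0)
    m = size(U, 2)
    K = spline_K_sketch(ϕ, X, U)
    Π = spline_Π_sketch(X)
    A = [K                  Π;
         σ*Matrix(I, m, m)  zeros(m, size(Π, 2))]
    return A \ [y; zeros(m)]              # c = [a; γ]
end
```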

Sampling strategies

Splines are fit to sampled data, so we want a number of ways to draw samples. Our example is 2D, so we will stick to 2D samplers.

We start with samplers on a regular mesh of points $(x_i, y_j)$ (sometimes called a tensor product mesh). These get big very fast in high-dimensional spaces, but are fine in 2D.

Provided: meshgrid, meshgrid_uniform, meshgrid_cheb.
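
For concreteness, a uniform tensor-product sampler might look like this sketch (the provided meshgrid_* functions may use a different output convention):

```julia
# Sketch of a uniform tensor-product mesh on [a,b]², returned as a
# 2 × (nx*ny) matrix with one sample point per column.
function meshgrid_uniform_sketch(a, b, nx, ny=nx)
    xs = range(a, b, length=nx)
    ys = range(a, b, length=ny)
    return reduce(hcat, vec([[x, y] for x in xs, y in ys]))
end
```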

In higher-dimensional spaces, it is tempting to choose random samples. But taking independent uniform draws is not an especially effective way of covering a space – random numbers tend to clump up. For this reason, low discrepancy sequences are often a better basis for sampling than (pseudo)random draws. There are many such generators; we use a relatively simple one based on an additive recurrence with a multiplier derived from the "generalized golden ratio".

Provided: kronecker_quasirand.
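
Here is a sketch of the idea; the multiplier comes from the generalized golden ratio ϕ_d, the root of x^(d+1) = x + 1, and the provided kronecker_quasirand may differ in offsets and conventions.

```julia
# Sketch of a 2D additive-recurrence ("Kronecker") sequence using the
# generalized golden ratio. Returns a d × n matrix of points in [0,1)ᵈ.
function kronecker_quasirand_sketch(n; d=2)
    ϕd = 1.0
    for _ in 1:50
        ϕd = (1 + ϕd)^(1/(d+1))          # fixed point iteration for x^(d+1) = x+1
    end
    α = [mod(ϕd^(-k), 1.0) for k in 1:d]
    return [mod(0.5 + i*α[k], 1.0) for k in 1:d, i in 1:n]
end
```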

We will provide both random and quasi-random samplers.

Provided: qrand2d (and a plain random sampler).

Optimizers

Most of the time, if you are using standard optimizers like Levenberg-Marquardt or Newton, you should use someone else's code and save your ingenuity for problem formulation and initial guesses. In this spirit, we provide two solvers that we will see in class in a couple weeks: a Levenberg-Marquardt implementation and a Newton implementation.

We will see the Levenberg-Marquardt algorithm for nonlinear least squares as part of our regular lecture material, but we also include it here. This version uses logic for dynamically scaling the damping parameter (courtesy Nocedal and Wright).

Provided: levenberg_marquardt.

The trust-region Newton solver optimizes a quadratic model over some region in which it is deemed trustworthy. The model does not have to be convex!

Provided: solve_tr and tr_newton.

Incremental QR and forward selection

The forward selection algorithm is a factor selection process for least squares problems in which factors are added to a model in a greedy fashion in order to minimize the residual at each step.

For fast implementation, it will be convenient for us to be able to extend a QR factorization by adding additional columns to an existing factorization. This is easy enough to do in practice, but requires a little poking around in the guts of the Julia QR code.

Provided: windowed_qr (two methods).

We also will want to compute residuals for many of our algorithms.

Provided: resid_ls!, solve_ls!, and forward_selection.
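
For intuition, here is a naive sketch of forward selection that re-solves the least squares problem from scratch at every step; the provided forward_selection computes the same thing far more efficiently with incremental QR updates.

```julia
using LinearAlgebra

# Naive forward selection sketch: at each step, add the candidate column
# that most reduces the least squares residual, re-solving from scratch.
function forward_selection_naive(A, b, nsel)
    sel = Int[]
    for _ in 1:nsel
        best_j, best_r = 0, Inf
        for j in 1:size(A, 2)
            j in sel && continue
            cols = [sel; j]
            r = norm(b - A[:, cols]*(A[:, cols]\b))
            r < best_r && ((best_j, best_r) = (j, r))
        end
        push!(sel, best_j)
    end
    return sel
end
```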

Some sanity checks on a random test case (keep first 5, choose 5 more):

  • Selected indices: 1 2 3 4 5 11 12 20 17 6

  • Does A[:,I] = QR? Relerr 3.733265243763775e-16

  • Does r = b-A[:,I]*x? Relerr 5.105187043779148e-16

  • Does A[:,I]'*r = 0? Relerr 1.2869532583151618e-17


Test function

The Himmelblau function is a standard multi-modal test function used to explore the properties of optimization methods. We will take the log of the Himmelblau function (shifted up a little in order to avoid log-of-zero issues) as our running example.

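For reference, the test function and the log transform described here can be written as follows; the names are illustrative and may differ from the notebook's own definitions.

```julia
# Himmelblau's function and its log-transformed version (shifted by 10 to
# avoid log-of-zero issues, matching the transform g ↦ log(g + 10) used below).
himmelblau(x) = (x[1]^2 + x[2] - 11)^2 + (x[1] + x[2]^2 - 7)^2
g_himmelblau_log(x) = log(himmelblau(x) + 10)
```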

The Himmelblau function has a local maximum at $(-0.270845, -0.923039)$ and four global minimizers:

$$g(3.0, 2.0) = 0.0, \quad g(-2.805118, 3.131312) = 0.0, \quad g(-3.779310, -3.283186) = 0.0, \quad g(3.584428, -1.848126) = 0.0.$$

Tasks
  1. Express the Himmelblau function as a nonlinear least squares function and use the provided Levenberg-Marquardt solver to find the minimum at (3,2) from a starting guess at (2,2). Use the monitor function to produce a semilog convergence plot.

  2. Run ten steps of an (unguarded) Newton solver for the Himmelblau function. It may save you some algebra to remember the relation between the Hessian of a nonlinear least squares function and the Gauss-Newton approximation. What does it converge to from the starting guess (2,2)? Plot convergence (gradient norm vs iteration) again.

  3. Run the same experiment with the Newton solver, but this time using the trust region version that is provided.


Optimizing on a surrogate

Let's now fit a surrogate to the Himmelblau function using 50 samples drawn from our quasi-random number generator.


Does the spline have minima close to the minima of the true function? The answer depends on how "nice" the minimizers are and how good the approximation error is. Concretely, suppose $|s(x) - g(x)| < \delta$ over $x \in \Omega$ and $x_*$ is a minimizer of $g$. If $\min_{\|u\|=1} g(x_* + \rho u) - g(x_*) > 2\delta$ and the ball of radius $\rho$ around $x_*$ lies entirely inside $\Omega$, then $\min_{\|u\|=1} s(x_* + \rho u) - s(x_*) > 0$. Assuming continuity, this is enough to guarantee that $s$ has a minimizer in the ball of radius $\rho$ around $x_*$. If $g$ has a Lipschitz continuous Hessian with Lipschitz constant $M$ and $\rho < \lambda_{\min}(H_g(x_*))/M$, then a sufficient condition is that $\frac{1}{2}\rho^2(\lambda_{\min}(H_g(x_*)) - M\rho) > 2\delta$ (still assuming the ball of radius $\rho$ lies within the domain $\Omega$).

(We can get better bounds if we can control the error in the derivative approximation, but we will leave this alone in the current assignment.)


Putting aside theoretical considerations for a moment, let's try to find the minimizer (or at least a local minimizer) of the spline approximation s(x). We can make a first pass at this by sampling on a mesh and taking the point where we see the smallest function value.

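A sketch of that first pass, assuming the surrogate is wrapped as a function of a 2-vector and the mesh points are stored one per column:

```julia
# Sketch: evaluate the surrogate on a mesh (points as columns) and keep
# the point with the smallest value as a crude initial guess at a minimizer.
function best_on_mesh(s, pts)
    vals = [s(pts[:, k]) for k in 1:size(pts, 2)]
    return pts[:, argmin(vals)]
end
```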

Best point found at 3.473684210526316, -0.9473684210526315

Tasks
  1. Let $x_*$ be a strong local minimizer of $g$, where $g$ is twice continuously differentiable, the Hessian $H_g$ is Lipschitz with constant $M$ in the operator 2-norm, and $\rho < \lambda_{\min}(H_g(x_*))/M$. Using a Taylor expansion about $x_*$, show that $\frac{1}{2}\rho^2(\lambda_{\min}(H_g(x_*)) - M\rho) > 2\delta$ is sufficient to guarantee that $\min_{\|u\|=1} g(x_* + \rho u) - g(x_*) > 2\delta$.

  2. Compute the smallest singular value of the Hessian of the log-transformed Himmelblau function at $(3,2)$. Also compute the approximation error in the spline fit at $(3,2)$, use that as an approximate $\delta$, and plug it into the estimate

$$\tilde{\rho} = 2\sqrt{\frac{\delta}{\lambda_{\min}(H_g(x_*))}}$$

  3. Take ten steps of Newton to find the minimum of the spline close to the point found above. Plot the gradient norm on a semilog scale to verify quadratic convergence. What happens with a starting guess of $(3,2)$?

For the first task, you may use without proof that $\lambda_{\min}(A+E) \geq \lambda_{\min}(A) - \|E\|_2$ when $A$ and $E$ are symmetric. For the second question, remember that the log transform we used is $g \mapsto \log(g + 10)$. You should also not expect to get a tiny value for $\tilde{\rho}$!


Non-adaptive sampling strategies

If we want to use a spline as a surrogate for optimization, we need it to be sufficiently accurate, where what is "sufficient" is determined by the behavior of the Hessian near the desired minima. For the rest of this assignment, we focus on making more accurate splines.

We start by defining a highly refined "ground truth" mesh of $10^4$ points. Our figure of merit will be the root mean squared (RMS) approximation error. We also define a coarser "data mesh" that we will use for least squares spline fits.

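The RMS figure of merit is a one-liner of roughly this shape (assuming the surrogate and the ground truth are both callable on a 2-vector and mesh points are stored one per column):

```julia
# Sketch: root mean squared error of the surrogate s against the ground
# truth g over a mesh of evaluation points (one point per column).
rms_error_sketch(s, g, pts) =
    sqrt(sum((s(pts[:, k]) - g(pts[:, k]))^2 for k in 1:size(pts, 2)) /
         size(pts, 2))
```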

We would like to choose a sample set that is going to minimize the RMS error. We will talk about adapting the sample later; for now, let's consider different ways that we could lay out a set of about 50 sample points:

  • We could choose a 7-by-7 mesh (uniform or Chebyshev)

  • We could choose 50 points at random

  • We could choose 50 points from a low-discrepancy sequence (quasi-random)

Let's set up an experimental framework to compare the RMS error on the ground truth mesh for each of these approaches.

Provided: test_sample (two methods).
Tasks

Run experiments with the test_sample functions for both the standard spline fitting and least squares fitting against the 4000-point data set. Comment on anything you observe from your experiments. Feel free to play with the parameters (e.g. number of points).


Forward stepwise regression

Forward stepwise regression is a factor selection method in which we greedily choose the next factor from some set of candidates based on which candidate will most improve the solution. We have provided you with a reference implementation of the algorithm. Your task is to use this method to choose centers from a candidate set of points in order to improve your least squares fit.

Tasks

Complete the spline_forward_regression function below, then run the code with 500 points chosen from our quasirandom sequence. Run the test harness above to see how much the RMS error measure improves.

Starter code: spline_forward_regression (to be completed).

Levenberg-Marquardt refinement

In the lecture notes from 4/11, we discuss the idea of variable projection algorithms for least squares. We can phrase the problem of optimizing the centers in this way; that is, we let

$$r = (I - AA^\dagger)\bar{y}$$

where

$$A = \begin{bmatrix} K_{XU} & \Pi_X \\ \sigma I & 0 \end{bmatrix}, \qquad \bar{y} = \begin{bmatrix} y \\ 0 \end{bmatrix}$$

is treated as a function of the center locations. The regularization parameter σ plays a critical role in this particular problem, so we are careful to keep it throughout!

To save us all some trouble, I will provide the code to compute r and the Jacobian with respect to the center coordinates.

Provided: spline_rproj and spline_Jproj.
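
The residual itself is conceptually simple even if its Jacobian is not; here is a sketch of the computation (not the provided spline_rproj, and reusing the earlier sketch helpers):

```julia
using LinearAlgebra

# Sketch of the variable projection residual r = (I - A A⁺)ȳ: assemble the
# regularized least squares matrix for the current centers U, solve for the
# best coefficients, and return what is left over.
function vp_resid_sketch(ϕ, X, U, y, σ)
    m = size(U, 2)
    K = spline_K_sketch(ϕ, X, U)
    Π = spline_Π_sketch(X)
    A = [K                  Π;
         σ*Matrix(I, m, m)  zeros(m, size(Π, 2))]
    yb = [y; zeros(m)]
    return yb - A*(A\yb)
end
```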

As usual, we need a finite difference check to prevent programming errors.

  • Finite difference relerr in Jacobian computation: 1.6565625055732684e-7

Tasks

Use the provided Levenberg-Marquardt solver to refine the center locations picked by the forward selection algorithm. This is a tricky optimization, and you may need to fiddle with solver parameters to get something you are happy with. As usual, you should also provide a convergence plot.


Newton refinement

Levenberg-Marquardt is popular for nonlinear least squares problems because it only requires first derivatives and often still gives very robust convergence. But the convergence behavior for this problem is disappointing, and we have all the derivatives we need if we only have the fortitude to use them.

As in the Levenberg-Marquardt case, I will not require you to write the function evaluation and derivatives.

Provided: spline_ρproj.

Finite difference checks:

  • For the gradient: 7.256850871224047e-8

  • For the Hessian: 2.4552525270252068e-8

Tasks

Use the provided Newton solver to refine the center locations picked by one of the earlier algorithms (I recommend using the Levenberg-Marquardt output as a starting point). Give a convergence plot – do you see quadratic convergence?
