Application-level Check-pointing of MPI Code
Given an MPI message-passing program in C or FORTRAN, transform it into
an equivalent program with application-level check-pointing.
Here are some ideas for this project. Depending on how much you want
to do, this can be a two- or three-person project.
As we discussed in class, the most important parameter in coming up
with a solution is whether the number of processes before failure is equal to
the number of processes after recovery. To begin with, assume that these two
numbers are equal.
- Assume that you have application-level check-pointing for each process.
What strategy will you use to extend it to the MPI parallel program? A
simple but perhaps inefficient thing to do is to implement un-coordinated
check-pointing. That is, each process saves application-level
state whenever it wants to, and when a process fails, everyone co-operates
to find the recovery line. Make sure you save in-flight messages as
explained in class. How well does this work? Are exponential
roll-backs (the domino effect) a problem in practice?
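To make the recovery-line search concrete, here is one possible sketch, written as a sequential Python simulation rather than MPI code. The data layout is invented for illustration: each application-level checkpoint records, per peer, how many messages the process had sent and received when it saved its state. A set of checkpoints is inconsistent if some process has received a message that its peer had not yet sent at the peer's chosen checkpoint (an orphan message); rolling back to eliminate orphans can cascade, which is exactly the domino effect the question asks about.

```python
def recovery_line(histories):
    """histories[p]: process p's checkpoints, oldest first; checkpoint 0
    is the initial state. Each checkpoint is a pair (sent, recv) of
    per-peer message counts recorded when the state was saved.
    Returns, for each process, the index of the checkpoint to restart
    from: the most recent set with no orphan messages."""
    n = len(histories)
    cur = [len(h) - 1 for h in histories]     # start from the latest saves
    changed = True
    while changed:
        changed = False
        for j in range(n):
            _, recv_j = histories[j][cur[j]]
            for i in range(n):
                if i == j:
                    continue
                sent_i, _ = histories[i][cur[i]]
                # orphan: j has received a message i had not yet sent
                if recv_j[i] > sent_i[j] and cur[j] > 0:
                    cur[j] -= 1               # roll j back; may cascade
                    changed = True
                    break
    return cur
```

Note that this check handles only orphan messages; the in-flight messages (sent before the line, received after it) still have to be logged and replayed on recovery, as discussed in class.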
- We know that in practice, many message-passing programs are written in an
SPMD, "bulk synchronous" way. Roughly speaking, this means that
computation proceeds in steps; processes exchange data at the start of
each step, compute independently, and then synchronize at the end of that
step. For such programs, you can implement coordinated blocking
check-pointing by doing the check-pointing at the end of steps at the
point of synchronization. Is there an automatic way to determine if an MPI
program is bulk synchronous in this sense? If you are told that it is, can
you determine where check-points should be placed? What must be saved? For
the ab initio protein-folding code, can you determine that it is sufficient
to save the positions and velocities of all bases at the end of each
time-step?
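The placement question can be made concrete with a toy sketch of the per-process body of such a step loop, again in plain Python rather than MPI; the state names (positions, velocities, step counter) are just illustrative stand-ins for whatever is live at the barrier. The key property being exploited is that at the synchronization point no messages are in flight, so a coordinated blocking checkpoint needs no channel state at all:

```python
import pickle

def run_steps(state, nsteps, ckpt_interval, store):
    """Simulated per-process body of a bulk-synchronous step loop.
    In the real MPI code, the checkpoint call would sit at the barrier
    ending a step, where no messages are in flight, so only application
    data (here: positions, velocities, step counter) must be saved."""
    for step in range(state["step"], nsteps):
        # ... exchange boundary data, then compute independently ...
        state["pos"] = [x + v for x, v in zip(state["pos"], state["vel"])]
        state["step"] = step + 1
        if state["step"] % ckpt_interval == 0:
            # coordinated blocking checkpoint at the synchronization point
            store["ckpt"] = pickle.dumps(state)
    return state

def restore(store):
    """Recover the application-level state saved at the last barrier."""
    return pickle.loads(store["ckpt"])
```

Restarting from the last barrier checkpoint and re-running the remaining steps reproduces the uninterrupted execution, which is the correctness condition you would want your automatic placement analysis to guarantee.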
- Is there a way to do coordinated non-blocking check-pointing at the
application level? In class, we presented the Chandy-Lamport algorithm for
taking distributed snapshots. Recall that in the simplest version, a
process saved its state whenever it saw the first marker. This is fine if we
are saving system-level state, but how should things work if we want to save
application-level state? After all, a process may not be at a point where it
can save its state when it sees the first marker. Should we abandon the
notion of consistent cuts and try to recover from a cut that is not
consistent? In practice, should we use coordinated blocking check-pointing
or coordinated non-blocking check-pointing?
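For reference, here is a toy sequential simulation of the system-level Chandy-Lamport protocol (FIFO channels, state saved immediately on the first marker, invented local-state labels). The point of writing it out is to see exactly what the application-level version must reproduce, and where it breaks: the `take_snapshot` call below fires the instant the first marker arrives, which an application-level scheme cannot do if the process is not at a point where its state is savable.

```python
from collections import deque

def chandy_lamport(n, in_flight, initiator=0):
    """Toy Chandy-Lamport run over n processes with FIFO channels.
    in_flight[(i, j)] lists payload messages already in channel i->j
    when the snapshot starts. Returns the saved local states and the
    recorded channel states (the in-flight messages on each channel)."""
    chan = {(i, j): deque(in_flight.get((i, j), []))
            for i in range(n) for j in range(n) if i != j}
    saved, recording, channel_state = {}, {}, {}

    def take_snapshot(p):
        saved[p] = f"S{p}"                       # local state placeholder
        recording[p] = {i: [] for i in range(n) if i != p}
        for j in range(n):
            if j != p:
                chan[(p, j)].append("marker")    # marker follows p's own msgs

    take_snapshot(initiator)
    progress = True
    while progress:                              # deliver until channels drain
        progress = False
        for (i, j), q in chan.items():
            if not q:
                continue
            msg = q.popleft()
            progress = True
            if msg == "marker":
                if j not in saved:
                    take_snapshot(j)             # first marker: save state now
                channel_state[(i, j)] = recording[j].pop(i)
            elif j in saved and i in recording[j]:
                recording[j][i].append(msg)      # in-flight: record it
    return saved, channel_state
```

Comparing this with what is actually possible at the application level (deferring the save to the next safe point, and deciding whether the resulting cut is still usable) is the heart of this question.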
- Can you implement incremental check-pointing at the application
level? How much benefit is there to incremental check-pointing?
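One plausible starting point, sketched below with an invented interface, is to treat the application state as a dictionary of named arrays and save only the entries whose contents changed since the previous checkpoint, detecting changes by hashing. Recovery replays the deltas oldest-first. Measuring how much this saves on your target codes would answer the benefit question.

```python
import hashlib
import pickle

class IncrementalCheckpointer:
    """Application-level incremental checkpointing (a sketch): each
    checkpoint stores only the state entries whose serialized contents
    changed since the last checkpoint, detected via content hashes."""
    def __init__(self):
        self.hashes = {}          # name -> hash at last checkpoint
        self.deltas = []          # one {name: serialized value} per checkpoint

    def checkpoint(self, state):
        delta = {}
        for name, value in state.items():
            blob = pickle.dumps(value)
            h = hashlib.sha256(blob).hexdigest()
            if self.hashes.get(name) != h:
                delta[name] = blob
                self.hashes[name] = h
        self.deltas.append(delta)
        return len(delta)         # how many entries were actually saved

    def restore(self):
        state = {}
        for delta in self.deltas:  # replay deltas oldest-first
            for name, blob in delta.items():
                state[name] = pickle.loads(blob)
        return state
```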
- As we discussed in class, you do not have to save all live variables at
a check-point if you are willing to re-compute the values of some of
these variables in your recovery script. What kind of analysis is required
to implement this? This is a space-time trade-off: we are permitting
ourselves extra time during recovery to re-compute some information so we
can save less information when we take check-points. Develop a performance
model to enable you to make this trade-off intelligently.
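A first cut at such a model, with made-up parameters, might compare the total time spent writing a variable at every checkpoint against the expected time spent recomputing it during recovery, weighted by the probability of failure:

```python
def should_recompute(size_bytes, ckpt_bandwidth, n_checkpoints,
                     recompute_secs, p_fail_per_run):
    """Space-time trade-off model (a sketch; all parameters invented):
    recompute a variable on recovery, rather than save it, when the
    expected recovery cost is below the total cost of saving it."""
    save_cost = n_checkpoints * size_bytes / ckpt_bandwidth  # secs writing
    recompute_cost = p_fail_per_run * recompute_secs         # expected secs
    return recompute_cost < save_cost
```

A real model would also need per-variable recompute-time estimates from your analysis, and a failure rate measured or assumed for the target machine.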
Now assume that you must recover with a smaller number of
processes than you had before failure. What strategy would you use to
accomplish this? Among other things, you will need a sophisticated runtime
system that can remap data across processors. We can give you access to such a
runtime system, but you need to figure out how all this will work together. If
it simplifies your job, assume that programs are written according to some
programming model of your choice. What should such a model be and what do you
need in the runtime system to support this model on top of MPI?
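As a minimal illustration of what such a runtime must do, consider a 1-D array block-distributed over the old processes. The simplest conceivable remapping strategy (sketched below; a real runtime would avoid materializing the whole global array on one node) is to reconstruct the global index space from the surviving checkpoints and re-cut it into blocks for the smaller process count:

```python
def block_ranges(n_items, n_procs):
    """Contiguous block distribution: proc p owns global indices
    [start, end), with block sizes as equal as possible."""
    base, extra = divmod(n_items, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def remap(old_blocks, new_nprocs):
    """Rebuild the global array from the old per-process blocks, then
    redistribute it over the (smaller) new set of processes."""
    flat = [x for blk in old_blocks for x in blk]
    return [flat[a:b] for a, b in block_ranges(len(flat), new_nprocs)]
```

Everything beyond this, such as irregular distributions, ghost regions, and communicator reconstruction after the failure, is what the sophisticated runtime system mentioned above has to provide.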