Application-level Check-pointing of MPI Code
 Given an MPI message-passing program in C or FORTRAN, transform it into
an equivalent program with application-level check-pointing. 
 Here are some ideas for this project. Depending on how much you want to
do, this can be a 2- or 3-person project.
 As we discussed in class, the most important parameter in coming up
with a solution is whether the number of processes before failure is equal to
the number of processes after recovery. To begin with, assume that these two
numbers are equal.
  - Assume that you have application-level check-pointing for each process.
    What strategy will you use to extend it to the MPI parallel program? A
    simple but perhaps inefficient approach is to implement uncoordinated
    check-pointing: each process saves its application-level state whenever
    it wants to, and when a process fails, everyone cooperates to find the
    recovery line. Make sure you save in-flight messages as explained in
    class. How well does this work in practice? Are exponential roll-backs a
    problem in practice? (A sketch of one way to organize this appears after
    this list.)
 
  - We know that in practice, many message-passing programs are written in an
    SPMD, "bulk synchronous" way. Roughly speaking, this means that
    computation proceeds in steps; processes exchange data at the start of
    each step, compute independently, and then synchronize at the end of the
    step. For such programs, you can implement coordinated blocking
    check-pointing by taking check-points at the synchronization point at the
    end of a step. Is there an automatic way to determine whether an MPI
    program is bulk synchronous in this sense? If you are told that it is, can
    you determine where check-points should be placed? What must be saved? For
    the ab initio protein-folding code, can you determine that it is sufficient
    to save the positions and velocities of all bases at the end of each
    time-step? (A sketch of this structure appears after this list.)
 
  - Is there a way to do coordinated non-blocking check-pointing at the
    application level? In class, we presented the Chandy-Lamport algorithm for
    taking distributed snapshots. Recall that in the simplest version, a
    process saves its state when it sees the first marker. This is fine if we
    are saving system-level state, but how should things work if we want to
    save application-level state? After all, a process may not be at a point
    where it can save its state when it sees the first marker. Should we
    abandon the notion of consistent cuts and try to recover from a cut that
    is not consistent? In practice, should we use coordinated blocking
    check-pointing or coordinated non-blocking check-pointing? (A sketch of
    one possible adaptation appears after this list.)
 
  - Can you implement incremental check-pointing at the application level?
    How much benefit is there to incremental check-pointing? (A sketch
    appears after this list.)
 
  - As we discussed in class, you do not have to save all live variables at
    a check-point if you are willing to re-compute the values of some of
    these variables in your recovery script. What kind of analysis is required
    to implement this? This is a space-time trade-off: we permit ourselves
    extra time during recovery to re-compute some information so that we can
    save less information when we take check-points. Develop a performance
    model that lets you make this trade-off intelligently. (A candidate
    starting point for such a model appears after this list.)
 
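For the first item above, here is a minimal sketch of how uncoordinated
application-level check-pointing might be organized in C with MPI. It assumes
the per-process check-pointer the item postulates (called save_app_state
here); the log format, file names, and the every-100-steps trigger are
illustrative choices, not requirements. Each process logs its outgoing
messages so that messages in flight at the recovery line can be replayed.

    /* Sketch of uncoordinated application-level check-pointing.  Call
     * ckpt_init() after MPI_Init().  save_app_state() is the per-process
     * check-pointer this project assumes you already have. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE *msg_log;     /* sender-side log of potentially in-flight messages */
    static int   my_rank;

    void ckpt_init(void)
    {
        char logname[64];
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        snprintf(logname, sizeof logname, "msglog.%d", my_rank);
        msg_log = fopen(logname, "wb");
    }

    /* Wrap MPI_Send so every outgoing message is also appended to a local
     * log; messages that turn out to be in flight with respect to the
     * recovery line can then be replayed from the sender's log. */
    int logged_send(const void *buf, int count, MPI_Datatype type,
                    int dest, int tag, MPI_Comm comm)
    {
        int size;
        MPI_Type_size(type, &size);
        fwrite(&dest,  sizeof dest,  1, msg_log);
        fwrite(&tag,   sizeof tag,   1, msg_log);
        fwrite(&count, sizeof count, 1, msg_log);
        fwrite(buf, (size_t)size, (size_t)count, msg_log);
        fflush(msg_log);
        return MPI_Send(buf, count, type, dest, tag, comm);
    }

    /* Each process decides on its own when to check-point (uncoordinated).
     * The step number is recorded so the recovery line can be computed. */
    void maybe_checkpoint(long step, void (*save_app_state)(const char *))
    {
        if (step % 100 == 0) {            /* illustrative local policy */
            char fname[64];
            snprintf(fname, sizeof fname, "ckpt.%d.%ld", my_rank, step);
            save_app_state(fname);
        }
    }

A real implementation would also have to garbage-collect the log and record
enough ordering information to find the recovery line after a failure.
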
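For the bulk-synchronous item, the following sketch shows where a coordinated
blocking check-point can sit in a time-stepped SPMD code. The arrays pos and
vel, the stubs exchange_boundary() and compute_step(), and the check-point
frequency are placeholders for the real application, not part of the
assignment.

    /* Sketch of coordinated blocking check-pointing for a bulk-synchronous
     * time-stepping code. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000                 /* particles owned by this process (illustrative) */

    static double pos[3 * N], vel[3 * N];
    static long   step;

    static void exchange_boundary(void) { /* application data exchange goes here */ }
    static void compute_step(void)      { /* local computation goes here */ }

    static void checkpoint(int rank)
    {
        char fname[64];
        snprintf(fname, sizeof fname, "ckpt.%d.%ld", rank, step);
        FILE *f = fopen(fname, "wb");
        /* The bullet above asks whether positions, velocities, and the step
           counter are all that must be saved for such a code. */
        fwrite(&step, sizeof step, 1, f);
        fwrite(pos, sizeof(double), 3 * N, f);
        fwrite(vel, sizeof(double), 3 * N, f);
        fclose(f);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (step = 0; step < 10000; step++) {
            exchange_boundary();          /* exchange data at the start of the step */
            compute_step();               /* compute independently                  */
            MPI_Barrier(MPI_COMM_WORLD);  /* synchronize at the end of the step     */

            /* Every process is at the same step and, assuming all communication
               completed inside exchange_boundary(), no application messages are
               in flight, so saving here is a coordinated, blocking check-point. */
            if (step % 100 == 0)
                checkpoint(rank);
        }

        MPI_Finalize();
        return 0;
    }
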
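For the Chandy-Lamport item, one possible adaptation is to defer the save: a
process that sees its first marker merely sets a flag and takes the
check-point at the next point where its application-level state is well
defined. The sketch below uses an ordinary MPI message with a reserved tag as
the marker; the tag value, the assumed save_app_state() routine, and the
omission of channel-state recording are simplifications made here, not part
of the course material.

    /* Sketch: defer the application-level save to the next safe point after
     * the first marker arrives.  Call snapshot_init() after MPI_Init(). */
    #include <mpi.h>
    #include <stdio.h>

    #define MARKER_TAG 9999            /* illustrative tag reserved for markers */

    static int marker_seen = 0;
    static int my_rank, nprocs;

    void save_app_state(const char *fname);   /* assumed per-process check-pointer */

    void snapshot_init(void)
    {
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    }

    /* Call frequently (e.g., wherever the application receives) so that an
       incoming marker is noticed promptly. */
    void poll_for_marker(void)
    {
        int flag, dummy;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MARKER_TAG, MPI_COMM_WORLD, &flag, &st);
        if (flag) {
            MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, MARKER_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            marker_seen = 1;   /* cannot save yet: we may not be at a safe point */
        }
    }

    /* Call at points where application-level state is well defined, e.g. at
       the top of the main loop; the deferred save happens here. */
    void at_safe_point(long step)
    {
        int dummy = 0, p;
        char fname[64];
        if (!marker_seen)
            return;
        snprintf(fname, sizeof fname, "ckpt.%d.%ld", my_rank, step);
        save_app_state(fname);
        /* Forward the marker on every outgoing channel, as in Chandy-Lamport.
           Recording messages that arrive on channels whose marker has not yet
           been seen is omitted from this sketch. */
        for (p = 0; p < nprocs; p++)
            if (p != my_rank)
                MPI_Send(&dummy, 1, MPI_INT, p, MARKER_TAG, MPI_COMM_WORLD);
        marker_seen = 0;
    }

Note that messages received between the first marker and the deferred save can
cross the cut the wrong way, so the resulting cut need not be consistent; that
is exactly the question raised in the item above.
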
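For the incremental check-pointing item, a simple application-level scheme is
to track which pieces of state have changed since the last check-point and
write only those. The region table below is one illustrative way to do this;
recovery must merge the increments with the most recent full check-point.

    /* Sketch of application-level incremental check-pointing: the application
     * marks which registered regions it has modified, and only those are
     * written at the next check-point. */
    #include <stdio.h>
    #include <string.h>

    struct region {
        const char *name;    /* identifies the array in the check-point file */
        void       *addr;
        size_t      bytes;
        int         dirty;   /* set by the application when it writes the region */
    };

    void mark_dirty(struct region *r) { r->dirty = 1; }

    void incremental_checkpoint(struct region *regs, int nregs, const char *fname)
    {
        FILE *f = fopen(fname, "wb");
        for (int i = 0; i < nregs; i++) {
            if (!regs[i].dirty)
                continue;                 /* unchanged since the last check-point */
            size_t len = strlen(regs[i].name) + 1;
            fwrite(&len, sizeof len, 1, f);
            fwrite(regs[i].name, 1, len, f);
            fwrite(&regs[i].bytes, sizeof regs[i].bytes, 1, f);
            fwrite(regs[i].addr, 1, regs[i].bytes, f);
            regs[i].dirty = 0;
        }
        fclose(f);
    }
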
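For the last item, one candidate starting point for a performance model is
sketched below. The symbols are illustrative assumptions, not quantities given
in class: B is the bandwidth to stable storage, s the bytes written per
check-point if all live variables are saved, s' the bytes written if some are
re-computed instead, T_r the extra re-computation time during recovery, tau
the check-point interval, and M the mean time between failures. Then, roughly,

    \[
      \mathrm{overhead}(\tau) \;\approx\; \frac{s'/B}{\tau}
        \;+\; \frac{s'/B + T_r + \tau/2}{M}
    \]

where the first term is the cost of writing check-points and the second is the
expected cost of a failure (restoring the check-point, re-computing the dropped
variables, and redoing on average half an interval of lost work). Re-computing
pays off roughly when the reduction (s - s')/B in write time per check-point,
amortized over the interval tau, outweighs the added recovery cost T_r/M.
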
  Now assume that you must recover with a smaller number of processes than
you had before failure. What strategy would you use to accomplish this? Among
other things, you will need a sophisticated runtime system that can remap data
across processors. We can give you access to such a runtime system, but you
need to figure out how all this will work together. If it simplifies your job,
assume that programs are written according to some programming model of your
choice. What should such a model be, and what do you need in the runtime
system to support this model on top of MPI?