Application-level Check-pointing of MPI Code
 Given an MPI message-passing program in C or FORTRAN, transform it into
an equivalent program with application-level check-pointing. 
 Here are some ideas for this project. Depending on how much you want to
do, this can be a 2- or 3-person project.
 As we discussed in class, the most important parameter in coming up
with a solution is whether the number of processes before failure is equal to
the number of processes after recovery. To begin with, assume that these two
numbers are equal.
  - Assume that you have application-level check-pointing for each process.
    What strategy will you use to extend it to the MPI parallel program? A
    simple but perhaps inefficient approach is to implement uncoordinated
    check-pointing: each process saves its application-level state whenever
    it wants to, and when a process fails, everyone cooperates to find the
    recovery line. Make sure you save in-flight messages as explained in
    class. How well does this work in practice? Are exponential roll-backs a
    problem in practice? (A sketch of one way to organize this appears after
    this list.)
 
  - We know that in practice, many message-passing programs are written in an
    SPMD, "bulk synchronous" way. Roughly speaking, this means that
    computation proceeds in steps; processes exchange data at the start of
    each step, compute independently, and then synchronize at the end of the
    step. For such programs, you can implement coordinated blocking
    check-pointing by taking check-points at the synchronization point at the
    end of a step. Is there an automatic way to determine whether an MPI
    program is bulk synchronous in this sense? If you are told that it is, can
    you determine where check-points should be placed? What must be saved? For
    the ab initio protein-folding code, can you determine that it is sufficient
    to save the positions and velocities of all bases at the end of each
    time-step? (A sketch of this structure appears after this list.)
 
  - Is there a way to do coordinated non-blocking check-pointing at the
    application level? In class, we presented the Chandy-Lamport algorithm for
    taking distributed snapshots. Recall that in the simplest version, a
    process saves its state when it sees the first marker. This is fine if we
    are saving system-level state, but how should things work if we want to
    save application-level state? After all, a process may not be at a point
    where it can save its state when it sees the first marker. Should we
    abandon the notion of consistent cuts and try to recover from a cut that
    is not consistent? In practice, should we use coordinated blocking
    check-pointing or coordinated non-blocking check-pointing? (A sketch of
    one possible adaptation appears after this list.)
 
  - Can you implement incremental check-pointing at the application level?
    How much benefit is there to incremental check-pointing? (A sketch
    appears after this list.)
 
  - As we discussed in class, you do not have to save all live variables at
    a check-point if you are willing to re-compute the values of some of
    these variables in your recovery script. What kind of analysis is required
    to implement this? This is a space-time trade-off: we permit ourselves
    extra time during recovery to re-compute some information so that we can
    save less information when we take check-points. Develop a performance
    model that lets you make this trade-off intelligently. (A candidate
    starting point for such a model appears after this list.)
 
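For the first item above, here is a minimal sketch of how uncoordinated
application-level check-pointing might be organized in C with MPI. It assumes
the per-process check-pointer the item postulates (called save_app_state
here); the log format, file names, and the every-100-steps trigger are
illustrative choices, not requirements. Each process logs its outgoing
messages so that messages in flight at the recovery line can be replayed.

    /* Sketch of uncoordinated application-level check-pointing.  Call
     * ckpt_init() after MPI_Init().  save_app_state() is the per-process
     * check-pointer this project assumes you already have. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE *msg_log;     /* sender-side log of potentially in-flight messages */
    static int   my_rank;

    void ckpt_init(void)
    {
        char logname[64];
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        snprintf(logname, sizeof logname, "msglog.%d", my_rank);
        msg_log = fopen(logname, "wb");
    }

    /* Wrap MPI_Send so every outgoing message is also appended to a local
     * log; messages that turn out to be in flight with respect to the
     * recovery line can then be replayed from the sender's log. */
    int logged_send(const void *buf, int count, MPI_Datatype type,
                    int dest, int tag, MPI_Comm comm)
    {
        int size;
        MPI_Type_size(type, &size);
        fwrite(&dest,  sizeof dest,  1, msg_log);
        fwrite(&tag,   sizeof tag,   1, msg_log);
        fwrite(&count, sizeof count, 1, msg_log);
        fwrite(buf, (size_t)size, (size_t)count, msg_log);
        fflush(msg_log);
        return MPI_Send(buf, count, type, dest, tag, comm);
    }

    /* Each process decides on its own when to check-point (uncoordinated).
     * The step number is recorded so the recovery line can be computed. */
    void maybe_checkpoint(long step, void (*save_app_state)(const char *))
    {
        if (step % 100 == 0) {            /* illustrative local policy */
            char fname[64];
            snprintf(fname, sizeof fname, "ckpt.%d.%ld", my_rank, step);
            save_app_state(fname);
        }
    }

A real implementation would also have to garbage-collect the log and record
enough ordering information to find the recovery line after a failure.
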
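For the bulk-synchronous item, the following sketch shows where a coordinated
blocking check-point can sit in a time-stepped SPMD code. The arrays pos and
vel, the stubs exchange_boundary() and compute_step(), and the check-point
frequency are placeholders for the real application, not part of the
assignment.

    /* Sketch of coordinated blocking check-pointing for a bulk-synchronous
     * time-stepping code. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000                 /* particles owned by this process (illustrative) */

    static double pos[3 * N], vel[3 * N];
    static long   step;

    static void exchange_boundary(void) { /* application data exchange goes here */ }
    static void compute_step(void)      { /* local computation goes here */ }

    static void checkpoint(int rank)
    {
        char fname[64];
        snprintf(fname, sizeof fname, "ckpt.%d.%ld", rank, step);
        FILE *f = fopen(fname, "wb");
        /* The bullet above asks whether positions, velocities, and the step
           counter are all that must be saved for such a code. */
        fwrite(&step, sizeof step, 1, f);
        fwrite(pos, sizeof(double), 3 * N, f);
        fwrite(vel, sizeof(double), 3 * N, f);
        fclose(f);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (step = 0; step < 10000; step++) {
            exchange_boundary();          /* exchange data at the start of the step */
            compute_step();               /* compute independently                  */
            MPI_Barrier(MPI_COMM_WORLD);  /* synchronize at the end of the step     */

            /* Every process is at the same step and, assuming all communication
               completed inside exchange_boundary(), no application messages are
               in flight, so saving here is a coordinated, blocking check-point. */
            if (step % 100 == 0)
                checkpoint(rank);
        }

        MPI_Finalize();
        return 0;
    }
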
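For the Chandy-Lamport item, one possible adaptation is to defer the save: a
process that sees its first marker merely sets a flag and takes the
check-point at the next point where its application-level state is well
defined. The sketch below uses an ordinary MPI message with a reserved tag as
the marker; the tag value, the assumed save_app_state() routine, and the
omission of channel-state recording are simplifications made here, not part
of the course material.

    /* Sketch: defer the application-level save to the next safe point after
     * the first marker arrives.  Call snapshot_init() after MPI_Init(). */
    #include <mpi.h>
    #include <stdio.h>

    #define MARKER_TAG 9999            /* illustrative tag reserved for markers */

    static int marker_seen = 0;
    static int my_rank, nprocs;

    void save_app_state(const char *fname);   /* assumed per-process check-pointer */

    void snapshot_init(void)
    {
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    }

    /* Call frequently (e.g., wherever the application receives) so that an
       incoming marker is noticed promptly. */
    void poll_for_marker(void)
    {
        int flag, dummy;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MARKER_TAG, MPI_COMM_WORLD, &flag, &st);
        if (flag) {
            MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, MARKER_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            marker_seen = 1;   /* cannot save yet: we may not be at a safe point */
        }
    }

    /* Call at points where application-level state is well defined, e.g. at
       the top of the main loop; the deferred save happens here. */
    void at_safe_point(long step)
    {
        int dummy = 0, p;
        char fname[64];
        if (!marker_seen)
            return;
        snprintf(fname, sizeof fname, "ckpt.%d.%ld", my_rank, step);
        save_app_state(fname);
        /* Forward the marker on every outgoing channel, as in Chandy-Lamport.
           Recording messages that arrive on channels whose marker has not yet
           been seen is omitted from this sketch. */
        for (p = 0; p < nprocs; p++)
            if (p != my_rank)
                MPI_Send(&dummy, 1, MPI_INT, p, MARKER_TAG, MPI_COMM_WORLD);
        marker_seen = 0;
    }

Note that messages received between the first marker and the deferred save can
cross the cut the wrong way, so the resulting cut need not be consistent; that
is exactly the question raised in the item above.
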
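For the incremental check-pointing item, a simple application-level scheme is
to track which pieces of state have changed since the last check-point and
write only those. The region table below is one illustrative way to do this;
recovery must merge the increments with the most recent full check-point.

    /* Sketch of application-level incremental check-pointing: the application
     * marks which registered regions it has modified, and only those are
     * written at the next check-point. */
    #include <stdio.h>
    #include <string.h>

    struct region {
        const char *name;    /* identifies the array in the check-point file */
        void       *addr;
        size_t      bytes;
        int         dirty;   /* set by the application when it writes the region */
    };

    void mark_dirty(struct region *r) { r->dirty = 1; }

    void incremental_checkpoint(struct region *regs, int nregs, const char *fname)
    {
        FILE *f = fopen(fname, "wb");
        for (int i = 0; i < nregs; i++) {
            if (!regs[i].dirty)
                continue;                 /* unchanged since the last check-point */
            size_t len = strlen(regs[i].name) + 1;
            fwrite(&len, sizeof len, 1, f);
            fwrite(regs[i].name, 1, len, f);
            fwrite(&regs[i].bytes, sizeof regs[i].bytes, 1, f);
            fwrite(regs[i].addr, 1, regs[i].bytes, f);
            regs[i].dirty = 0;
        }
        fclose(f);
    }
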
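For the last item, one candidate starting point for a performance model is
sketched below. The symbols are illustrative assumptions, not quantities given
in class: B is the bandwidth to stable storage, s the bytes written per
check-point if all live variables are saved, s' the bytes written if some are
re-computed instead, T_r the extra re-computation time during recovery, tau
the check-point interval, and M the mean time between failures. Then, roughly,

    \[
      \mathrm{overhead}(\tau) \;\approx\; \frac{s'/B}{\tau}
        \;+\; \frac{s'/B + T_r + \tau/2}{M}
    \]

where the first term is the cost of writing check-points and the second is the
expected cost of a failure (restoring the check-point, re-computing the dropped
variables, and redoing on average half an interval of lost work). Re-computing
pays off roughly when the reduction (s - s')/B in write time per check-point,
amortized over the interval tau, outweighs the added recovery cost T_r/M.
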
  Now assume that you must recover with a smaller number of processes than
you had before failure. What strategy would you use to accomplish this? Among
other things, you will need a sophisticated runtime system that can remap data
across processors. We can give you access to such a runtime system, but you
need to figure out how all this will work together. If it simplifies your job,
assume that programs are written according to some programming model of your
choice. What should such a model be, and what do you need in the runtime
system to support this model on top of MPI?