Application-level Check-pointing of MPI Code

 Given an MPI message-passing program in C or FORTRAN, transform it into an equivalent program with application-level check-pointing.

 Here are some ideas for this project. Depending on how much you want to do, this can be a two- or three-person project.

 As we discussed in class, the most important parameter in coming up with a solution is whether the number of processes before failure is equal to the number of processes after recovery. To begin with, assume that these two numbers are equal.
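 For the equal-process-count case, one possible starting point is for each process to save its own live state (the loop counter plus its local data) at a point in the program where no messages are in flight, and to reload that state on restart. The following is a minimal sketch of that idea, not a complete solution. The function names, the file naming scheme `ckpt_rank<N>.bin`, and the file layout are all hypothetical; `rank` is assumed to come from `MPI_Comm_rank`, and the MPI calls appear only in the closing comment so that the helpers can be compiled and tested on their own.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical file layout: iteration counter, element count, then the data. */

/* Save this rank's state. Returns 0 on success, -1 on failure. */
int save_checkpoint(int rank, int iter, const double *data, int n)
{
    char tmp[64], final[64];
    assert(rank >= 0 && n >= 0);
    snprintf(tmp, sizeof tmp, "ckpt_rank%d.tmp", rank);
    snprintf(final, sizeof final, "ckpt_rank%d.bin", rank);

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    int ok = fwrite(&iter, sizeof iter, 1, f) == 1
          && fwrite(&n, sizeof n, 1, f) == 1
          && fwrite(data, sizeof *data, (size_t)n, f) == (size_t)n;
    if (fclose(f) != 0) ok = 0;
    if (!ok) { remove(tmp); return -1; }

    /* Write to a temporary file first, then rename, so that a crash during
       the write does not destroy the previous checkpoint. On POSIX systems
       rename() replaces an existing target atomically. */
    return rename(tmp, final) == 0 ? 0 : -1;
}

/* Load this rank's state into data[0..n-1].
   Returns the saved iteration, or -1 if there is no usable checkpoint. */
int load_checkpoint(int rank, double *data, int n)
{
    char name[64];
    assert(rank >= 0 && n >= 0);
    snprintf(name, sizeof name, "ckpt_rank%d.bin", rank);

    FILE *f = fopen(name, "rb");
    if (!f) return -1;
    int iter = 0, saved_n = 0;
    int ok = fread(&iter, sizeof iter, 1, f) == 1
          && fread(&saved_n, sizeof saved_n, 1, f) == 1
          && saved_n == n   /* same process count implies same local size */
          && fread(data, sizeof *data, (size_t)n, f) == (size_t)n;
    fclose(f);
    return ok ? iter : -1;
}

/* Intended use inside the MPI program (sketch only, K is a hypothetical
 * checkpoint interval):
 *
 *   int start = load_checkpoint(rank, data, n) + 1;   // 0 on a fresh run
 *   for (int iter = start; iter < max_iter; iter++) {
 *       ... compute and communicate ...
 *       if (iter % K == 0) {
 *           MPI_Barrier(MPI_COMM_WORLD);   // all ranks reach the same point
 *           save_checkpoint(rank, iter, data, n);
 *       }
 *   }
 */
```

 Note what the sketch leaves open: if a failure occurs while some ranks have written checkpoint k and others have not, the ranks disagree about where to restart, so a real design needs some way to agree on the latest iteration that every rank saved. Deciding which variables count as live state is also left to the transformation you design.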

 Now assume that you must recover with fewer processes than you had before failure. What strategy would you use to accomplish this? Among other things, you will need a sophisticated runtime system that can remap data across processors. We can give you access to such a runtime system, but you need to figure out how all of this will work together. If it simplifies your job, assume that programs are written to conform to a programming model of your choice. What should that model be, and what does the runtime system need in order to support it on top of MPI?
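 To make the remapping question concrete, here is a small sketch under one assumed programming model: a single global array of n elements, block-distributed over the processes, with each old process having checkpointed its own block. Given that assumption, a new process can compute which index range it owns under the smaller process count and which old ranks' checkpoints hold that data. The function names are hypothetical, and this is not the interface of the runtime system mentioned above; it only illustrates the kind of bookkeeping such a runtime would have to do.

```c
#include <assert.h>

/* Half-open range [lo, hi) of global indices owned by `rank` when n elements
   are block-distributed over p processes. The first n % p ranks each get one
   extra element. */
void block_range(long n, int p, int rank, long *lo, long *hi)
{
    assert(n >= 0 && p > 0 && rank >= 0 && rank < p);
    long base = n / p, extra = n % p;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Rank that owns global index i under the same distribution. */
int block_owner(long n, int p, long i)
{
    assert(p > 0 && i >= 0 && i < n);
    long base = n / p, extra = n % p;
    long cut = extra * (base + 1);  /* indices below cut lie in the larger blocks */
    return (int)(i < cut ? i / (base + 1) : extra + (i - cut) / base);
}

/* For a new rank, find the inclusive range [*first, *last] of old ranks whose
   checkpoints contain data it now owns. Returns 0 if the new rank owns no
   elements, 1 otherwise. */
int old_ranks_needed(long n, int old_p, int new_p, int new_rank,
                     int *first, int *last)
{
    long lo, hi;
    block_range(n, new_p, new_rank, &lo, &hi);
    if (lo == hi) return 0;
    *first = block_owner(n, old_p, lo);
    *last  = block_owner(n, old_p, hi - 1);
    return 1;
}
```

 For example, with 10 elements saved by 4 processes and recovery on 3, new rank 0 owns indices 0 through 3 and would need the checkpoints of old ranks 0 and 1. Irregular or pointer-based data structures do not fit this model, which is part of why the choice of programming model matters.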