Application-level check-pointing of sequential programs

Given a sequential program in C or FORTRAN, transform it into an equivalent program with application-level check-pointing. You should be able to save the state of the computation in a system-independent way so that, after a failure, you can restart the computation on a different machine. Depending on how ambitious you want to be, this can be a one- or two-person project.
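
As one illustration of what "system-independent" can mean in practice, the sketch below stores 64-bit values in a fixed byte order so that a check-point written on one machine can be read on another with a different native byte order. The function names are hypothetical, and the code assumes that double is a 64-bit IEEE 754 value with the same byte order as 64-bit integers on every machine involved; a real project would have to check or relax that assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value as 8 bytes, most significant byte first,
   regardless of the host's native byte order. */
void put_u64_be(unsigned char out[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Rebuild a 64-bit value from 8 bytes written by put_u64_be. */
uint64_t get_u64_be(const unsigned char in[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | in[i];
    return v;
}

/* Assumption: double is a 64-bit IEEE 754 value on the writing and the
   reading machine. The bit pattern is copied, not converted. */
void put_double_be(unsigned char out[8], double d)
{
    uint64_t bits;
    assert(sizeof bits == sizeof d);
    memcpy(&bits, &d, sizeof bits);
    put_u64_be(out, bits);
}

double get_double_be(const unsigned char in[8])
{
    uint64_t bits = get_u64_be(in);
    double d;
    assert(sizeof bits == sizeof d);
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Existing encodings such as XDR or a plain decimal text format are alternatives to a hand-written byte layout like this one; which is preferable is part of the design question.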

You can do this project in three stages.

  1. Assume that the programmer specifies where check-points are to be taken, and what variables are live at each check-point. Generate the recovery script and weave it into the text of the program. What format will you use to save information in an application-independent manner?
  2. Assume that the programmer specifies only where check-points are to be taken. Using live variable analysis, determine which variables are live at each check-point. Since the analysis is not exact in general, you will have to estimate what is live in a conservative way. How accurate is your analysis? Once you have done the analysis, you can use your implementation from step 1 to actually do the check-pointing.
  3. Assume the programmer specifies nothing, and that the compiler has to figure out where check-points should be inserted. How often should you take check-points? What run-time information may be useful for optimizing this entire process?
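
The live variable analysis mentioned in stage 2 can be sketched as the standard backward data-flow iteration over a control-flow graph. The version below is a minimal illustration, not a full compiler pass: the struct layout, the limit of 32 variables per bit mask, and the limit of two successors per block are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SUCC 2

/* One basic block of a control-flow graph. Variables are numbered 0..31
   and a set of variables is a 32-bit mask (hypothetical representation). */
struct block {
    uint32_t use;        /* variables read before any write in the block */
    uint32_t def;        /* variables written in the block */
    int succ[MAX_SUCC];  /* indices of successor blocks, -1 if unused */
    uint32_t live_in;    /* result: variables live on entry */
    uint32_t live_out;   /* result: variables live on exit */
};

/* Iterate the equations
       live_out[b] = union of live_in[s] over all successors s of b
       live_in[b]  = use[b] | (live_out[b] & ~def[b])
   until no set changes. */
void compute_liveness(struct block *blocks, int n)
{
    int changed = 1;

    for (int b = 0; b < n; b++)
        blocks[b].live_in = blocks[b].live_out = 0;

    while (changed) {
        changed = 0;
        /* Visiting blocks in reverse order tends to converge faster
           for a backward analysis, but any order is correct. */
        for (int b = n - 1; b >= 0; b--) {
            uint32_t out = 0;
            for (int k = 0; k < MAX_SUCC; k++) {
                int s = blocks[b].succ[k];
                if (s >= 0) {
                    assert(s < n);
                    out |= blocks[s].live_in;
                }
            }
            uint32_t in = blocks[b].use | (out & ~blocks[b].def);
            if (in != blocks[b].live_in || out != blocks[b].live_out) {
                blocks[b].live_in = in;
                blocks[b].live_out = out;
                changed = 1;
            }
        }
    }
}
```

For a check-point placed at the entry of a block, live_in of that block is the set to save. Over-approximating this set is safe, since saving an extra variable costs space but not correctness; in real C programs, pointers and aliasing are a main reason the estimate has to be conservative.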

There are many enhancements you can make to this basic scheme. Here are some possibilities.

Measure of success: you should be able to take, for example, an ab initio code for protein folding and determine automatically that it is enough to save the positions and velocities of all bases at each time-step. Look at other bench-marks to determine similar measures of success.
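
To make the target concrete, here is a hedged sketch of what the transformed time-stepping code might look like once the tool has decided that positions, velocities, and the step counter are the only state to save. Everything in it is hypothetical: the struct, the function names, the one-dimensional toy update rule, and the decimal text format. It assumes IEEE 754 doubles, for which 17 significant digits are enough to read back the exact value.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical simulation state: n particles in one dimension. */
struct state {
    long step;
    int n;
    double *pos;
    double *vel;
};

/* Write the state as decimal text. Returns 0 on success, -1 on error. */
int save_checkpoint(const char *path, const struct state *s)
{
    FILE *f;
    assert(s && s->pos && s->vel);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld %d\n", s->step, s->n);
    for (int i = 0; i < s->n; i++)
        fprintf(f, "%.17g %.17g\n", s->pos[i], s->vel[i]);
    return fclose(f) == 0 ? 0 : -1;
}

/* Read a check-point into a state whose arrays already hold s->n slots. */
int load_checkpoint(const char *path, struct state *s)
{
    FILE *f;
    int n, ok;
    assert(s && s->pos && s->vel);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = fscanf(f, "%ld %d", &s->step, &n) == 2 && n == s->n;
    for (int i = 0; ok && i < n; i++)
        ok = fscanf(f, "%lf %lf", &s->pos[i], &s->vel[i]) == 2;
    fclose(f);
    return ok ? 0 : -1;
}

/* Toy update rule standing in for the real physics. */
void advance(struct state *s, double dt)
{
    for (int i = 0; i < s->n; i++) {
        s->pos[i] += dt * s->vel[i];
        s->vel[i] -= dt * s->pos[i];
    }
    s->step++;
}

/* Advance until s->step reaches last_step, saving after every step. */
int run(struct state *s, long last_step, double dt, const char *path)
{
    while (s->step < last_step) {
        advance(s, dt);
        if (save_checkpoint(path, s) != 0)
            return -1;
    }
    return 0;
}
```

A restart then amounts to calling load_checkpoint on a freshly allocated state and calling run again with the same last_step. A more careful version would write to a temporary file and rename it, so that a failure during the save cannot destroy the previous check-point.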