Application-level check-pointing of sequential programs

Given a sequential program in C or FORTRAN, transform it into an equivalent program with application-level check-pointing. You should be able to save the state of the computation in a system-independent way so that after failure, you can restart the computation on a different machine. Depending on how ambitious you want to be, this can be a 1 or 2 person project.

You can do this project in three stages.

Assume that the programmer specifies where check-points are to be taken, and what variables are live at each check-point. Generate the recovery script and weave it into the text of the program. What format will you use to save information in an application-independent manner?
Assume that the programmer only specifies where check-points are to be taken. Using live variable analysis, figure out what variables are live at each check-point. Since analysis is not exact in general, you will have to estimate what is live in a conservative way. How accurate is your analysis? Once you have done the analysis, you can use your implementation from step 1 to actually do the check-pointing.
Assume the programmer specifies nothing, and that the compiler has to figure out where check-points should be inserted. How often should you take check-points? What run-time information may be useful for optimizing this entire process?

There are many enhancements you can make to this basic scheme. Here are some possibilities.

Can you implement incremental check-pointing at the application level? Intuitively, this requires taking "finite differences" of live variable information from one check-point to the next. How much benefit is there to incremental check-pointing?
As we discussed in class, you do not have to save all live variables at a check-point if you are willing to re-compute the values of some of these variables in your recovery script. What kind of analysis is required to implement this? This is a space-time trade-off - we are permitting ourselves extra time during recovery to re-compute some information so we can save less information when we take check-points. Develop a performance model to enable you to make this trade-off intelligently.

Measure of success: you should be able to take an ab initio code for protein folding for example, and determine automatically that it is enough to save the positions and velocities of all bases at each time-step. Look at other bench-marks to determine similar measures of success.