Application-level check-pointing of sequential programs 
  Given a sequential program in C or FORTRAN, transform it into an equivalent
  program with application-level check-pointing. You should be able to save the
  state of the computation in a system-independent way so that after a failure,
  you can restart the computation on a different machine. Depending on how
  ambitious you want to be, this can be a one- or two-person project.
  
  You can do this project in three stages. 
  - Assume that the programmer specifies where check-points are to be
    taken, and what variables are live at each check-point. Generate
    the code that saves this state, together with the recovery script, and
    weave both into the text of the program. What format will you use to save
    information in a system-independent manner?
 
  - Assume that the programmer specifies only where check-points are to be
    taken. Using live variable analysis, figure out what variables are live
    at each check-point. Since the analysis is not exact in general, you will
    have to estimate what is live in a conservative way. How accurate is your
    analysis? Once you have done the analysis, you can use your
    implementation from stage 1 to actually do the check-pointing.
 
  - Assume the programmer specifies nothing, and that the compiler has
    to figure out where check-points should be inserted. How often should you
    take check-points? What run-time information may be useful for optimizing
    this entire process?
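
For the first stage, one possible shape of the woven-in code is sketched below
in C. Everything in it is an assumption made for illustration: the text record
format, the helper names (`ckpt_save_ll`, `ckpt_load_double`, and so on), the
toy summation loop, and the `stop_at` parameter that simulates a failure. Text
records sidestep byte-order and word-size differences between machines, and the
C99 `%a` conversion writes doubles in hexadecimal so that they read back
exactly.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record format, one record per live variable:
 *   <name> <type> <count>\n<value> <value> ...\n                      */
static int ckpt_save_ll(FILE *f, const char *name, const long long *v, size_t n) {
    if (fprintf(f, "%s i %zu\n", name, n) < 0) return -1;
    for (size_t k = 0; k < n; k++)
        if (fprintf(f, "%lld ", v[k]) < 0) return -1;
    return fputc('\n', f) == EOF ? -1 : 0;
}

static int ckpt_save_double(FILE *f, const char *name, const double *v, size_t n) {
    if (fprintf(f, "%s d %zu\n", name, n) < 0) return -1;
    for (size_t k = 0; k < n; k++)
        if (fprintf(f, "%a ", v[k]) < 0) return -1;   /* exact hex form */
    return fputc('\n', f) == EOF ? -1 : 0;
}

/* Read a record header and check it against what the recovery code expects. */
static int ckpt_expect(FILE *f, const char *name, char type, size_t n) {
    char got_name[64], got_type;
    size_t got_n;
    if (fscanf(f, "%63s %c %zu", got_name, &got_type, &got_n) != 3) return -1;
    return (strcmp(got_name, name) == 0 && got_type == type && got_n == n) ? 0 : -1;
}

static int ckpt_load_ll(FILE *f, const char *name, long long *v, size_t n) {
    if (ckpt_expect(f, name, 'i', n) != 0) return -1;
    for (size_t k = 0; k < n; k++)
        if (fscanf(f, "%lld", &v[k]) != 1) return -1;
    return 0;
}

static int ckpt_load_double(FILE *f, const char *name, double *v, size_t n) {
    if (ckpt_expect(f, name, 'd', n) != 0) return -1;
    for (size_t k = 0; k < n; k++)
        if (fscanf(f, "%lf", &v[k]) != 1) return -1;
    return 0;
}

/* Toy computation: sum of 1/(i+1) for i in [0, n). The programmer has said
 * that i and sum are live at the check-point at the top of the loop body.
 * stop_at simulates a failure by returning early at that iteration. */
static double run(const char *path, int resume, long long n, long long stop_at) {
    long long i = 0;
    double sum = 0.0;
    FILE *f;
    if (resume && (f = fopen(path, "r")) != NULL) {   /* recovery code */
        if (ckpt_load_ll(f, "i", &i, 1) != 0 ||
            ckpt_load_double(f, "sum", &sum, 1) != 0) {
            i = 0; sum = 0.0;          /* unreadable check-point: start over */
        }
        fclose(f);
    }
    for (; i < n; i++) {
        if (i % 100 == 0 && (f = fopen(path, "w")) != NULL) {  /* check-point */
            ckpt_save_ll(f, "i", &i, 1);
            ckpt_save_double(f, "sum", &sum, 1);
            fclose(f);
        }
        if (i == stop_at) return sum;  /* simulated failure */
        sum += 1.0 / (double)(i + 1);
    }
    return sum;
}
```

Calling `run(path, 0, 1000, 250)` followed by `run(path, 1, 1000, -1)` should
give the same result as an uninterrupted `run(path, 0, 1000, -1)`, because the
recovery path restores `i` and `sum` from the last check-point and re-executes
from there. A more careful version would write to a temporary file and rename
it, so that a failure during check-pointing cannot destroy the previous
check-point.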
 
There are many enhancements you can make to this basic scheme. Here are some
possibilities.
  - Can you implement incremental check-pointing at the application
    level? Intuitively, this requires taking "finite differences" of
    live variable information from one check-point to the next. How much benefit
    is there to incremental check-pointing? 
 
  - As we discussed in class, you do not have to save all live variables at
    a check-point if you are willing to re-compute the values of some of
    these variables in your recovery script. What kind of analysis is required
    to implement this? This is a space-time trade-off: we permit ourselves
    extra time during recovery to re-compute some information so that we
    can save less information when we take check-points. Develop a performance
    model to enable you to make this trade-off intelligently.
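
As a rough illustration of the incremental idea, the C sketch below takes the
"finite difference" at run time: it keeps a shadow copy of the state as of the
previous check-point and writes out only the blocks that differ. The names
`ckpt_write_delta` and `ckpt_apply_delta`, the `BLOCK` size, and the text
format are all hypothetical. A compiler-based scheme might instead determine
statically which variables can be written between two check-points.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK ((size_t)4)   /* elements per block; hypothetical tuning knob */

/* Write the blocks of cur that differ from shadow (the state at the previous
 * check-point) as lines "<block index> <values...>", then bring shadow up to
 * date. Returns the number of blocks written. */
static size_t ckpt_write_delta(FILE *f, const double *cur, double *shadow, size_t n) {
    size_t written = 0;
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? n - start : BLOCK;
        if (memcmp(cur + start, shadow + start, len * sizeof *cur) == 0)
            continue;                          /* block unchanged: skip it */
        fprintf(f, "%zu", start / BLOCK);
        for (size_t j = 0; j < len; j++)
            fprintf(f, " %a", cur[start + j]);
        fputc('\n', f);
        memcpy(shadow + start, cur + start, len * sizeof *cur);
        written++;
    }
    return written;
}

/* Apply one delta to state, which must hold the previous check-point.
 * Returns 0 on success and -1 on a malformed delta. */
static int ckpt_apply_delta(FILE *f, double *state, size_t n) {
    size_t blk;
    while (fscanf(f, "%zu", &blk) == 1) {
        size_t start = blk * BLOCK;
        if (start >= n) return -1;
        size_t len = (n - start < BLOCK) ? n - start : BLOCK;
        for (size_t j = 0; j < len; j++)
            if (fscanf(f, "%lf", &state[start + j]) != 1) return -1;
    }
    return 0;
}
```

Recovery under this scheme replays the deltas in order on top of the last full
check-point. The shadow copy doubles the memory footprint and the comparison
touches every element, so any benefit comes from reduced I/O when few blocks
change between check-points.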
 
Measure of success: you should be able to take, for example, an ab initio code
for protein folding and determine automatically that it is enough to save the
positions and velocities of all atoms at each time-step. Look at other
benchmarks to determine similar measures of success.
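
To make this measure concrete, here is a small C sketch of backward
live-variable analysis over a hand-built control-flow graph for a toy
time-stepping loop. The graph, the variable names, and the helper
`md_checkpoint_set` are assumptions for illustration, not taken from any real
folding code; the time step is treated as a compile-time constant, and nothing
is assumed live after the loop exits. Each array is treated as a single
variable, which is one of the conservative approximations a real analysis
would also have to make.

```c
/* One bit per program variable. */
enum { POS = 1 << 0, VEL = 1 << 1, FORCE = 1 << 2, ENERGY = 1 << 3, STEP = 1 << 4 };

typedef struct {
    unsigned use;    /* variables read before being written in this node */
    unsigned def;    /* variables written in this node */
    int succ[2];     /* successor node indices, -1 if absent */
} Node;

/* Iterate the standard dataflow equations to a fixed point:
 *   live_out[i] = union of live_in[s] over successors s of i
 *   live_in[i]  = use[i] | (live_out[i] & ~def[i])                      */
static void liveness(const Node *cfg, int n, unsigned *live_in, unsigned *live_out) {
    int changed = 1;
    for (int i = 0; i < n; i++) live_in[i] = live_out[i] = 0;
    while (changed) {
        changed = 0;
        for (int i = n - 1; i >= 0; i--) {     /* reverse order converges faster */
            unsigned out = 0;
            for (int s = 0; s < 2; s++)
                if (cfg[i].succ[s] >= 0) out |= live_in[cfg[i].succ[s]];
            unsigned in = cfg[i].use | (out & ~cfg[i].def);
            if (in != live_in[i] || out != live_out[i]) changed = 1;
            live_in[i] = in;
            live_out[i] = out;
        }
    }
}

/* Toy loop body:
 *   0: force  = F(pos)
 *   1: vel    = vel + force * dt
 *   2: pos    = pos + vel * dt
 *   3: energy = E(pos, vel); print energy
 *   4: step   = step + 1; branch back to 0 or exit                      */
static const Node md_loop[5] = {
    { POS,         FORCE,  { 1, -1 } },
    { VEL | FORCE, VEL,    { 2, -1 } },
    { POS | VEL,   POS,    { 3, -1 } },
    { POS | VEL,   ENERGY, { 4, -1 } },
    { STEP,        STEP,   { 0, -1 } },
};

/* Variables live at a check-point placed at the top of the loop body. */
static unsigned md_checkpoint_set(void) {
    unsigned in[5], out[5];
    liveness(md_loop, 5, in, out);
    return in[0];
}
```

In this toy model the analysis reports positions, velocities, and the step
counter as live at the top of the loop, while forces and energy are dead there
because each is recomputed before its next use.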