Application-level check-pointing of sequential programs
Given a sequential program in C or FORTRAN, transform it into an equivalent
program with application-level check-pointing. You should be able to save the
state of the computation in a system-independent way so that after failure,
you can restart the computation on a different machine. Depending on how
ambitious you want to be, this can be a 1 or 2 person project.
You can do this project in three stages.
- Assume that the programmer specifies where check-points are to be
taken, and what variables are live at each check-point. Generate
the recovery script and weave it into the text of the program. What format
will you use to save information in an application-independent manner?
- Assume that the programmer only specifies where check-points are to be
taken. Using live variable analysis, figure out what variables are live
at each check-point. Since analysis is not exact in general, you will have
to estimate what is live in a conservative way. How accurate is your
analysis? Once you have done the analysis, you can use your
implementation from step 1 to actually do the check-pointing.
- Assume the programmer specifies nothing, and that the compiler has
to figure out where check-points should be inserted. How often should you
take check-points? What run-time information may be useful for optimizing
this entire process?
There are many enhancements you can make to this basic scheme. Here are some
possibilities.
- Can you implement incremental check-pointing at the application
level? Intuitively, this requires taking "finite differences" of
live variable information from one check-point to the next. How much benefit
is there to incremental check-pointing?
- As we discussed in class, you do not have to save all live variables at
a check-point if you are willing to re-compute the values of some of
these variables in your recovery script. What kind of analysis is required
to implement this? This is a space-time trade-off - we are permitting
ourselves extra time during recovery to re-compute some information so we
can save less information when we take check-points. Develop a performance
model to enable you to make this trade-off intelligently.
Measure of success: you should be able to take an ab initio code
for protein folding for example, and determine automatically that it is
enough to save the positions and velocities of all bases at each time-step. Look
at other bench-marks to determine similar measures of success.