Our research group has implemented a system for making message-passing programs resilient to hardware faults. The approach we have taken is application-level co-ordinated checkpointing: the source code of the application is modified so that each process can save its own state at certain key points during execution. To co-ordinate the taking of checkpoints across the whole system, we have implemented a thin co-ordination layer that sits between the application program and the MPI layer. This layer implements a new co-ordination protocol developed by our group; it works by intercepting all MPI calls made by the application and performing its book-keeping functions before passing the calls on to the underlying MPI implementation. We have an implementation that can handle most features of MPI, and we believe we can extend it to handle all of MPI.
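The interception idea can be pictured with a small stand-in sketch. The names below (FakeMPI, CoordinationLayer, send) are hypothetical and are not real MPI bindings; the point is only the structure: the co-ordination layer records its book-keeping state for every call before delegating to the underlying communication layer.

```python
class FakeMPI:
    """Stand-in for the underlying MPI implementation."""
    def __init__(self):
        self.delivered = []

    def send(self, dest, msg):
        self.delivered.append((dest, msg))


class CoordinationLayer:
    """Intercepts every call, updates protocol book-keeping, then delegates."""
    def __init__(self, lower):
        self.lower = lower
        self.sent_count = 0   # messages sent since the last checkpoint
        self.log = []         # protocol book-keeping, e.g. for replay

    def send(self, dest, msg):
        self.sent_count += 1               # book-keeping before the real call
        self.log.append(("send", dest, msg))
        self.lower.send(dest, msg)         # pass the call on to the MPI layer


mpi = FakeMPI()
layer = CoordinationLayer(mpi)
layer.send(1, "hello")
layer.send(2, "world")
print(layer.sent_count)   # 2
print(mpi.delivered)      # [(1, 'hello'), (2, 'world')]
```

In a real MPI implementation this layering is typically achieved through the MPI profiling interface, which lets a library define `MPI_Send` and forward to `PMPI_Send`, so the application needs only relinking, not rewriting.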
Although distributed-memory processing is by far the most commonly used parallel programming paradigm, hardware shared-memory processing is becoming increasingly important. Each node of a distributed-memory computer is sometimes itself a 4-way or 8-way hardware shared-memory parallel machine, and programmers have started to take advantage of this hardware by writing programs that mix MPI with shared-memory constructs. How would you use application-level checkpointing in this context? How would you integrate it with the distributed-memory protocol we already have? We suggest that you proceed in the following steps.
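Before working through those steps, the following toy sketch may help fix the hybrid setting in mind. Plain Python threads stand in for shared-memory workers within one process, and all names are hypothetical; it shows one natural building block: quiescing all threads of a process at a barrier so that exactly one thread writes a consistent process-level checkpoint, which could then take part in the distributed-memory protocol.

```python
import threading

NUM_THREADS = 4
state = [0] * NUM_THREADS   # per-thread shared-memory state
checkpoints = []            # process-level checkpoints taken


def take_checkpoint():
    # Runs in exactly one thread, once all threads have reached the barrier,
    # so the snapshot of shared state is consistent.
    checkpoints.append(list(state))


barrier = threading.Barrier(NUM_THREADS, action=take_checkpoint)


def worker(tid):
    state[tid] = tid * 10   # do some "computation"
    barrier.wait()          # quiesce: no thread proceeds until the
                            # process-level checkpoint has been written


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(checkpoints)   # [[0, 10, 20, 30]]
```

The barrier plays the role inside a node that the co-ordination protocol plays across nodes: it ensures no thread's activity straddles the checkpoint.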