Our research group has implemented a system for making message-passing programs resilient to hardware faults. The approach we have taken is application-level co-ordinated checkpointing: the source code of the application is modified so that each process can save its own state at certain key points during execution. To co-ordinate the taking of checkpoints across the whole system, we have implemented a thin co-ordination layer that sits between the application program and the MPI layer. This layer implements a new co-ordination protocol developed by our group; it works by intercepting all MPI calls made by the application and performing its book-keeping functions before passing the calls on to the underlying MPI implementation. We have an implementation that can handle most features of MPI, and we believe we can extend it to handle all of MPI.
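The interception idea can be pictured with a small stand-in sketch. The names below (FakeMPI, CoordinationLayer, send) are hypothetical and are not real MPI bindings; the point is only the structure: the co-ordination layer records its book-keeping state for every call before delegating to the underlying communication layer.

```python
class FakeMPI:
    """Stand-in for the underlying MPI implementation."""
    def __init__(self):
        self.delivered = []

    def send(self, dest, msg):
        self.delivered.append((dest, msg))


class CoordinationLayer:
    """Intercepts every call, updates protocol book-keeping, then delegates."""
    def __init__(self, lower):
        self.lower = lower
        self.sent_count = 0   # messages sent since the last checkpoint
        self.log = []         # protocol book-keeping, e.g. for replay

    def send(self, dest, msg):
        self.sent_count += 1               # book-keeping before the real call
        self.log.append(("send", dest, msg))
        self.lower.send(dest, msg)         # pass the call on to the MPI layer


mpi = FakeMPI()
layer = CoordinationLayer(mpi)
layer.send(1, "hello")
layer.send(2, "world")
print(layer.sent_count)   # 2
print(mpi.delivered)      # [(1, 'hello'), (2, 'world')]
```

In a real MPI implementation this layering is typically achieved through the MPI profiling interface, which lets a library define `MPI_Send` and forward to `PMPI_Send`, so the application needs only relinking, not rewriting.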
Although distributed-memory processing is by far the most commonly used parallel programming paradigm, hardware shared-memory processing is becoming increasingly important. Each node of a distributed-memory computer is sometimes itself a 4-way or 8-way hardware shared-memory parallel machine, and programmers have started to take advantage of this hardware by writing programs that mix MPI with shared-memory constructs. How would you use application-level checkpointing in this context? How would you integrate it with the distributed-memory protocol we already have? We suggest that you proceed in the following steps.
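Before working through those steps, the following toy sketch may help fix the hybrid setting in mind. Plain Python threads stand in for shared-memory workers within one process, and all names are hypothetical; it shows one natural building block: quiescing all threads of a process at a barrier so that exactly one thread writes a consistent process-level checkpoint, which could then take part in the distributed-memory protocol.

```python
import threading

NUM_THREADS = 4
state = [0] * NUM_THREADS   # per-thread shared-memory state
checkpoints = []            # process-level checkpoints taken


def take_checkpoint():
    # Runs in exactly one thread, once all threads have reached the barrier,
    # so the snapshot of shared state is consistent.
    checkpoints.append(list(state))


barrier = threading.Barrier(NUM_THREADS, action=take_checkpoint)


def worker(tid):
    state[tid] = tid * 10   # do some "computation"
    barrier.wait()          # quiesce: no thread proceeds until the
                            # process-level checkpoint has been written


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(checkpoints)   # [[0, 10, 20, 30]]
```

The barrier plays the role inside a node that the co-ordination protocol plays across nodes: it ensures no thread's activity straddles the checkpoint.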