Checkpointing and fault-tolerance with Egida

Project Description

The most popular communication interface standard for parallel applications is MPI. MPI provides functionality for point-to-point as well as collective communication. Supercomputer vendors provide MPI with their systems, and versions are also available for clusters of PCs (both UNIX and Windows). Thousands, if not tens of thousands, of applications have been written using this standard.

Many of these applications need to run either for a very long time (days or even months at a time) or in loosely coupled computing environments (e.g., networks of workstations). In both cases, "failures" can be expected to occur while an application is running. These can be failures in the usual sense, like hardware or network failures, or simply events that cannot be anticipated. For instance, the owner of a desktop computer in a network of workstations should be allowed to kick a running application off their computer.

Methods for Checkpoint/Restart and Message-logging have been developed to make applications fault-tolerant. At a very simplistic level, these techniques can be divided into two sets of concerns,

The Egida system, developed at the University of Texas at Austin, is a version of MPI (based on MPICH) that has been modified to include checkpointing and message-logging functionality. Egida is notable for two features. First, it is intended to be transparent to the user: simply by recompiling their application, the user is able to make it fault-tolerant. Second, it is set up as a "toolbox" of modules that can be composed to develop novel and application-specific checkpointing or message-logging protocols.
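
To make the combination of checkpointing and message logging concrete, here is a minimal single-process sketch in Python. It is not Egida's API and does not use MPI. The names Process, LoggedProcess, receive, take_checkpoint and crash_and_recover are hypothetical, and the in-memory list stands in for stable storage. The sketch assumes the process is deterministic, i.e. its state depends only on the sequence of messages delivered to it, so recovery can restore the last checkpoint and replay the messages logged since then.

```python
class Process:
    """Toy deterministic process: state depends only on delivered messages."""

    def __init__(self):
        self.state = 0
        self.received = 0  # number of messages delivered so far

    def deliver(self, msg):
        self.state = self.state * 31 + msg
        self.received += 1


class LoggedProcess:
    """Hypothetical wrapper that adds checkpointing and receiver-side logging."""

    def __init__(self):
        self.proc = Process()
        self.checkpoint = (0, 0)  # (state, received) at the last checkpoint
        self.log = []             # messages received since the last checkpoint

    def receive(self, msg):
        self.log.append(msg)      # log first, then deliver
        self.proc.deliver(msg)

    def take_checkpoint(self):
        self.checkpoint = (self.proc.state, self.proc.received)
        self.log.clear()          # older messages are no longer needed

    def crash_and_recover(self):
        # Simulate losing the process: rebuild it from the checkpoint,
        # then replay the logged messages in their original order.
        self.proc = Process()
        self.proc.state, self.proc.received = self.checkpoint
        for msg in self.log:
            self.proc.deliver(msg)


if __name__ == "__main__":
    p = LoggedProcess()
    for m in (1, 2, 3):
        p.receive(m)
    p.take_checkpoint()
    for m in (4, 5):
        p.receive(m)
    before = p.proc.state
    p.crash_and_recover()
    print(p.proc.state == before)
```

In a real protocol the log and checkpoint would live on storage that survives the failure, and the choice of what to log, where, and when is exactly the kind of decision that separate composable modules would make.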

Project Goals

There are two goals to this project.

The first is to learn and understand the Egida toolkit, use it to build and test existing checkpointing protocols, and, ideally, gain insights into better protocols for transparent checkpointing.

The second is to investigate how exposing the checkpointing functionality to the application could improve performance relative to transparent checkpointing.
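
As an illustration of what application-exposed checkpointing can look like, the following Python sketch has the application itself decide what to save and when. It is only a sketch under stated assumptions: it uses neither Egida nor MPI, the functions save_checkpoint, load_checkpoint and run are hypothetical names, and the fail_at parameter exists only to simulate a crash. The point it tries to show is that the application knows its minimal restart state (here a step counter and a running total), which can be far smaller than the full process image a transparent scheme would have to save.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every few steps.

    fail_at (hypothetical, for demonstration) raises an error at the
    start of the given step to simulate a failure.
    """
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        # The application picks the checkpoint points and the saved data.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

A run that fails partway can simply be started again with the same path: work done after the last checkpoint is repeated, and everything before it is skipped.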

What you need to do

References