Using Load-balancing during Recovery

Problem Description

In almost all of our discussions of parallel checkpointing and recovery, we have assumed that the number of processors at checkpoint-time and recovery-time are the same. In other words, when a failed processors are always replaced. Perhaps a more realistic assumption is that the failed processors are not replaced. This raises the question of where the state of the failed processors should be recovered. An obvious solution is to recover the state of failed processors on some of the remaining processors. This could be done by creating additional processor or threads on these processors and initializing them with the checkpoints of the failed processors. The disadvantage of this approach is that unless something is done, some of the remaining processors will now have twice (or more) the amount of work as other processors. In this case, the lightly loaded processors will perform better than the heavily loaded processors.

In the parallel programming literature, this situation is referred to as load imbalance. Load balancing techniques have been developed to dynamically migrate work between processors in order to ensure that the work load is evenly distributed among the processors. One can imagine running a load balancing system after the recovery procedure in order to distribute the work of the failed processors among the remaining nodes. 

Project Goals

The goal of this project is to incorporate checkpoint/recovery functionality into, PREMA, an existing load balancing system. Ideally, programs written for PREMA could adapt to the failure of some small number of processors simply by using your augmented version of PREMA.

What you need to do

References