Using Load-balancing during Recovery

Problem Description

In almost all of our discussions of parallel checkpointing and recovery, we have assumed that the number of processors at checkpoint time and at recovery time is the same. In other words, failed processors are always replaced. Perhaps a more realistic assumption is that the failed processors are not replaced. This raises the question of where the state of the failed processors should be recovered. An obvious solution is to recover the state of the failed processors on some of the remaining processors. This could be done by creating additional processes or threads on these processors and initializing them with the checkpoints of the failed processors. The disadvantage of this approach is that, unless something is done, some of the remaining processors will now have twice (or more) the amount of work of the others. In this case, the lightly loaded processors will finish their share of the work sooner than the heavily loaded processors, and the computation as a whole will typically run only as fast as its most heavily loaded processor.
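The imbalance described above can be made concrete with a small sketch. It is only an illustration: the function names `recover_naively` and `imbalance`, the uniform starting loads, and the choice of max-over-mean as the imbalance measure are all assumptions made for this example, not part of PREMA.

```python
def recover_naively(loads, failed, host):
    """Restart every failed processor's work on one surviving host.

    loads  -- dict mapping processor id to its amount of work
    failed -- set of processor ids that were lost and not replaced
    host   -- surviving processor that receives all recovered work
    """
    survivors = {p: w for p, w in loads.items() if p not in failed}
    for p in failed:
        survivors[host] += loads[p]
    return survivors


def imbalance(loads):
    """Ratio of the maximum load to the mean load (1.0 is perfectly balanced)."""
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) / mean


# Hypothetical example: four equally loaded processors, processor 3 fails.
loads = {0: 10, 1: 10, 2: 10, 3: 10}
after = recover_naively(loads, failed={3}, host=0)
print(after)             # processor 0 now carries twice the work of 1 and 2
print(imbalance(after))
```

Under these assumptions, processor 0 ends up with 20 units of work while processors 1 and 2 keep 10 each, which is the "twice the amount of work" case from the text.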

In the parallel programming literature, this situation is referred to as load imbalance. Load-balancing techniques have been developed to dynamically migrate work between processors so that the workload is evenly distributed among them. One can imagine running a load-balancing system after the recovery procedure in order to distribute the work of the failed processors among the remaining nodes.
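As a rough sketch of what distributing the failed processors' work could look like, the following places each orphaned work unit on whichever survivor is currently least loaded. This is a simple static greedy heuristic chosen for illustration; it is not the dynamic migration scheme of any particular load-balancing system, and the name `redistribute` and the per-unit costs are hypothetical.

```python
import heapq


def redistribute(survivor_units, orphaned_units):
    """Assign orphaned work units to the least-loaded surviving processor.

    survivor_units -- dict mapping processor id to a list of unit costs
    orphaned_units -- list of unit costs recovered from failed processors
    Returns a new dict with the orphaned units added to the survivors.
    """
    placement = {p: list(units) for p, units in survivor_units.items()}
    # Min-heap of (current load, processor id) so the lightest is popped first.
    heap = [(sum(units), p) for p, units in placement.items()]
    heapq.heapify(heap)
    # Place the largest units first, which tends to even out the final loads.
    for cost in sorted(orphaned_units, reverse=True):
        load, p = heapq.heappop(heap)
        placement[p].append(cost)
        heapq.heappush(heap, (load + cost, p))
    return placement


# Hypothetical example: processor 3 failed, leaving two units of cost 5.
survivors = {0: [5, 5], 1: [5, 5], 2: [5, 5]}
orphans = [5, 5]
placement = redistribute(survivors, orphans)
new_loads = {p: sum(units) for p, units in placement.items()}
print(new_loads)
```

Here the heaviest survivor carries 15 units of work rather than the 20 it would carry if one processor absorbed everything. A real system would also have to account for migration cost and for work that changes while the program runs.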

Project Goals

The goal of this project is to incorporate checkpoint/recovery functionality into PREMA, an existing load balancing system. Ideally, programs written for PREMA could adapt to the failure of some small number of processors simply by using your augmented version of PREMA.

What you need to do

Read the papers listed below.
Implement checkpointing and recovery of mobile objects within PREMA.
Devise and implement a recovery strategy that will restore checkpointed mobile objects and rebalance the computation.
Write some programs that demonstrate the features of your system and run some experiments to measure your system's performance.
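To give a feel for the checkpoint and recovery steps in the list above, here is a toy sketch of the round trip: serialize a set of mobile objects, then restore them with the objects of failed processors reassigned to survivors. Everything in it is an assumption for illustration. The `MobileObject` class, the `checkpoint` and `recover` functions, and the JSON format are hypothetical and do not reflect PREMA's actual interface, which is described in the papers listed below.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MobileObject:
    """Hypothetical stand-in for a mobile object: an id, an owner, some state."""
    object_id: int
    owner: int
    data: list = field(default_factory=list)


def checkpoint(objects):
    """Serialize all mobile objects to a string (a stand-in for stable storage)."""
    return json.dumps([asdict(obj) for obj in objects])


def recover(blob, failed, survivors):
    """Restore objects, giving those owned by failed processors to survivors."""
    objects = [MobileObject(**record) for record in json.loads(blob)]
    # Count how many objects each survivor already owns.
    counts = {p: 0 for p in survivors}
    for obj in objects:
        if obj.owner not in failed:
            counts[obj.owner] += 1
    # Hand each orphaned object to the survivor that owns the fewest objects.
    for obj in objects:
        if obj.owner in failed:
            target = min(counts, key=lambda p: (counts[p], p))
            obj.owner = target
            counts[target] += 1
    return objects


# Hypothetical example: eight objects spread over four processors; 3 fails.
objs = [MobileObject(i, owner=i % 4, data=[i]) for i in range(8)]
blob = checkpoint(objs)
restored = recover(blob, failed={3}, survivors=[0, 1, 2])
print([(obj.object_id, obj.owner) for obj in restored])
```

This sketch balances by object count only. A realistic strategy would weigh objects by their expected computation and would have to deal with messages in flight at the time of the failure.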

References

Experiences from Integrating Algorithmic and Systemic Load Balancing Strategies. I. Banicescu, S. Ghafoor, V. Velusamy, S. Russ, M. Bilderback. Concurrency and Computation: Practice and Experience, John Wiley & Sons, Ltd., Vol. 13(2), pp. 121-139, February 2001.
Hectiling: An Integration of Fine and Coarse-Grained Load-Balancing Strategies. S. Russ, I. Banicescu, S. Ghafoor, B. Janapareddi, J. Robinson, R. Lu. In Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing (HPDC'98), pp. 106-113, Chicago, July 1998.
Data Movement and Control Substrate for Parallel Adaptive Applications. Jeff Dobbelaere, Kevin Barker, Nikos Chrisochoides, Demian Nave, Keshav Pingali. Submitted to Concurrency Practice and Experience, March 2001.
Mobile Object Layer: A Runtime Substrate for Parallel Adaptive and Irregular Computations. Nikos Chrisochoides, Kevin Barker, Demian Nave, Chris Hawblitzel. Advances in Engineering Software, Vol. 31(8-9), pp. 621-637, August 2000.