Handling Byzantine Failures Transparently

Project Description

Many scientific applications require extremely long running times. Developers expect the hardware to fail before the computation finishes. Thus far, most of the research energy has looked into the "Fail-Stop" model where a faulty processor simply stops working. The fail-stop can be a result of many circumstances, such as network failures, scheduling issues, or real failure. In any case, it is immediately detectable when such a failure occurs.

Byzantine failures on the other hand occur when a processor returns incorrect or malicious data. Detecting this error is harder than the fail-stop model -- at least one other processor must do the same computation to confirm the results. Some approaches to the fail-stop model may be adaptable to the Byzantine failure model. One such approach is computing checksums, used to assist rollback in the fail-stop model.

What isn't clear yet, is just how much validation is required to determine whether a Byzantine failure has occurred. With three-processors running the same computation, as long as two of them are in agreement, no roll-back is necessary. However, as more processors are devoted to the same computation, the running time increases therefore increasing the likelihood of a hardware failure. If only two processors run the same computation, the question turns to which one is correct, and how far back do we have to rollback?

Project Goals

The goal of this project is to gain insight into the many methods for handling Byzantine failures. The first and foremost problem is efficiently detecting and isolating a Byzantine failure, immediately followed by the best method for recovering from the failure. The methods to be investigated involve the use of checksums, transparently put into the compiled program. The programs considered here should already be parallelized. The failure model is simply any time when a processor returns an incorrect result -- either maliciously or "accidentally".

What you need to do

Read the papers on fail-stop fault tolerance.
Taking an already parallel version of MM and MVM, hand instrument the code with techniques for detecting and recovering from Byzantine failures. Some techniques include:
- Pairing up the processors and using checksums to verify they are in synch, and recovering by starting computation over with three processors.
- Devoting three processors to a computation, and continuing computation as long as two processors agree on a checksum.
Design the interface for determining a point at which processors should check each others checksums.
Conjecture on how this can be implemented in a compiler using program analysis, and evaluate the overhead.

References

"Fault-Tolerant High-Performance Matrix Multiplication", J. Gunnels, and D. Katz.
"Fault Tolerant Matrix Operations Using Checksum and Reverse Computation", Y. Kim, J. Plank, and J. Dongarra.
"Fault-Tolerant Execution of Computationally and Storage Intensive Parallel Programs Over A Network Of Workstations: A Case Study", J. Smith, and S. Shrivastava.