Handling Byzantine Failures Transparently

Project Description

Many scientific applications require extremely long running times.  Developers expect the hardware to fail before the computation finishes.  Thus far, most of the research energy has looked into the "Fail-Stop" model where a faulty processor simply stops working.  The fail-stop can be a result of many circumstances, such as network failures, scheduling issues, or real failure.  In any case, it is immediately detectable when such a failure occurs.

Byzantine failures on the other hand occur when a processor returns incorrect or malicious data.  Detecting this error is harder than the fail-stop model -- at least one other processor must do the same computation to confirm the results.  Some approaches to the fail-stop model may be adaptable to the Byzantine failure model.  One such approach is computing checksums, used to assist rollback in the fail-stop model.

What isn't clear yet, is just how much validation is required to determine whether a Byzantine failure has occurred.  With three-processors running the same computation, as long as two of them are in agreement, no roll-back is necessary.  However, as more processors are devoted to the same computation, the running time increases therefore increasing the likelihood of a hardware failure.  If only two processors run the same computation, the question turns to which one is correct, and how far back do we have to rollback?

Project Goals

The goal of this project is to gain insight into the many methods for handling Byzantine failures.  The first and foremost problem is efficiently detecting and isolating a Byzantine failure, immediately followed by the best method for recovering from the failure.  The methods to be investigated involve the use of checksums, transparently put into the compiled program.  The programs considered here should already be parallelized.  The failure model is simply any time when a processor returns an incorrect result -- either maliciously or "accidentally".

What you need to do

References