The Byzantine Generals Problem

L. Lamport, R. Shostak and M. Pease. ACM Transactions on Programming Languages and Systems, 4(3):382-401, July 1982

Notes by Xun Wilson Huang
January 01, 2002

Overview

This is a classic paper in the distributed algorithms literature that is very well written and easy to understand. It is mostly self-contained; here I outline its main ideas and results.

The paper first presents a deceptively simple problem, the Byzantine Generals Problem, and proves that it is not solvable if 1/3 or more of the generals are traitors. It then presents an algorithm with oral messages that solves the problem when fewer than 1/3 of the generals are traitors. With unforgeable signatures, the problem can be solved with an arbitrary number of traitors. Finally, the authors analyze how to apply the solutions to build reliable computer systems.

Byzantine Generals Problem (BGP)

A commanding general must send an order to his n-1 lieutenant generals such that

All loyal lieutenants obey the same order
If the commanding general is loyal, every loyal lieutenant obeys the order he sends.

A special case of BGP, the 3-general problem with 1 traitor, is not solvable, because a loyal lieutenant cannot tell who the traitor is when he gets conflicting information from the commander and the other lieutenant.
In general, there is no solution with fewer than 3m+1 generals that copes with m traitors. The proof is by reduction to the 3-general problem, with each of the three generals simulating at most m of the Byzantine generals.
Reaching approximate agreement is as hard as reaching exact agreement.
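
To make the 3m+1 bound concrete, here is a tiny Python sketch. The function names are my own, not the paper's; it only restates the arithmetic of the bound for the oral-message case.

```python
def min_generals(m: int) -> int:
    """Smallest number of generals that can cope with m traitors (oral messages)."""
    return 3 * m + 1


def max_traitors(n: int) -> int:
    """Largest m such that n >= 3m + 1, i.e. n > 3m."""
    return (n - 1) // 3


for n in (3, 4, 6, 7, 10):
    print(f"n={n}: tolerates up to {max_traitors(n)} traitor(s)")
```

With 3 generals the answer is 0 traitors, which is the unsolvable 3-general case above; 4 generals is the smallest group that tolerates 1 traitor.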

A solution with oral messages for n > 3m

With n > 3m (where n is the total number of generals and m is the number of traitors), a solution with oral messages exists for BGP. An oral message system satisfies the following conditions:

Every message that is sent is delivered correctly. -> No message loss.
The receiver of a message knows who sent it. -> Completely connected network with reliable links (due to 1).
The absence of a message can be detected. -> Synchronous system only.

The algorithm is cool and fun to read, so read it. Of course, the proof is fun as well. But I think it's more important to just know the result: BGP is solvable if n > 3m.
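
For a feel of the recursion before reading the paper, here is a minimal Python sketch of the oral-message algorithm OM(m) as I understand it: the commander sends his value, each lieutenant relays what he received to the others using OM(m-1), and each lieutenant takes the majority. The names om, send and majority, the use of integer ids, and the particular way a traitor lies (by receiver parity) are my own assumptions for illustration, not taken from the paper.

```python
from collections import Counter

RETREAT, ATTACK = "retreat", "attack"


def majority(values):
    """Strict majority value if one exists, otherwise the default RETREAT."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return RETREAT
    return counts[0][0]


def send(sender, receiver, value, traitors):
    """A loyal sender relays faithfully; a traitor lies (hypothetical strategy)."""
    if sender in traitors:
        return ATTACK if receiver % 2 == 0 else RETREAT
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run OM(m) and return {lieutenant: value that lieutenant decides on}."""
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Each lieutenant j acts as commander in OM(m-1) to relay what he received.
    relayed = {}
    for j in lieutenants:
        others = [l for l in lieutenants if l != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)
    # Each lieutenant i combines his own value with what the others relayed.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions


# 4 generals, 1 traitorous lieutenant (n > 3m holds).
print(om(1, 0, [1, 2, 3], ATTACK, traitors={3}))
# 3 generals, 1 traitorous lieutenant (n > 3m fails).
print(om(1, 0, [1, 2], ATTACK, traitors={2}))
```

In the first call the loyal lieutenants 1 and 2 both decide on attack. In the second call the loyal lieutenant 1 sees one attack and one retreat and falls back to the default, so he does not obey the loyal commander; this is the 3-general failure in miniature. Values decided by traitors are computed too, but they carry no meaning.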

Unforgeable Signatures

The difficulty of the 3-general problem lies in a traitorous lieutenant's ability to lie about the commander's order. If we restrict this ability by making the following assumptions, the 3-general problem becomes solvable with any number of traitors.

A loyal general's signature cannot be forged, and any alteration of his signed messages can be detected. -> a traitor can drop a message, but can't change it
Anyone can verify the authenticity of a signature. -> no one can fool a general

A sketch of the algorithm for n generals that tolerates any number of traitors is:

The commander sends a signed order to the lieutenants.
If a lieutenant receives an order from someone (either from the commander directly or from another lieutenant), he puts it in a set V if it is not already there, and relays the order if it carries fewer than m signatures.
Everyone halts at round m+2 and uses choice(V) as the desired action.

The goal of the algorithm is to make all loyal lieutenants end up with the same set V, so that choice(V) is the same for all of them. If the commander is loyal, the algorithm works because all loyal lieutenants have the correct order by round 1 and, by unforgeability, no other orders can be produced. If the commander is not loyal, then because there are only m traitors, running the algorithm to round m+1 ensures that at least one loyal lieutenant gets the order before round m, and that loyal lieutenant will send it to all the others. The key is that if one loyal lieutenant gets an order, all loyal lieutenants will get it in the next round.
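
The signed-message sketch can be simulated in a few lines of Python. This is a simplified model under my own assumptions, not the paper's text: a signature chain is just a tuple of general ids that nobody can alter, the commander is general 0, traitorous lieutenants simply stay silent, and the simulation runs until no messages are left instead of counting rounds. The names sm, choice and commander_orders are hypothetical.

```python
def choice(orders):
    """Deterministic choice: the single order seen, else the default 'retreat'."""
    return next(iter(orders)) if len(orders) == 1 else "retreat"


def sm(m, n, traitors, commander_orders):
    """
    Simulate the signed-message algorithm for n generals (commander is id 0).

    commander_orders maps each lieutenant to the order the commander signs and
    sends him; a loyal commander sends everyone the same order.
    Returns {loyal lieutenant: chosen action}.
    """
    lieutenants = range(1, n)
    V = {i: set() for i in lieutenants}
    # A message is (order, signers); signers starts with the commander's id.
    inbox = [(i, (order, (0,))) for i, order in commander_orders.items()]
    while inbox:
        outbox = []
        for receiver, (order, signers) in inbox:
            # Assumed traitor behaviour: silent. Loyal ones ignore known orders.
            if receiver in traitors or order in V[receiver]:
                continue
            V[receiver].add(order)
            # Relay only if fewer than m lieutenants have signed so far.
            if len(signers) - 1 < m:
                chain = signers + (receiver,)
                outbox += [(j, (order, chain)) for j in lieutenants if j not in chain]
        inbox = outbox
    return {i: choice(V[i]) for i in lieutenants if i not in traitors}


# Traitorous commander tells lieutenant 1 to attack and lieutenant 2 to retreat.
print(sm(m=1, n=3, traitors={0}, commander_orders={1: "attack", 2: "retreat"}))
```

In this run both lieutenants end up with V = {attack, retreat}, so both apply the same choice(V) and agree, even though there are only 3 generals. A fuller model would also let traitors collude and relay late, which is where the m+1 rounds of relaying matter.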

How about missing communication paths?

The previous algorithms assume a completely connected network; this assumption can be relaxed. The paper gives an algorithm for BGP with m traitors when the graph of generals is 3m-regular.

How to use the solution of BGP to build real life systems?

Why is BGP important? The approach to reliable systems is to use redundancy to protect against the failure, either random or malicious, of individual parts. The mirroring conditions for a reliable system are to guarantee that:

All non-faulty processors must use the same input value
If the input unit is non-faulty, then all non-faulty processors use the value it provides

A BGP solution can be applied to satisfy these two conditions. A faulty input device may generate meaningless inputs, but BGP guarantees that all non-faulty processors use the same meaningless value.

Three solutions for BGP are presented, but they are stated in terms of Byzantine generals instead of computer systems. Are the assumptions made for these three solutions valid, or even reasonable, for real-life computer systems?

No message loss. In real life, link failures occur. However, a link failure is indistinguishable from a processor failure, so we can count link failures among the m faults. The signed-message algorithm is insensitive to link failures because no message can be forged even if links fail.
The network is completely connected, so we know the sender of a message. This is not necessary if messages are unforgeable.
Message loss can be detected. In an asynchronous system, this condition cannot be satisfied. However, in an asynchronous system, no deterministic algorithm can tolerate even a single failure. So do we really care about asynchronous systems?
Signing a message has 2 aspects:
- If a processor is non-faulty, then no faulty processor can generate its S(M). Strictly, this cannot be guaranteed, because a faulty processor can just store a valid S(M) and re-send it whenever it wishes. A way to circumvent this is to include a timestamp or sequence number so that re-sending does no good.
- Given M and X, anyone can verify whether X == S(M). This is doable in the real world.
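
The sequence-number idea can be sketched as follows. To stay inside the Python standard library, this sketch uses an HMAC with a shared key as a stand-in for S(M); that is my own simplification, and it does not give the "anyone can verify but only the owner can sign" property, for which a real system would use a public-key signature scheme. KEY, sign and Verifier are hypothetical names.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical shared secret, for illustration only


def sign(message: bytes, seq: int) -> bytes:
    """Stand-in for S(M): a tag covering the message and its sequence number."""
    return hmac.new(KEY, seq.to_bytes(8, "big") + message, hashlib.sha256).digest()


class Verifier:
    """Accepts a signed message only if its sequence number is new."""

    def __init__(self):
        self.last_seq = -1

    def accept(self, message: bytes, seq: int, tag: bytes) -> bool:
        if not hmac.compare_digest(tag, sign(message, seq)):
            return False  # message or sequence number was altered
        if seq <= self.last_seq:
            return False  # an old signed message is being re-sent
        self.last_seq = seq
        return True


v = Verifier()
tag = sign(b"attack", 0)
print(v.accept(b"attack", 0, tag))  # first delivery
print(v.accept(b"attack", 0, tag))  # same signed message re-sent
```

The first delivery is accepted and the re-sent copy is rejected, because the sequence number is covered by the tag and has already been used.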

Conclusion

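BGP works but is inherently expensive, especially in terms of the number of messages: the algorithms may send up to (n-1)(n-2)...(n-m-1) messages, which blows up quickly as m grows. Optimizations exist, but they do not help much.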
It's a trade-off between performance and reliability. If we can accept less reliability by making more (reasonable) assumptions about failure types, we can reduce the cost. If we can't, too bad.

General Critique and Questions

How reliable is reliable? What types of assumptions can we make about computer systems?
How many synchronous systems are there? (SMP machines, and?) How about asynchronous systems? Can we really have fault tolerance in such systems? So are the BGP algorithms useful at all?
About connectivity: how many network topologies are p-regular?