The Byzantine Generals Problem
L. Lamport, R. Shostak and M. Pease. ACM Transactions on Programming Languages and
Systems, 4(3):382-401, July 1982
Notes by Xun Wilson Huang
January 01, 2002
Overview
This is a classic paper in the distributed algorithms literature. It is very well written,
easy to understand, and mostly self-contained, so here I only outline its main ideas and
results.
The paper first presents a deceptively simple problem, the Byzantine Generals Problem, and
proves that it is not solvable with oral messages if 1/3 or more of the generals are
traitors. It then presents an algorithm with oral messages that solves the problem when
fewer than 1/3 of the generals are traitors. With unforgeable signatures, the problem can be
solved with an arbitrary number of traitors. Finally, the authors discuss how to apply the
solutions of the problem to build reliable computer systems.
Byzantine Generals Problem (BGP)
A commanding general must send an order to his n-1 lieutenant generals such that
- All loyal lieutenants obey the same order.
- If the commanding general is loyal, every loyal lieutenant obeys the order he sends.
- A special case of BGP, the 3-general problem with 1 traitor, is
not solvable, because a loyal lieutenant cannot tell who the traitor is when he
gets conflicting information from the commander and the other lieutenant.
- In general, there is no solution with fewer than 3m+1 generals that copes with m traitors.
The proof is by reduction to the 3-general problem, with each of the three generals
simulating at most m of the generals in the larger problem.
- Reaching approximate agreement is as hard as reaching exact agreement.
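The bound can be read in both directions: how many traitors a given number of generals can cope with, and how many generals are needed for a given number of traitors. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not from the paper.

```python
def max_traitors(n: int) -> int:
    """Largest m with n > 3m: the most traitors that n generals can cope with."""
    return (n - 1) // 3


def min_generals(m: int) -> int:
    """Smallest n with n > 3m: the fewest generals needed to cope with m traitors."""
    return 3 * m + 1
```

For example, 3 generals cannot cope with even one traitor, while 4 generals can cope with exactly one.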
A solution with oral messages for n > 3m
With n > 3m (where n is the total number of generals and m is the maximum number of
traitors), a solution with oral messages exists for BGP. An oral message system satisfies
the following conditions:
- Every message that is sent is delivered correctly. -> No message loss.
- The receiver of a message knows who sent it. -> Completely connected network
with reliable links (due to the first condition).
- The absence of a message can be detected. -> Synchronous system only.
The algorithm is cool and fun to read, so do read it. Of course, the proof is fun as well.
But I think it's more important to just know the result: with oral messages, BGP is
solvable if n > 3m.
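For readers who want the shape of the oral-message algorithm without opening the paper, here is a small simulation sketch of its recursive structure: the commander sends a value to every lieutenant, each lieutenant relays what it received to the others using the same algorithm with one fewer level of recursion, and each lieutenant takes the majority. The traitor model in `send` (a traitor flips the order for odd-numbered receivers) and all function names are assumptions made for this sketch, not something the paper prescribes.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"  # RETREAT doubles as the default order


def majority(values):
    """Return the strict-majority value, or RETREAT if there is none."""
    if not values:
        return RETREAT
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else RETREAT


def send(sender, receiver, value, traitors):
    """Hypothetical traitor model: a traitor flips the order for odd receivers."""
    if sender in traitors and receiver % 2 == 1:
        return ATTACK if value == RETREAT else RETREAT
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run the oral-message algorithm with recursion depth m.

    Returns {lieutenant: value it settles on for this commander}.
    Entries for traitorous lieutenants are computed but carry no meaning.
    """
    received = {i: send(commander, i, value, traitors) for i in lieutenants}
    if m == 0:
        return received
    # Each lieutenant j acts as commander and relays its value to the others.
    relayed = {
        j: om(m - 1, j, [i for i in lieutenants if i != j], received[j], traitors)
        for j in lieutenants
    }
    # Each lieutenant combines its own value with what the others relayed to it.
    return {
        i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }
```

With 4 generals and lieutenant 3 as the traitor, `om(1, 0, [1, 2, 3], ATTACK, {3})` leaves loyal lieutenants 1 and 2 both on attack. With only 3 generals and lieutenant 2 as the traitor, `om(1, 0, [1, 2], ATTACK, {2})` leaves loyal lieutenant 1 on retreat even though the loyal commander ordered attack, which is the failure the n > 3m bound rules out.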
Unforgeable Signatures
The difficulty of the 3-general problem lies in the ability of a traitorous lieutenant to
lie about the commander's order. If we restrict this ability by making the
following assumptions, BGP becomes solvable with any number of traitors.
- A loyal general's signature cannot be forged, and any alteration of his signed messages
can be detected. -> A traitor can drop a message, but can't change it.
- Anyone can verify the authenticity of a signature. -> No one can fool a general.
A sketch of the algorithm for n generals that tolerates up to m traitors, for any m,
is:
- The commander sends a signed order to every lieutenant.
- When a lieutenant receives an order from someone (either from the commander directly or
from another lieutenant), he puts it in a set V if it is not already there. If the order is
new and fewer than m lieutenants have signed it, he adds his own signature and relays it to
the lieutenants who have not signed it yet.
- Everyone halts at round m+2 (after m+1 rounds of messages) and uses choice(V) as the
desired action.
The goal of the algorithm is to make all loyal lieutenants end up with the same set V, so
that choice(V) is the same for all of them. If the commander is loyal, the algorithm works
because all loyal lieutenants have the correct order by round 1 and, by unforgeability, no
other order can be produced. If the commander is not loyal, at most m-1 lieutenants are
traitors. An order that reaches a loyal lieutenant with fewer than m lieutenant signatures
is relayed by him to all others, and an order that already carries m lieutenant signatures
must have been signed by at least one loyal lieutenant, who has already sent it to all
others. The key is that if one loyal lieutenant gets an order, all loyal lieutenants will
have it by the next round.
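The relay rule and the round limit above can be sketched as a small synchronous simulation. This is a sketch under stated assumptions, not the paper's formulation: a message is modelled as a pair (order, signers), unforgeability is enforced by construction (a lieutenant can only append its own id to a chain it actually received), traitorous lieutenants are assumed to simply relay nothing, and the names `sm`, `choice` and `commander_orders` are made up for illustration.

```python
def choice(orders):
    """Obey a lone order as-is; fall back to retreat for zero or several orders."""
    return next(iter(orders)) if len(orders) == 1 else "retreat"


def sm(m, n, commander_orders, traitors):
    """Simulate the signed-message algorithm for generals 0..n-1 (0 is the commander).

    commander_orders maps a lieutenant to the order the commander signs and
    sends to it; a traitorous commander may send different orders, or none.
    Returns {loyal lieutenant: chosen order}.
    """
    lieutenants = range(1, n)
    seen = {i: set() for i in lieutenants}  # the set V of each lieutenant
    # Round 1: every delivered message carries only the commander's signature.
    inbox = {i: [(commander_orders[i], (0,))]
             for i in lieutenants if i in commander_orders}
    for _ in range(m + 1):
        outbox = {i: [] for i in lieutenants}
        for i, messages in inbox.items():
            if i in traitors:
                continue  # assumed traitor model: traitorous lieutenants stay silent
            for order, signers in messages:
                if order in seen[i]:
                    continue
                seen[i].add(order)
                # signers = commander plus k lieutenants; relay only while k < m
                if len(signers) - 1 < m:
                    for j in lieutenants:
                        if j != i and j not in signers:
                            outbox[j].append((order, signers + (i,)))
        inbox = outbox
    return {i: choice(seen[i]) for i in lieutenants if i not in traitors}
```

With 3 generals and a traitorous commander who sends attack to lieutenant 1 and retreat to lieutenant 2, both lieutenants end up with V = {attack, retreat} and make the same choice, which is exactly the case that oral messages cannot handle.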
How about missing communication paths?
The previous algorithms assume a completely connected network. This assumption
can be relaxed: the paper gives an algorithm for BGP with m traitors that works if the graph
of generals is 3m-regular (in the paper's sense of regularity).
How to use the solution of BGP to build real life systems?
Why is BGP important? The approach to reliable systems is to use redundancy to protect
against failure, either random or malicious, of individual parts. The conditions
for a reliable system mirror those of BGP. The system must guarantee that:
- All non-faulty processors use the same input value.
- If the input unit is non-faulty, then all non-faulty processors use the value it provides.
A BGP solution can be applied to satisfy these two conditions. A faulty input device
may generate meaningless inputs, but BGP guarantees that all processors use the same
meaningless values.
Three solutions for BGP are presented, but they are stated in terms of Byzantine
generals instead of computer systems. Are the assumptions made for these three solutions
valid, or even reasonable, for real-life computer systems?
- No message loss. In real life, link failures occur. However, link failures are
indistinguishable from processor failures, so we can count each link failure as
one of the m faults. The signed-message algorithm is insensitive to link failures because
no message can be forged even if links fail.
- The network is completely connected, so we know the sender of a message. This is not
necessary if messages are unforgeable.
- Message loss can be detected. In an asynchronous system, this condition cannot be
satisfied. However, in an asynchronous system, no deterministic agreement algorithm can
tolerate even a single failure. So do we really care about asynchronous systems?
- Signing a message has 2 aspects:
  - If a processor is non-faulty, then no faulty processor can generate its signature S(M).
  Taken literally, this is not possible, because a faulty processor can just record S(M) and
  re-send it whenever it wishes. A way to circumvent this is to include a timestamp or
  sequence number in M, so that re-sending does no good.
  - Given M and X, anyone can verify whether X == S(M). This is doable in the real world.
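The sequence-number idea for defeating re-sent signatures can be sketched as follows. This is an illustration under loud assumptions: an HMAC from the Python standard library stands in for S(M), which means the verifier shares the signer's key. A real system would need a public-key signature so that anyone can verify without being able to sign. The `Receiver` class and the message layout are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, sender: str, seq: int, order: str) -> bytes:
    """Stand-in for S(M): a MAC over sender, sequence number and order."""
    payload = f"{sender}|{seq}|{order}".encode()
    return hmac.new(key, payload, hashlib.sha256).digest()


class Receiver:
    """Accepts a message only if its tag is valid and its sequence number is new."""

    def __init__(self, keys: dict):
        self.keys = keys      # sender -> key (assumed to be distributed out of band)
        self.last_seq = {}    # sender -> highest sequence number accepted so far

    def accept(self, sender: str, seq: int, order: str, tag: bytes) -> bool:
        key = self.keys.get(sender)
        if key is None:
            return False      # unknown sender
        if not hmac.compare_digest(sign(key, sender, seq, order), tag):
            return False      # altered or forged message
        if seq <= self.last_seq.get(sender, -1):
            return False      # replay of an old, validly signed message
        self.last_seq[sender] = seq
        return True
```

Because the sequence number is covered by the tag, a faulty processor that recorded an old message cannot re-send it under a fresh number, and re-sending it under the old number is rejected as a replay.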
Conclusion
- BGP solutions work but are inherently expensive, especially in the number of messages:
the oral-message algorithm sends on the order of (n-1)(n-2)...(n-m-1) messages, which grows
exponentially with m. Optimizations exist, but they do not help much.
- It's a trade-off between performance and reliability. If we can accept less reliability
by making more (reasonable) assumptions about failure types, we can reduce the cost. If we
can't, too bad.
General Critique and Questions
- How reliable is reliable? What types of assumptions can we make about computer systems?
- How many synchronous systems are there? (SMP machines, and?) How about asynchronous
systems? Can we really have fault tolerance in such systems? So are BGP algorithms
useful at all?
- About connectivity: how many network topologies are p-regular?