CS514 Fall 2000: Homework 1

CS514 Fall 2000: Homework 3 (updated). Due in class on Tuesday, Oct. 3

1. This question relates to a famous mathematical result: Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374-382, April 1985. Note: you do not need to read the real paper itself. It will be quite adequate to read this homework assignment, or to read about the paper in the textbook. The actual paper may be hard to understand as it is highly mathematical, although very short.

Here is the definition of consensus as used in the FLP paper. We are given an asynchronous system, in which communication is reliable and there is no time bound on message delivery, nor are there any forms of clocks available. Processes interact only by exchanging messages. At the start of the problem, we are given a set of n>1 processes p₀ … p_n. Each process has an initial input, either 0 or 1. A consensus protocol is a rule for exchanging messages whereby processes that do not fail reach a “decision” state, all processes pick the same value (0 or 1), and that value was one of the input values (e.g. if all inputs were 1, we need to pick 1, and similarly if all were 0).

Processes fail by stopping, and we may assume that at most one process fails.

Basically, FLP proves that any protocol capable of solving fault-tolerant consensus can be prevented from reaching a decision. The authors prove this by showing that if the protocol is correct and fault-tolerant, an infinite execution that can arise. During this execution, the system works hard, yet the processes are prevented from agreeing upon the decision value.

In computer science, we often prove things by reducing a set of problems to a more basic problem, and then convincing ourselves that the more basic problem cannot be solved. Many problems, such as the transactional “commit” problem or the group membership problem are basically equivalent to consensus. That is, we can “encode” a consensus problem in such a way that if we can solve it, or can solve GMP, the same protocol could be used to solve consensus. These encodings are rather simple.

a) Show how to encode the consensus problem so that by solving group membership, we solve consensus. (Hint: Find a way to encode input values so some view reported by the GMP will have the decision value visible in it. You can assume that the processes are all initially members of the GMS, but that initially, their input values are unknown to one-another: the initial view lists {p₀, … , p_n}, but this view doesn’t tell anything about the input value seen by each of the p_i).

b) From FLP, we know that it is impossible to solve consensus with even one faulty process. From (a), we know that the Group Membership problem can be reduced to consensus, and that it is therefore impossible to solve the problem. Can you explain, in just a few lines, what this tells us? Is there a bug in the GMS protocol? If so, explain how that bug would manifest itself and what conditions can provoke it; if not, explain why not.

c) In responding to (b), you’ll conclude that the word “impossible” is being used in a way that may seen counter-intuitive, although it is standard for mathematical proofs to use this definition. Should this form of impossibility worry us, as designers of real systems, or are there other, more practical issues that subsume “possibility” and “impossibility”?

A short glossary of terms used above:

Reach a decision. In FLP, a processes “reaches a decision” if it can figure out what the outcome of the agreement protocol will be. If process P decides that the outcome will be 1, it knows that even if it (P) fails, sooner or later, every process that doesn’t crash will also figure out that the outcome is 1. In effect, P may have been first to realize what the protocol will ultimately do, but eventually, this is the inevitable outcome.

Prevented from reaching a decision. This means that processes exchange messages and execute code yet are not able to deduce the decision value. They are somehow stuck in an undecided status where they believe that both 0 and 1 are plausible outcomes for the agreement protocol. Keep in mind that the protocol is one that is supposed to eventually get everyone to agree on a single value. To be prevented from reaching a decision, some pattern of apparent failures, message loss, or delays would need to somehow fool the system into attempting to reconfigure or to re-run its decision-reaching algorithm, again and again.

The system works hard. Unlike a protocol that blocks, doing nothing, FLP proves that the system can be forced to send tons of messages and do lots of computing.

Correct and fault-tolerant. Their proof doesn’t apply to a protocol that isn’t correct (can sometimes make mistakes, violating the consensus rule) or that isn’t designed to tolerate a crash (by remaining available even though one program has crashed). Interestingly, their proof doesn’t actually involve any crashes. They delay messages or prevent processes from running from time to time, but nothing actually fails at all. Nonetheless, under the assumption that the protocol is fault-tolerant, they “know” that it must eventually reconfigure to exclude such a process, since it looks faulty. And this is their goal: to force an unending sequence of delays and reconfiguration attempts!

Proof. When we talk about proofs, we really mean mathematical proofs: problems stated precisely and rigorously, and things that simply have to be true because, well, you can prove it. No hand-waving allowed!