CS514 Fall 2000: Homework 1

CS514 Fall 2000: Homework 3 (solution set)

a) Show how to encode the consensus problem so that by solving group membership, we solve consensus. We are told to assume that the GMS reports an initial view of the system listing all processes, but lacking any information about their inputs. For example, the system might start with processes {A,B,C,D,E} in the initial GMS view.

The easiest way to solve this problem is to use abcast. Processes start execution, read their inputs, then send abcast messages that report these inputs to one-another. The decision value is the input value in the first abcast received. Since abcast is delivered in the same order everywhere, everyone picks the same value and it was one of the input values.. But using abcast is a bit of a cheat since this is more than just the GMP itself.

Suppose we start in the same way – the GMS reports an initial view – and now processes read their inputs. Next, suppose that a process wanting to share it’s input with the others does so by leaving the view, then rejoining using a slightly modified process “name” – namely, rather than A, it may rename itself A⁰, while B might rename itself B¹, etc. The first reported view in which one of these renamed processes is listed can be used to make the decision; if several renamed processes are listed in the view, decide on the input value reported by the least-ranked process in the view. Thus, if the first such new view is {C,D,B¹,A⁰} we would “decide” on value 1, the input seen by process B. Since GMP reports identical views in an identical order, all processes decide the same value, and it was one of the input values.

To avoid having all processes leave simultaneously, which would knock out the primary partition, one option is to modify GMP to support a “rejoin_as” primitive, which combines the functions “leave” and “join” into one action. If this seems unfair, since it changes the protocol, an alternative would be to leave GMP unchanged, but or use a timing mechanism to delay the leave-join sequences randomly, so that the odds of more than half trying to leave at the same time is acceptably small. For example, each process could sleep for a random amount of time between 0 and 10 seconds before checking its input. Since GMP is very fast (milliseconds), the odds that more than one change would occur at a time would be very small.

b) From FLP, we know that it is impossible to solve consensus with even one faulty process. From (a), we know that the Group Membership problem can be reduced to consensus, and that it is therefore impossible to solve the problem. Can you explain, in just a few lines, what this tells us? Is there a bug in the GMS protocol? If so, explain how that bug would manifest itself and what conditions can provoke it; if not, explain why not.

In any of the above examples, the bad case involves network partitions that prevent the GMS from achieving the necessary majority consent on views where a new coordinator is taking over. In effect, this provokes an endless series of views in which the current GMS coordinator is shown as leaving, then rejoining. The new GMS coordinator that takes over then is seen to fail, and rejoin, etc. FLP is basically a proof that a really nasty adversary could make this happen endlessly without ever letting a majority commit any particular view change. In effect, we might be forced to run the first phase of the 3PC protocol for changing GMP coordinator repeatedly. But the bad scenario is very, very unlikely. It involves knowing exactly what messages are being exchanged and triggering partitioning in a very specific way. This could never happen in the real world. There isn’t a bug in the GMS protocol – it simply works very hard in this case without being able to commit the next primary view.

c) In responding to (b), you’ll conclude that the word “impossible” is being used in a way that may seen counter-intuitive, although it is standard for mathematical proofs to use this definition. Should this form of impossibility worry us, as designers of real systems, or are there other, more practical issues that subsume “possibility” and “impossibility”?

By impossible, FLP means “not always possible.” Yet this sort of situation is absurdly unlikely. For practical purposes, one would normally be happy with a solution that works well enough so that the risk of it not working – perhaps, getting into this kind of a live-lock – is lower than the probability of something else happening, like a flood in the machine room. Once this sort of threshold is crossed, we can switch our focus to the plumbing, because failures are unlikely to originate in our agreement protocol!