CS514 Fall 2000: Homework 3 (solution set)
 
 
a)      Show how to encode the consensus problem so that by
solving group membership, we solve consensus.  We are told to assume that the GMS reports an initial view of the
system listing all processes, but lacking any information about their inputs.  For example, the system might start with
processes {A,B,C,D,E} in the initial GMS view.
 
The easiest way to solve this problem is to use abcast.  Processes start execution, read their
inputs, then send abcast messages that report these inputs to one-another.  The decision value is the input value in the
first abcast received.  Since abcast is
delivered in the same order everywhere, everyone picks the same value and it
was one of the input values..  But using
abcast is a bit of a cheat since this is more than just the GMP itself.
 
Suppose we start in the same way – the GMS reports an initial view –
and now processes read their inputs. 
Next, suppose that a process wanting to share it’s input with the others
does so by leaving the view, then rejoining using a slightly modified process “name”
– namely, rather than A, it may rename itself A0, while B might
rename itself B1, etc.  The
first reported view in which one of these renamed processes is listed can be
used to make the decision; if several renamed processes are listed in the view,
decide on the input value reported by the least-ranked process in the
view.  Thus, if the first such new view
is {C,D,B1,A0} we would “decide” on value 1, the input
seen by process B.  Since GMP reports
identical views in an identical order, all processes decide the same value, and
it was one of the input values.
 
To avoid having all processes leave simultaneously, which would knock
out the primary partition, one option is to modify GMP to support a “rejoin_as”
primitive, which combines the functions “leave” and “join” into one
action.  If this seems unfair, since it
changes the protocol, an alternative would be to leave GMP unchanged, but or
use a timing mechanism to delay the leave-join sequences randomly, so that the
odds of more than half trying to leave at the same time is acceptably
small.  For example, each process could
sleep for a random amount of time between 0 and 10 seconds before checking its
input.  Since GMP is very fast
(milliseconds), the odds that more than one change would occur at a time would
be very small.
 
b)      From FLP, we know that it is impossible to solve
consensus with even one faulty process. 
From (a), we know that the Group Membership problem can be reduced to
consensus, and that it is therefore impossible to solve the problem.  Can you explain, in just a few lines, what
this tells us?  Is there a bug in the
GMS protocol?  If so, explain how that
bug would manifest itself and what conditions can provoke it; if not, explain why
not.
 
In any of the above examples, the bad case involves network partitions
that prevent the GMS from achieving the necessary majority consent on views
where a new coordinator is taking over. 
In effect, this provokes an endless series of views in which the current
GMS coordinator is shown as leaving, then rejoining.  The new GMS coordinator that takes over then is seen to fail, and
rejoin, etc.  FLP is basically a proof
that a really nasty adversary could make this happen endlessly without ever letting
a majority commit any particular view change. In effect, we might be forced to
run the first phase of the 3PC protocol for changing GMP coordinator
repeatedly. But the bad scenario is very, very unlikely.  It involves knowing exactly what messages
are being exchanged and triggering partitioning in a very specific way. This
could never happen in the real world. There isn’t a bug in the GMS protocol –
it simply works very hard in this case without being able to commit the next
primary view. 
 
c)      In responding to (b), you’ll conclude that the word
“impossible” is being used in a way that may seen counter-intuitive, although
it is standard for mathematical proofs to use this definition.  Should this form of impossibility worry us,
as designers of real systems, or are there other, more practical issues that
subsume “possibility” and “impossibility”? 
 
By impossible, FLP means “not always possible.”  Yet this sort of situation is absurdly
unlikely.  For practical purposes, one
would normally be happy with a solution that works well enough so that the risk
of it not working – perhaps, getting into this kind of a live-lock – is lower
than the probability of something else happening, like a flood in the machine
room.  Once this sort of threshold is crossed,
we can switch our focus to the plumbing, because failures are unlikely to
originate in our agreement protocol!