BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1328
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Process Membership in Asynchronous Environments
AUTHOR:: Ricciardi, Aleta M.
AUTHOR:: Birman, Kenneth P.
DATE:: February 1993
PAGES:: 40
NOTES:: Replaces 91-1188
ABSTRACT::
The development of reliable distributed software is simplified by the ability 
to assume a fail-stop failure model. We discuss the emulation of such a 
model in an asynchronous distributed environment. The solution we propose, 
called Strong-GMP, can be supported through a highly efficient protocol, and 
has been implemented as part of a distributed systems software project at 
Cornell University. Here, we focus on the precise definition of the problem, 
the protocol, correctness proofs and an analysis of costs.

Keywords: Asynchronous computation; Fault detection; Process membership; 
Fault tolerance; Process group.
END:: CORNELLCS//TR93-1328
BODY::
Process Membership in
Asynchronous Environments
Aleta M. Ricciardi*
Kenneth P. Birman*
TR 93-1328
(replaces TR 91-1188)
February 1993
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
*Authors supported by DARPA/NASA Ames grant NAG 2-593 and by grants from IBM
and Siemens Corporation.
Process Membership in Asynchronous
Environments
Aleta M. Ricciardi,
Kenneth P. Birman*
Cornell University
Department of Compnter Science
Ithaca, NY 1?853-75O1 USA
aleta?cs. cornell. edv, ken?cs. cornell. edv
FAX 6O7-255-??28
February 9, 1993
Abstract
The development of reliable distributed software is simplified by the ability to as-
sume a fail-stop failure model. We discuss the emulation of such a model in an asyn-
chronous distributed environment. The solution we propose, calied Strong-GMP, can
be supported through a highly efficient protocol, and has been implemented as part
of a distributed systems software project at Corneli University. Here, we focus on the
precise definition of the problem, the protocol, correctness proofs, and an analysis of
costs.
Keywords Asynchronous computation; Fault detection; Process membership; Fault
tolerance; Process group.
*Authors supported by DARPA/NASA Ames Grant NAG 2-593, and by grants from IBM and Siemens
Corporation.
1 Introduction
The development of distributed software is greatly simplified in environments where process
and communication failures are benign. For this reason, it is common for distributed systems
to be developed under the assumption that the communication network does not partition
and that processes are fail-stop [19, 20] --H that they fail only by halting, and that these
failures are detected accurately.
Unfortunately, real distributed environments are not entirely benign in these respects.
On the one hand, the assumption that programs fail by halting can be satisfied to a good
approximation by careful development methodology and testing. Similarly, most communi-
cation failures, such as message loss, corruption, out-of-order delivery, and replay, can be
detected and corrected at low cost, again with high probability. llowever, this is not the case
for failure detection and network partitions. Communication partitions are unavoidable in
networks, and when they occur, may mimic process failures.
It is well known that the consensus problem cannot be solved in asynchronous systems
subject to process failures [10], and this is often taken to mean that software for realistic
environments must live with some risk of inconsistent failure detection. A related result exists
for the database commit problem in the presense of partition failures [21]. A consequence is
that a great deal of the `fault-tolerant' distributed software used in contemporary networks
is at risk of some form of inconsistent or incorrect behavior when an action is based on the
apparent detection of a process failure.
That such inconsistencies are not very noticeable testifies to the ingenuity of systems
developers in building systems for which inconsistency is not a fatal condition, but also
to the extremely limited use of genuinely distributed programs in modern networks. Most
distributed software is based on one--Htime interactions between a client program and a server;
it is very uncommon to see distributed systems in which any form of continually evolving
distributed state is shared among multiple processes. In client-server systems, it is uncommon
that the detailed behavior of different programs would be compared; hence, inconsistencies
in how programs report and react to failures might not affect the `distributed' computation,
much less be noticed by a casual observer.
Unfortunately, the need to develop fault-tolerant distributed software with non-trivial
distributed state in modern computer networks is seen more and more often in modern
computer applications. One of us (Birman), through work with a distributed programming
environment called Isis [6, 5], has gained experience with a wide range of complex distributed
applications in settings such as telecommunications, factory automation, finance, scientific
computing and the military. In these domains one finds problems that are inherently dis-
2
tributed and require fault-tolerance, and also in which complex distributed state is needed
to operate the desired system correctly. For example, a telecommunications system must
react to failures of switching nodes in a consistent manner; inability to do this can cause the
system to deny service. A brokerage system may need to provide trading advice, based on
changing market conditions, to multiple traders. If two analytic servers are started because
some parts of the system incorrectly sense a primary server as having failed, different traders
may be given differing, inconsistent advice. In settings such as these, inconsistent behavior
can have significant implications and cannot be tolerated.
Similarly, modern distributed operating systems exhibit features that require accurate
failure detection. For example, the Mach operating system [15] identifies communication
endpoints using an abstraction called the communication port. Each port has a single "receive
right"? bound to one process. Rights to send data to a port can be passed among processes,
and are carefully tracked by Mach. Mach guarantees that communication to a port will be
reliable: if a successful outcome is reported to the sender, the message will not be lost unless
the destination fails, and a failure is reported only if the destination is faulty. Additionally,
Mach notifies the holder of a receive-right when all holders of send rights have deleted them
(or failed), and notifies the holder of a send right if the corresponding receiver fails.
Mach is widely cited for its simple and powerful communications model, and has emerged
as an industry standard. However, it is easy to see that this model cannot be implemented
in a way that is both safe and live: the only "safe" way to detect a failure is to wait for
the faulty process to restart, and this can introduce unbounded delays! At the time of this
writing, Mach waits for failed nodes to restart before reporting failures, hence even a single
failure could prevent the system from making progress.
Our work proposes an approach which, although subject to limitations stemming from
the impossibility results cited above, is nonetheless extremely powerful. The basic idea is
to substitute a logical notion of system membership for the physical notion of "operational"
or "failed". In our scheme, application programs define operational processes to be those
listed by the membership service and failed processes to be those not listed as members of the
system. To the extent that the membership service is able to report consistent information to
processes using it, those processes can then implement consistent, fault-tolerant distributed
algorithms.
Our membership service assumes a low-level mechanism that monitors the status of pro-
cesses. The membership service excludes any process from the system that this mechanism
suspects of having failed. If the removed process has not crashed (i.e. the suspicion was
incorrect or due to a transient condition that corrected itself), subsequent communication
from it to the remainder of the system is inhibited. In this way we prevent a "zombie" pro-
3
cess from contradicting the abstraction presented to the remaining system processes. Lastly,
a faulty process that recovers will be notified that it has been dropped from the system.
When physical partitions occur, our membership service prevents the system from log-
ically partitioning. More precisely, our scheme distinguishes the majority partition from
mtnonty partitions. By defining the state of the majority partition to be the true system
state and limiting the actions permitted in a minority partition, logically consistent behavior
can be guaranteed even when a partition occurs. During periods when a majority partition
cannot be constituted, our scheme might treat all partitions as minority ones, effectively
halting the system. The approach is thus one that provides rigorously consistent behavior
at all times, although it may not permit progress in infrequent situations caused by severe
network partitions.
As an example, our membership service could be used to overcome the failure detection
problems currently encountered in Mach. Mach could be made both safe and live if it were
modified to 1) report `apparent' failures to our service (thereby making Mach one of our
suspector mechanisms), 2) treat machines and processes as faulty only when our service
reported them as such, and 3) restrict communication to members of the system (as defined
by our membership service). The Mach communication guarantees would then be satisfied
even in networks where transient disruptions would sometimes cause Mach to suspect failures
application developers would prefer the environment our
incur indefinite delays.
incorrectly. We believe that most
service provides to one that could
Agreement on the membership of a group of processes in a distributed system is a clas-
sic problem, and has been treated elsewhere. Relevant prior research includes solutions for
database contexts [4], real-time settings [8], and distributed control applications [12, 5]. Cris-
tian [9j, specified and solved a problem similar to the one we consider here, but in contrast
to our work, he considered a synchronous setting. Our approach and solution focus on the
asynchronous case, but differ from previous work on group membership for asynchronous
systems. The membership semantics provided by Virtual Partitions [1] are weaker, allow-
ing multiple membership views to exists simultaneously, and requiring neither atomicity nor
uniformity in committing new views. These semantics however reflect a desire to maintain
replicated data availability; our goal is to provide a consistent, unique source of system-wide
membership information. In contrast to [13, 2], which also permit multiple membership
iTlirough experience with several hundred Isis applications, we have observed that the most common
partition case involves a single processor disconnected from its LAN. Network "bridge" failures are uncommon
in LAN settings, and it makes sense to treat WAN systems differently from LAN `s because a WAN has
different performance characteristics. Other systems have adopted different approaches to this issue, however,
such as in Dolev's Transis project [3].
4
views, we do not assume the existence of an underlying fault-tolerant atomic, ordered mul-
ticast. The protocol of Mishra, et.al. [14] also relies on an ordered multicast. In these cases,
the potential membership must be a static set of processes so that the multicast ordering
properties can be maintained. This makes handling process recoveries more straightforward,
but still requires additional mechanism to join newly-created processes. We consider only
point-to-point communication and an arbitrary, unknown set of system processes. We han-
dle joins arising from both process recovery and process creation with the same mechanism.
The protocol in Birman and Joseph [5] blocks during periods when failures and recoveries
occur continuously. Our solution is fully': we can process a constant flow of requests
to both remove and add processes, which is exactly what occurs in actual systems.
In Section 2 we describe our system model and the formal language we will use to specify
the Strong Group Membership Problem (Strong GMP). In Section 3 we specify Strong GMP,
and in Section 4 we present our solution, the S-GMP algorithm. Section 5 gives the main
part of the inductive correctness proof and discusses the protocol's message complexity and
minimality. We conclude by discussing the implications of our particular specification, and
directions for future work.
2 The System Model and Formal Logic
We consider only asynchronous distributed systems in which processes fail by crashing. Dis-
tributed means that the processors are physically separated and that processes executing
in the system communicate only by passing messages along a fixed set of channels. Asyn-
chronous means that the system has no global clock, and that there are no bounds on relative
local clock speeds, execution speeds, or message transmission delays. The asynchrony as-
sumption is realistic: system load, network traffic, and any other dynamic components of
the system that affect performance all conspire to violate synchronization assumptions.
Before defining the abstract computational model, we discuss the goals and effect of our
membership service for processes in asynchronous systems.
2.1 Membership Service Goals
This paper focuses on the events that occur at processes after the membership service is
already established (Section 4.2.4 discusses cold-starting the service). Our goal in building
this membership service is to provide processes in an asynchronous system subject to halting
failures, with an execution environment indistinguishable from a synchronous, halting-failure
system. Here, the term "indistinguishable" refers to the sequence of events observed by a
5
process white it is a member of the system. The situation for a process excluded from the
system is discussed below.
Our solution has the property that when a process, p, learns from the membership service
that another process, q, is no longer a member of the system, p can identify an event in its
execution after which it will never receive another message from q. For p, this is indistin-
guishable from q crashing and the membership service detecting it accurately. Moreover, our
solution constructs a consistent cut [7] along which every other functioning member of the
system will also learn that q is excluded. Consequently, p can take actions that depend both
on q having crashed and on all other processes learning this concurrently (just as it could
in a synchronous environment). In a normal asynchronous system, p would have neither
guarantee.
In our model, a new process, p, must join the system via the membership service before it
can interact with other processes. The service responds with the current system membership
list, and thereafter keeps p informed of each change to the list. For as long as p remains on
the list, it can send messages to all other listed processes, and communication appears to
be reliable and FIFO.2 Finally, our work has the property that all members of the system
observe exactly the same sequence of membership changes (join and leave events), even when
members of the membership service itself fail or join. Elsewhere [18] we show how this strong,
same-sequence property both simplifies distributed algorithms that take actions based upon
membership changes, and, somewhat paradoxically, actually helps in reducing the cost of
the membership service protocol itself.
Processes that genuinely fail do so by halting. We require that such a process is eventually
suspected of having failed, and then removed from the system list.3 Of course, no live failure
detection protocol for asynchronous systems can avoid mistakenly suspecting an operational
process and then removing it from the membership list [10]. Because exclusion from the
membership list will be equated with failure, such exclusions must result in executions that
are consistent with those in which the excluded process had, in fact, failed. Specifically, we
must suppress communication from a process that has been erroneously excluded. To this
end, in addition to FIFO and channel reliability assumptions, we assume processes sever
2It is well know that an underlying message transport system that uses sequence numbers, acknowledge-
ments, and retransmission can overcome message loss, duplication, and out-of-order arrival.
3While we are not concerned with the implementation of the failure suspicion module, this can be quite
inexpensive. In particular, because our membership service places a uniformly observed ranking on system
members it is not necessary that every process monitor every other process. For example, the scheme used in
the Isis system requires each process to monitor only the next-highest ranked process. This seems to imply
a linear cost, but because network speeds are very high the dominant cost turns out to be the overhead
imposed on processes, which is constant and unrelated to system size in this case.
6
communication paths with all others they believe faulty.4
From the excluded process's, say q's, point of view, it can no longer communicate with
other processes, but it can continue local computations. To illustrate the issues suppose q had
been the token-holder in a protocol that orders multicasts among a subset, 5, of processes.
Upon learning of q's failure, the remaining processes in 5 determine a new token holder, say
q', although q will continue believing it is the token holder. Since q can no longer make
its message ordering known to 5 , the fact that q's and q"s orderings may differ does not
violate the (observable) correctness of the message-ordering protocol. That q will `observe'
a different ordering than the rest of 5 is irrelevant.
2.2 System Requirements and Model Assumptions
To implement the FIFO and channel reliability properties we require two things of the
physical system. First, each message sent along a channel must have a non-zero probability
of reaching its destination intact, and second, each process must have a local, monotonically
increasing clock (i.e. counter). These two requirements suffice to implement live failure
suspectors, and a completely-connected network of reliable, FIFO channels. Our protocols
will assume this complete package of communication guarantees, but we are not concerned
with how they are implemented.
As soon as one processes, p, suspects another of having failed, it Disconnects all its com-
munication channels with the suspected process. Moreover, to hide as quickly as possible
an erroneous suspicion, p Gossips (for example, with piggy-backs) its suspicion to all other
processes in further communication, whereupon recipients adopt p's belief and also discon-
nect themselves from the suspected process. The Gossip and Disconnect actions combine to
isolate suspected-faulty processes among processes not believing each other faulty.
The [10] impossibility result can be interpreted as forcing applications in asynchronous
systems to either make accurate failure detections or be live. By choosing liveness, we
admit the possibility of erroneous failure detections, but by isolating mistakenly suspected
processes, we prevent them from further affecting the global system. As a result, q halting
and q mistakenly suspected to have halted are indistinguishable.
4Because the communication layer is asynchronous messages from an excluded process may continue to
arrive, and be rejected, for an unbounded period of time. The communication layer would also inform an
excluded process that it has been excluded, causing it to rejoin the system under a new process identifier.
The protocols needed to implement such a transport layer are evident and will not be presented here.
7
2.3 The System Model
Denote by Proc a countable set of process identifiers, [Th,p2,. .). The process name space
is infinite so that we can model infinite executions in which new processes continually arise.
However because there can be only finitely-many processors, and because process births
require non-zero time, the number of processes extant at any real time in an execution will
always be finite.
Processes may send and receive messages, and do internal computation. The event
send?(q, m) denotes p sending message m to q, and recvq(p, m) denotes q's receipt of m
from p. The distinct internal event crash, models the crash failure of process p, after which
only other crash, events are permitted. A history for process p, denoted hp, is a sequence of
events executed by p, and must begin with the distinct, internal event start,:
h			%ef			star			e2?? ek			k > 0
p			p			p p
We write e Ei hp when e is an event of hp. A cut is an n-tuple of process histories c =
(hp1,hp2,. . . , hp?), where p? Ei Proc. We restrict our attention to cuts determined by finite
subsets of Proc since these represent the system's global system state at some real time in its
execution. Each execution begins with the distinct cut, co = <start ,start ,. . . ,start >.
We also write e Ei c to abbreviate "e E hp for some p mentioned in c", and elaborate when
the context does not clearly distinguish the intention.
We assume familiarity with the happens before relation [11] between events (written
e H e'), and also with consistent cuts [7]. Henceforth we restrict the discussion to consistent
cuts, as they are the ones that are physically realizable. Consistent cuts are the possible
global states of an execution; while a given consistent cut may never have existed at any
point in real time, it is impossible for a cut that is not causally consistent to ever exist at
any point.
A characterization of global causality should incorporate the notion of progress between
global states. Specifically, we desire that every process either makes local progress or remains
stationary, none should regress. A process makes local progress between the cumulative states
represented by hp and hp' exactly when hp is a prefix of hp'.
Definition Given c = ....... ....... , hn) and c' = ....... , ..... . , h'?), c causally precedes
c' (written c < c') if and only if for each process, p either 1) hp = h'?; or 2) hp is a strict
prefix of
Observe that there are (infinitely) many completions for any given cut. In this sense, the
future of any cut is uncertain; it may branch out in many directions. On the other hand,
c < c' implies that any execution in which c' is a prefix must also contain c as a prefix.
8
Pi
P2
p3
p4
aM
a
c
Figure 1: hpj is a strict prefix of hp' for each p? so c  c'.
Definition Let c = (h1,. . . ,hp,. . . ,hn) and c' = (h'i,. . . ?h'p? . , h?') be consistent cuts.
Then c strictly precedes c' (written c < c') if and if c < c' and c ? c'; the cut c very strictly
precedes c' (written c  c') if and only if hp is a strict prefix of hp' for each p mentioned
(Figure 2.3). I
2.4 The Modal Logic
So far, our description of Strong GMP refers to when core members agree on the group view,
as well as the degree of simultaneity with which they do so. A temporal modal logic allows
us to express these notions. Unique to our logic is its attention to asynchrony - the basic
semantic entities of the logic are consistent cuts. We briefly describe the temporal modalities
we use to specify Strong GMP.
Given a propositional formula, 4), and the < relation between cuts, the formula 54) holds
along cut c precisely when 4) holds along all future cuts in all runs containing c (i.e. every
c' such that c < c'). ?4) holds along c when 4) will hold along some future cut in every run
containing c. We interpret ? as "inevitability" ?4) holds along c if 4) held at some c' ?
and H4) if 4) held along all c' <c.
3 Strong Group Membership
We now formally define the Strong Group Membership Problem for asynchronous systems.
Our definition specifies how to coordinate local events among a group of processes so that the
group's externally observed behavior is indistinguishable from that of a single, fault-tolerant
process. Thus, any solution to Strong GMP can be used to build a system membership
9
service (which we call a Membership Resource Manager, or MRM). In this section, and in
the rest of the paper, we restrict our focus to the core processes implementing the MRM --H
the formal problem describing their actions, and the algoritlim solving this problem. Thus,
we describe a hierarchical approach to building a Strong GMP membership service, in that
our protocol is run by a small core set of processes, which use a cheap replication scheme (e.g.
the Isis replication tools) to maintain a fault-tolerant member list for the overall system.
Creating the illusion of a single fault-tolerant process means that core members must
agree, not only on the entire system membership, but also on the composition of the MRM
core. A core member that fails or is otherwise removed must be consistent with the rest of
the core while it is a member. More important, a core member that is removed from the
core but has not halted must not be able to misrepresent the system state; our specification
must preclude such a process from changing its local view of the system's membership or
the core's membership independently.
3.1 Formal Specification
The formula UP,, holds along a cut if and only if p has not executed crash,, in its local history
component of that cut. Conversely, DOWN,, holds along c exactly when p has crashed in c.
The indexical set Up(c) in an asynchronous run A is the set of all processes that have not
crashed along c: Up(c) ?P I UP,, holds along c?.
Process p executes the event faulty,,(q) as soon as it suspects q faulty; whether p comes
to suspect q through some local observation or through our Gossip assumption (Section 2.2)
is immaterial. Some time after recording faully,,(q), p will execute the event remove,,(q).
The distinction between these events is significant: fanlty,,(q) represents p's belief in q's
faultiness, which may be incorrect, while remove,,(q) is actual removal of q from the set of core
members p believes operational. The formula FAuLTY,,(q) holds along all cuts that include
favlty,,(q), and REMovE,,(q) along all cuts that include remove,,(q). Analogous statements
hold for events operating,,(q) (p believes q is functional) and add,,(q) (p adds q to the set
of core members), and formulas oPERATiNG,,(q) and ADD,,(q). In contrast to FAULTY,,(q),
oPERATiNG,,(q) is not stable.
The local membership view for process p along cut c =(h1,... , ..... . ,hn), (denoted
LocalView,,(c)), is the set of processes p obtains by sequentially modifying its initial mem-
bership list according to the remove,,() and add,,() events in h,,. We use LocalView,, when the
cut is clear from context. Trivially, we require p E LocalView,,(c). The formula iN-LocAL,,(q)
holds along all cuts, c, such that q E Localview,,(c). Because hp is linear, it makes sense to talk
about the xth version of p's local view, which we denote Localview,,x. Finally IN?LocAL,,z(q)
holds when q E LocaIView,,?. The formula NoTDEF?D(Locaiview,,x) holds along c if p has not
10
(yet) defined it's xth local view.
We extend local views to group views as follows. Given 5 c Proc and a consistent cut
c, if the local views of all the functional processes in 5 are identical, the group view is the
agreed-upon local view; if 5 has no functioning members or if the functioning members of
5 have different local views, the group view is undefined. We say that 5 determines a group
view. Formally:
Definition Given a consistent cut c and a set of processes, 5 C Proc, the group view
determined by 5 along c is
GpView5(c) =
LocaIView? (c)
undefined
A(?,q? (SnUp(c)) #?
(LocalViewp(c) = Localviewq(c))
otherwise.
The formula uNDEF'D(GpView5(c)) holds along c if the local views of any functional
members of 5 disagree or if 5 A Up(c) =
Constraining Membership in GpView5(c)
The definition of GpView5 (c) is crucial to the class of Group Membership Problems so it
is worthwhile discussing how the sets 5 and GpView5(c) relate. Recall that GpView5(c) is
the abstraction we are using to define the single, fault-tolerant process illusion that will be
used to build an MRM.5 In this light, MRM core members are precisely the members of
GpView5 (c).
If q E (GpView5(c) A --H5) then q is a core MRM member whose local view is not used in
determining either the MRM composition or the total system membership. Specifically, q's
local view is not constrained by the definition of GpView5(c), so LocalViewq(c) need not be
identical to GpView5(c). Because q replies to MRM client requests based on its local view,
its replies will contradict other core members' replies when LocaiViewq (c) # Gpviews (c); the
single-process illusion falls apart. Consequently, unless every core member's local view is
used to determine the MRM group view, the MRM cannot guarantee global consistency.
To avoid this, our speAfication forces q to be in 5 whenever it is in GpView5 (c)
(GPviews(c) A UP(c)) c			A UP(c)).
5In practice, the group view, and therefore each core member's local view, includes the entire system
membership in addition to the MRM composition; here we are only concerned with the MRM composition.
11
The reverse inclusion follows trivially since p E LocaIView?(c). Consequently, our specifica-
tion requires S = Gpviews(c).
To finish the single-process illusion, the MRM must be unique. We will therefore also
require that there be at most one set satisfying this equality along any consistent cut. Since
there can only be one MRM, some form of quorum consensus is needed to change the global
system membership. If a quorum cannot be attained (for example during certain partitions),
no solution to Strong GMP can make progress.
Finally, GpViews(c) is defined if and only if the local views of all its functioning members
agree. Processes that are eventually removed from the core are not excused from having
consistent views while they are members. Moreover, a core member that is removed but
has not crashed cannot be allowed to change its local view. Our specification captures
these safety issues in two clauses: GMP-2 formalizes the Uniqueness requirement for group
views, and GMP-3, by requiring every local view to exist as a group view, prevents excluded
processes from taking actions unilaterally.
3.2 Strong GMP Specification
We now have the language necessary to formalize Strong GMP. Since formulas are evaluated
along cuts we drop references to cuts in indexical sets.
GMP-O (Base Case) An initial group view eventually exists:
?			v			s0 = GpViews0()
S0cPrnc
GMP-1 (Validity) Processes do not make changes to their local views capriciously. For
example, if q were once, but is not currently, in LocaIView? then p should believe q
faulty.
a. (?iN?LocALp(q) A ?iN?LocALp(q)) ? FAULTYp(q)
b. (?iN?LocALp(q) A iN?LocALp(q)) ? ? oPERATIN%(q).
In contrast to FAULTY?(q), OPERATINGp(q) is not stable.
GMP-2 (Uniqueness) Non-null group views are unique along all consistent cuts.
v			(GPViews() =			A UNDEF'D(Gpviewsi())
12
The formula IN?GPp holds along all cuts, c, such that p E GpViews(c) (when it is
defined); OUT?GPp holds when p ? Gpviews(c) (also provided Gpviews(c) is defined).
GMP-3 (Sequence) All processes exhibit the same sequence of local views, provided the
views are defined. Moreover, there is a sequence of cuts along which each local view is
a system view:
0/4Ap?)
IN?LocALpx(q) ? DOWNq V (Locaiviewq() = LocaIView?() = LocaIview?x
(?iN?LocAtpz(q) A ? i???oc???(q)) ? 0 NoTDEF `D( Localviewxq)
GMP-4 (Liveness) For each event fauUy?(q) (respective, operating?(q)) and each process
p E Gpview:, eventually either p is removed from the group view, or q is removed from
it (respective, added to it):
a.			FAULTYp(q) A IN?GPp ?			q V ?OUT GPp)
b. oPERATiNGp(q) A IN?GPp ? (?N?GPq V ?oUT?GP).
GMP-3 is equivalent to requiring that each local view eventually becomes a group view.
The presence and placement of the ? modality forces a group view to exist along some
consistent cut in an execution. This too, is why we cannot bind LocalViewq to a version
number. Local views indexed by version numbers are static --H the composition of a process's
xth local view will never change. If c is the witness cut for ?, then omitting the version
superscript forces LocalViewq (c) (for every q Ei LocaIview?x) to be identical to LocaIview?x at
least along c. liad we included the version number in the equality clause, we would not have
been able to conclude that group views necessarily exist, since the local views need not have
been identical simultaneously.
Finally since each process executes at least one event between local views x and x + 1,
the corresponding group views will exist along cuts that are related by ?, so it makes sense
to talk about the xth group view, which we denote Gpview:.
4 A Protocol Solving Strong GMP
Our solution to Strong GMP, the Strong Group Membership Protocol (hereafter s-GMP),
is asymmetric and centralized: a distinguished core member, denoted rngr, coordinates
13
mgr
p
q
r
M-sub(--Hq)			M-com(--Hq)
Phase I			Phase II
Figure 2: Two-Phase Communication Structure of Simple S-GMP.
updates among all core members' local views. In a symmetric, distributed solution [14, 3]
all core members would behave identically and make updates independently. We chose the
centralized approach for two reasons: it requires only 0(n) point-to-point messages, instead
of 0(n2), and it is a simpler paradigm within which to reason. While mgr `s failure is more
troublesome to handle than an outer (non-mgr) member's, the benefits of the centralized
approach, coupled with the low probability of the myr failing outweigh these concerns.
An important aspect of S-GMP is the lack of restrictions on changes to a group view.
Specifically, there is no upper limit on the number of processes one can add to Gpviewx to
form Gpview:+l; if removing processes from Gpvlewx, the upper limit is the size of the
largest minority subset of Gpviewz. This flexibility broadens fault-tolerance, and enables a
membership service defined by Strong GMP to adapt quickly to changes in system load. The
Appendix contains the complete protocol.
4.1 Simple S-GMP
While we assume in this section that myr does not fail, the protocol we present has a
more complicated communication structure and degree of coordination than this assumption
warrants. Indeed, if we knew mgr could not fail we would already have a single, fault-tolerant
process. Anticipating mgr `s failure simplifies describing reconfiguration in Section 4.2.
When myr suspects an outer member's (or some subset of outer members'), say r's,
failure, it initiates a two-phase update algorithm. In Phase I (Figure 2) mgr proposes q's
removal by multicasting a submit message, M-sub(--Hq), to the members of its local view
(multicasts are not failure-atomic). rngr then waits for each member to respond, or to start
14
believing a member faulty. In this way, at the end of Phase I, all core members that mgr does
not believe faulty, believe q faulty. If mgr receives responses from a majority subset of its
current local view, it multicasts a commit message, M-com(--Hq), in Phase II;6 mgr must block
if it does not receive a majority response. If local views are identical at the beginning of this
protocol, because mgr is a single process, local views are identical at the end of it.
The submit message coordinates belief among the core in q's faultiness; the commit
message tells outer members that the group has reached agreement on q's failure and that
they should now remove q from their own local views. However, because mgr does not receive
responses from outer members it believes faulty, it cannot know whether these members
received its submit message. From mgr `s perspective, these members may not be aware
of the current update to the group view, rendering core-wide agreement on the new view
contingent upon the subsequent removal of these `faulty' members. The Gossip assumption
ensures that operational outer processes become aware of such contingencies.
When adding to the group view, mgr sends the new process(es) p a State-X?er message
giving p permission to join and informing p of all relevant system state. mgr awaits a reply
(or suspicion of p's faultiness) and then multicasts the commit message to the entire new
group. To simplify bookkeeping, new members begin with local version equal to the group
version in which their addition resulted.
4.2 Full S-GMP
When mgr is believed to have failed the outer members execute a reconfiguration algorithm to
select a new coordinator and, if necessary, reestablish the group view. Local view agreement
may be lost, for example, when mgr fails in the middle of a M-corn() multicast. In Figure 3
local views differ along the second cut so the group view is undefined.
Reconfiguring successfully involves solving two problems: succession --H which process(es)
should initiate reconfiguration and which should assume the mgr role at the end; and pro-
gression --H which update should a reconfiguration initiator propose to resolve core members'
inconsistencies?
A reconfigurer must be able to determine the last defined group view and propagate the
correct proposal for the succeeding group view. Extrapolating from Figure 3, we see that
proposals may also be partially known among the current group view.
The most difficult aspect of reconfiguring involves invisible commits. An invisible commit
occurs when the only processes receiving a commit message fail, or are believed faulty by the
6Typically a phase of communication consists of a multicast from a single process to a group of processes
and their responses back to the initiator. In fact, Simple 5-OMP is one-and-one-half phases, but this is
awkward.
15
mgr
Unlike the mgr -initiated algoritlim, reconfiguration requires three phases in the worst case.
This is an outgrowth of the Sequence requirement (GMP-3) and the possibility of invisible
commits; we discuss this below in more detail. For simplicity, we present the algorithm here
as always using three phases ([17] discusses the cases when two suffice).
For p with ver(p) = x --H 1, let NextUpdate? be the tuple [< v, x >, rank(?)]?, where v is the
value p is waiting to commit to form LocaIviewx?, and rank(i) is the rank (in LocaIview?x?l)
of the initiator that submitted <v, x>. LastCommit? is the value p committed to form
LocaIviewx??l. state(p) is p's local state information: ver(p), NextUpdate?, and LastCommit?.
In the first phase, the initiator r, multicasts a reconfiguration interrogate message,
R-int(state(r)), to its local view. The reconfigurer then awaits responses from the outer
processes, or its own belief in their faultiness. Upon receiving R-Int(state(r)) a core mem-
ber that is lagging behind r adopts r's local state as its own (committing the appropriate
value, and so forth). Every core member, whether it just updated its local state or not,
responds to the reconfigurer with its current local state, state()
If a majority respond, then r uses the information it received to determine an update
p
the same set of processes.
Figure 3: mgr `s Failure Results in Undefined Group View.
rest of the group. This is significant for reconfiguration: while no subsequent reconfigurer
will ever know whether these processes committed the change to their local views, GMP-3
requires that if an invisible commit did occur, the remaining core members must behave
consistently. It is imperative, then, that every invisibly committed update be detectable by
every reconfigurer. We can ensure this only if all initiators (whether mgr or a reconfigurer)
attempting to install the ?th group view vie for the requisite majority responses from among
GpView
x
4.2.1 The Reconfiguration Algorithm
r
LocaIview?x+l
16
value, say v, and version number, say x, whose execution would result in a new group
view. The initiator multicasts this event as the Phase II reconfiguration submit message,
R-sub(< v, x>). After obtaining a second majority response acknowledging R-sub(< v,
r multicasts the Phase III recon?guration commit message, R-corn(< v, x>). Again, major-
ity response to R-int(state(r)) and R-sub(< v, x >) are essential in maintaining GMP-2 and
GMP-3; without either, r must block. If the committed operation is the addition of a set
of processes, say Q, then q e Q must respond with any pending NextUpdateq value it may
have (Figure 4). This is necessary to maintain GMP-2 and CMP-3 in cases where Q had
already joined at the behest of a previous intiator, for example mgr. If mgr had been able
to propose and additional update to Q, say M-sub(< +R, x + 2>), it may also have been
able to commit < +R, x + 2> invisibly to r and Q.
Definition An update initiator (either mgr or a reconfigurer) is successful for a submission
(M-sub(< v, x >) or R-sub(< v, x >)) if a majority subset of the initiator's local view respond
to the submission. In this case, we say the submitted value is stable.
A successful initiator is able, if it does not fail, to commit the value it submitted. In this
light, GMP-3 means that all successful version x initiators must make identical proposals.
The local state information collected during reconfiguration Phase I must allow a reconfigurer
to determine the correct update proposal unambiguously.
In S-GMP all successful reconfigurers attempting to install (or complete the installation
of) the ?th group view propagate mgr `s proposal if they become aware of it; they propose
mgr `s removal if they do not. Unfortunately, as Figure 5 makes clear, asynchrony and
inopportune failures can result in there being two different proposals for the same instance
of the group view. There, reconfigurer r1 does not learn of mgr `s proposal, <v, x >, and so
proposes mgr `s removal for version x (as dictated by Procedure DetermTheProposal in the
Appendix). The subsequent reconfigurer r2 learns of both proposals and must then decide
which to propagate. Correctness requires that only one of the two proposals become stable,
and that any non-blocking reconfigurer be able to determine which one it is by the end of
Phase I (we discuss how a reconfigurer determines this in Section 4.2.3). By propagating the
stable submission, a reconfigurer forces the entire group to act consistently with any invisible
commits.
4.2.2 Rules of Succession
We solve the succession problem by imposing a deterministic, linear ranking on core members
based on seniority in the group view --H `older' core members are ranked higher. This is sensible
only if group views are unique and agreed-upon. Let rank(p) denote p's rank. Whenever a
17
r determines R-sub(< +Q,x + 1 >$ -
but may next M-sub(< --Hmgr , x + 2>)
R-com(<+Q,x+1 >) then
M-sub(< +R,x+2>)
M-com(< +Q,x + 1>)
M-sub(< +R, x +2>)
State-Xfer
Q
mgr
r
Nex?+tU?P,d;+te?, ?ank(mgr)]
R--H int (state(r)
Figure 4: A Situation Requiring New Core Members to Report NextU pdate to Initiator
(mgr `s commit message cannot reach any processes except Q)
18
r1
R
s
r2
T
myr
R-int(...)
R-sub(? --Hmgr,x>)
\`t1 said,?R--Hsub(< --Hmgr,x
R-int(...)
"mgr said,'N--Hsub(< v, x
M-sub(? v,x>)
Figure 5: Reconfigurer r2 Learns of Conflicting Proposals for Gpviewr
process is removed from the group view, the ranks of all higher-ranked processes are decreased
by
one.
A process initiates reconfiguration when it believes all others ranked higher than itself
are faulty. That is, given cut c and LocaIView?(c),
INITIATE(p)			A			(rank(q) > rank(P))			FAULTY?(q)
qELocaIViewp(c)
While initiating reconfiguration on INITIATE(p) can lead to multiple reconfigurations, it
guarantees at least one process will undertake reconfiguring. Consider Figure 6 in which
rank(mgr) = p, rank(p) = p --H 1, and rank(q) = p --H2, and both p and q believe rngr faulty.
In the second scenario q expects p, which has crashed, to initiate a reconfiguration; any
solution must ensure that q eventually comes to suspect p faulty. In S-GMP, q times-out
waiting for p's R-int() message, surmises FAULTYq(p), and then initiates reconfiguration.
In the third scenario both p and q initiate reconfigurations. S-GMP must also ensure view
uniqueness in the face of multiple, concurrent reconfiguration attempts.
19
Scenario UPp FAULTYq(p) INITIATE(q) INITIATE(p)
1st			True			False			False			True
2nd			False			False			Eventually			False
3rd			True			True			True			True
4th			False			True			True			False
Figure 6: Initiating Reconfiguration: FAULTYp(mgr) A FAULTYq(rngr ).rank(mgr)
rank(p) + 1 rank(q) + 2.
Q
q
r
R
faultyQ (r)
R--H int (state(q))
R--Hint (slate(r))
faultyR (q)
Figure 7: Majority of Responses Needed
4.2.3 Rules of Progression
To understand the difficulties in reconfiguring we examine GMP-2 and GMP-3 more closely.
Uniqueness requires that at most one group view exists along any consistent cut. In the
situation depicted in Figure 7, Q and R are subsets of Gpviewx, and q and r are both
initiating reconfiguration. If all members of Q believe r faulty, the Disconnect assumption
means they will receive none of r's messages. Analogous statements hold for the members of
R regarding q. If r's proposal differs from q's then the members of R will commit a different
value than the members of Q. If R u trj eventually remove all of Q U fqj, and Q U ?q?
eventually remove all of R u trj, two distinct group views will exist.
Naively, it would appear that the majority requirement suffices to ensure Uniqueness.
However, as Figure 3 makes clear, initiators that may end up installing (submitting and
20
mgr
p
q
r
M-corn(< v,x>)
R--Hint(state(r))
Figure 8: Value <v, x> Committed Invisibly to p, q, and r
committing) the same group version need not begin recon?guration with identical local
views and so may be seeking majority approval from different sets of processes.
Reconfiguration Phase I Responses
Outer processes' responses to R-int(state(r)) must allow r to determine the nature and
composition of all local view inconsistencies, including inconsistencies involving core members
that did not respond to r. Local view information alone is insufficient to satisfy GMP-3
(Sequence) as invisible commits are not detectable.
In Figure 8, <v, x> is committed invisibly to p, q, and r. Since all three have identical
local views, r will not detect the actual discrepancy. However, p is aware of myr `s intention
to commit <v, x >, and p can envision a situation in which mgr succeeded in doing so
and then failed (in this case, the situation that actually occurred). If p were to forward
mgr `s intention to commit < v, x >, r would then envision the same situation and propagate
<v, x> as its Phase II submission. Thus, in addition to its local view, an outer member
must also report how it expects to change its local view next.
We have described how a reconfigurer may discover two different values were proposed
for the same group version. In S-GMP the reconfigurer propagates the value proposed by the
process of least rank among those making proposals (Procedure GetStableProposal in the
Appendix). Proposition 5.7 proves this choices ensures GMP-2 and GMP-3.
21
4.2.4 Membership Service Startup
Our approach depends on the initial group view being unique, and this is difficult to guar-
antee in asynchronous systems. We use a heuristic borrowed from previous versions of the
Isis system. Briefly, "cold-start" of the MRM is limited to a small, known set of sites. To
cold-start, a process first queries these locations to determine whether any others have begun
the cold-start procedure. It continues the cold-start procedure only if it determines no oth-
ers have begun, or if it "outranks" all processes that have concurrently begun cold-starting.
Because we iterate this procedure the probability that two cold-starting processes remain
unaware of each other diminishes with each round. After a suitable number of successful
rounds a process determines it should start the MRM. Although probabilistic, we find this
scheme highly successful in practice.7
5 Correctness
The proof that S-GMP correctly solves Strong GMP is inductive. In this section we present the
more interesting theorems of the inductive step. We show that if Gpviewx?i is uniquely de-
fined, s-GMP results in exactly one value being committed among the members of Gpviewx?i to
obtain Gpviewx. That s-GMP satisfies GMP-2 and GMP-3 follows from there.
The major steps in the proof are, first, showing that all initiators attempting to install
Gpviewx do so starting from LocaIview:??l As a result, all such initiators compete for
majority approval from the same set of processes. We use this result when we show that a
reconfigurer knows which of the two proposals it may learn of could not have been stable.
While the other proposal may not, in actuality, be stable, by choosing to propagate it, the
reconfigurer cannot possibly act inconsistently with the subset of the core that is `invisible'
to it. Stated another way, we show that all successful initiators propose the same value for
GpViewz, and that this value is the only one that can possibly be committed.
For brevity, we do not prove all propositions; the full proof of correctness is in [17].
5.1 The Inductive Step
As in Section 4.2.1, NextUpdate? is the tuple [< v, ver(p) + 1 >, rank(i)]?. For each p, Gossip,
Disconnect and INITIATE() mean that NextUpdate? is always the proposal of the lowest-ranked
initiator from which p received proposals for version ver(p) + 1.
7In several years of wide use no problems have ever been traced to the restart scheme. Note also that
limiting cold-start to a single, known, site suffices to guarantee uniqueness of the initial view but unfortu-
nately this scheme is now vulnerable to a liveness problem: we may be unable to restart the system after a
crash.
22
For process r multicasting message m, Acks(r, m) is the set of processes from which r
receives a message acknowledging, or in response to m. Let
Aheadr def E Acks(r,R-int(state(r))) A (ver(p) > ver(r))?
Proposition 5.1 If r is a reconfiguration initiator with ver(r) x --H 1, then for every p
responding to R-int(x --H 1), x --H 1 < ver(p) <x. I
For process p, let FauIty? = fq IN-LocAL?(q) A FAULTYp(q)?.8 We say Gpviewx?i is p-
defined along cut c if p knows at c that every process in LocaIView?(c) --H FauIty? has defined
its (x --H 1)st local view. Of course Gpview:?l may not be defined globally, but from p's point
of view, Gpview??i is (or has been) defined. For a reconfigurer r, Gpviewz?l is r-defined
at the end of Reconfiguration Phase I if every process in Acks(r, R-int(x --H 1)) --H Faultyr
reported a local version at least as large as x --H 1
Proposition 5.2 Let r be a reconfiguration initiator. Then r proposes version x if and only
if Gpviewx?i is the most recent (i.e. highest-numbered) r-defined system view at the end of
Reconfiguration Phase I.
Proof Follows from analyzing procedure DetermineProposal in Section 7
From Proposition 5.2 we infer that an initiator attempting to install version x has local
version either x --H 1 or x. We now show it can only have local version x --H 1.
Proposition 5.3 For any initiator, r, if r proposes <v, x >, then ver(r) = x --H 1.
Proof The proof is trivial when r = mgr, so suppose r is a reconfiguration initiator, with
ver(r) > x. When r multicasts R-int(state(r)), any process p lagging behind r adopts r's
local state as its own.9 Thus, when it responds to r's interrogate message, state(p) = state(r)
making ver(r) the most recent r-defined version. From Proposition 5.2 r would then propose
some value for version ver(r) + 1, and not version x. On the other hand, ver(r) < x --H 1 is
impossible if Gpviewx?l is the most recent r-defined view.
Hereafter, we use sub() and corn() to denote generic submit and commit messages irre-
spective of the initiator's role (myr or reconfigurer).
8FauIty? is implicitly indexical.
9It turns out that any process r does not believe faulty at the end of Phase I will have local version at
least ver(r) --H 1 when it receives R--Hint(sta?e(r)).
23
Vx+2
Vz+1
p
Aheadr
r
com(tv?+i)
sub(+v?+2)			Acks(r, R-int(x))
`A			Acks(r,R--Hint(x))
`A
R-int(x)
r is waiting to commit <Vx+1, X + 1 >
Figure 9: Possible divergence: r is a potentially successful reconfigurer, and p is an initiator
that could commit < Vz+3, X + 3> without r learning the value Vx+3.
24
To illustrate the difficulty in proving Sequence (GMP-3) consider the following situation,
depicted in Figure 9. Let r be a reconfigurer with ver(r) = x, and let Acks(r, R-int(x)) (we
use R-int(ver(r)) rather than R-int(state(r)) to get explicit reference to r's local version) be
a majority subset of Localviewzr. Proposition 5.1 means the largest version number observed
among r's respondents is x + 1, so suppose Aheadr is non-null and let p be a process from
which some member of Aheadr received corn(< Vx+i, x + 1 >). Suppose further that p also
proposed a value, <Vx+2, x + 2>, for version x + 2 to which every member of Aheadr re-
sponded. Making matters worse, r can imagine all the processes that did not respond to
its own R-int(state(r)) message may have responded to p's sub(< Vx+2, x + 2 >). It may
then be the case that Aheadr U Acks(r,R-int(state(r))) (and Vx+i, if it is an add()) form
a majority subset of GpView?+1, thereby allowing p to commit view x + 2. Trouble arises
if Acks(r, R-int(state(r))) (and Vx+i and Vz+2, if both are add() operations) is a majority
subset of GpViewZ+2; neither r nor any process in Acks(r, R-int(state(r))) can know what
value p would propose for view x + 3.
Proposition 5.4 shows that when r is successful and <Vx+i, x + 1 > is a remove() opera-
tion, no previous initiator (like p) can commit a version greater than x + 1. Proposition 5.5
shows that when <Vx+i, x + 1 > is an add() operation, it is possible for p to continue com-
mitting new group views and for r to lag behind p. However, if both are successful for a
given group version, both commit the same value. These propositions address exactly the
situation when it appears S-GMP could violate GMP-2 and GMP-3: when two initiators
are successful for the same group version. Propositions 5.4 and 5.5 prove that even when
initiators do not vie for majorities from among the same set of core members (their local
views differ), s-GMP is safe.
Proposition 5.4 Let r be a reconfiguration initiator with ver(r) = x. Let Aheadr C
Acks(r, R-int(x)) report local version x + 1, and let p be a process from which some mem-
ber of Aheadr received com(< Vx+l, x + 1 >). If Acks(r,R--Hint(x)) is a majority subset of
Localview:r and < Vx+i, x + 1 > is the removal of a set of processes from Gpview:, then p
cannot be successful for any view numbered higher than x + 1.
Proof Let q E Aheadr and let p be as described. Since q received R-int(x) from r after
com(< Vx+i, x + 1 >) from p, it must be that rank(r) ? rank(p), so FAULTYq(p) holds for
every such q in Acks(r, R-int(x)) (by Gossip). As a result, the initiator p can be successful for
x+2 if and only if Acks(r,R-int(x)) is a majority subsetofLocaIviewx?+l --H LocaIview?x?vx+i.
Observe that Localviewxr = Acks(r, R-int(x)) u Acks(r, R-int(x)), and that r cannot
have received ack(R-int(x)) responses from the members of v?+i;iO in other words, Vx+l C
10Reconfigurer r must have received p's proposal to remove v?+1 or else q would not have received r's
25
Acks(r, R-int(x)). Thus p can commit LocaIview?+2 if and only if (Acks(r,R--Hint(x))--Hvx+i)
is a majority of LocaIviewz?+l Let a Acks(r,R-int(x)) and a = Acks(r,R--Hint(x))
Then initiator p is successful for <Vx+2, x + 2> if and only if
a--HVz+lI			=			a--HVx+lI			a--HVz+l			> 1			- --H
LocaIview?z+l I			GPvIewx I --H I Vx+l			+ a --H Vx+l			2			a			Vx+i > OL
contradicting the assumption that Acks(r, R-int(x)) is a majority subset of GpVjewx
Proposition 5.5 Let r be a reconfiguration initiator with ver(r) = x. Let Aheadr be non-
null and let p be a process that sent com(? Vx+1, x + 1 >) to some member of Ahead?. Then
if r is successful for < Vz+1, x + 1 >, then if p later submits < Vx+2, x + 2 >, either r or
but not both, can be successful for version x + 2.
Proof GMP-3 will be violated if p is able to commit <V:+2, x + 2> and r is able to
commit ? --HAcks(r, R-int(x)), x + 2>. We proceed by analyzing the messages arriving at
Vx+1.
(a) Consider Figure 10 (top diagram). The two-headed split-arrow message from r to
Vx+1 represents the two possibilities for the arrival of r's commit message,
m = R-com(< Vz+l,X + 1 >) : M-sub(< --HAcks(r,R-int(x)),x + 2>),
at Vx+1. P'5 commit message, com(< Vx+1, x + 1 >), to Vx+1 is a dashed because of the
possibility that it may not be received. We elide r's Phase II submit message.
Suppose the members of V:+l receive m from r before they receive com(< Vx+1, x + 1 >)
from p. Since r's message gossips its belief in p's faultiness, the members of Vx+1 will never
receive another message from p. In particular the members of Vx+i will not receive p's
subsequent M-sub(< Vx+2,X + 2 >). We say r owns Vx+1.
Using a and a as defined in Proposition 5.4, p is successful for version x + 2 if and only
if a > a + Vz+l and r is successful for version x + 2 if and only if a + Vx+l I > a. Both
conditions cannot hold.
(b) If the processes in Vx+1 receive m from r after com( < Vx+i, x + 1 >) and before
M-sub(< Vx+2, x + 2 >) from p, the analysis is the same as in (a); r owns Vx+1 once the
members of that set receive m.
(c) In the last case (Figure 10 bottom), p owns Vx+l if its M-sub(< Vx+2, x + 2 >) message,
gossiping p's belief in r's faultiness, arrives at Vx+1 before r's R-corn() message does. Then
R-int(x); had r not received and responded to p's proposal, p's commit message to q would have gossiped
fa?Jtyp(r).
26
Vx+1
p
r
V:+l
p
r
com(<v?+i,x+1>)
M-sub(? Vx+2, X t 2>)
R-illt(x)
Cases (a) and (b): r OWll5 Vx+1.
com(<v?+i,xt1>)
<Vx+2,X t 2>)
R-int(x)
Case (c): p 0Wll5 Vx+1.
Figure 10: Case Analysis for Proposition 5,5
27
p is successful for version x + 2 if and only if c + Vx+i > a and r is successful for version
x + 2 if and only if a> a + I Vx+i . Again, both conditions cannot hold. I
it remains to prove that when r learns of two version-identical proposals (how this situa-
tion may arise was described in Section 4, Figure 5), the proposal submitted by the initiator
of least rank is the only one that could have been invisibly committed. That is, r correctly
identifies the initiator and proposal that could have been successful; as a result r cannot
act inconsistently with any invisible commits. Referring to the S-GMP algorithm in the Ap-
pendix this necessity arises in determining v2 when Aheadr # ?, and in determining vi when
Aheadr
Let
Gpviewx?i is the most recent r-defined group view and define Submissionsr(x) to be
the set of proposed next updates for version x that r learns about in response to its R-int()
message: for some initiator z,
Submissionsr(x) = ?Vx ?P e Acks(r,R-int()) : NextUpdate? = [v?,x,rank(i)]?
We first describe the composition of Submissionsr(x), showing that every reconfigurer propos-
ing version x either propagates mgr `s proposal for version x or proposes mgr `s removal.
Proposition 5.6 For all versions x, Submissionsr(x) I < 2.
Proof inspecting procedure DetermineProposal, different submissions for the same view
can arise only from mgr and from a reconfiguration initiator proposing mgr `s removal. The
latter occurs if and only if the initiator did not learn of any outstanding proposal made by
rngr; that is if Submissionsr(x) =
We say Submissionsr(x) is bivatent if it contains two distinct values. Corollary 5.1 follows
by examing procedure DeterminePwposal in the Appendix. it shows that all reconfigurers
either propagate mgr `s unique submission for view x or propose myr `s removal.
Corollary 5.1 Let r and r' be reconfigurers proposing version x. Then if both their
Submissions(x) sets are bivalent, they are identical:
Submissionsr(x) I = I Submissionsr#(x) I = 2) ?
Submissionsr(x) = Submissions?i (x).
28
With these preliminaries we can now prove only one of these two proposals could possibly
have been committed (invisibly or otherwise), and that all reconfigurers can distinguish
which of the two it was. This proposition is vital to the inductive step: it shows that in
going from Gpviewz?i to Gpviewr one and only one value can be committed by any member
of Gpviewx?i as the same value is proposed by any successful initiator for the ?th group
version.
Proposition 5.7 Let r be a reconfiguration initiator. If Acks(r,R-int(state(r))) is a ma-
jority subset of LocaIViewr and Su6missions?(x) is bivalent, then r can distinguish which of
the two values proposed could not have been committed invisibly.
Proof Let r be as described, and let Submissionsr(x) = v, x >, ? v', x >J. Let p be
the process of least rank among those reported to have submitted < v, x >, and let p' be the
process of least rank among those reported to have submitted < v', x>. r must decide which
of the two, p or p', could not have been successful for version x. We show that r chooses
correctly when it is the first bivalent reconfigurer for version x, then prove the proposition
inductively.
In order for either value to have been committed, its initiator must have garnered majority
approval from its local view for the submitted value. Since both p and p' make version x
submissions, both must have local version x --H 1 (Proposition 5.3). Without loss of generality
assume rank(p) <rank(p'), and consider the possible roles p could have had:
a) If p were the mgr, its proposal M-sub(< v, x>) could not have reached a majority
subset of Gpviewx?l; if it had, then p' would have learned of it from some process in
Acks(p', R-int(x --H 1)). Since r is the first bivalent reconfigurer, Submissions?#(x) would
have to be the singleton ?< v, x >1, which P1 would have propagated in Determine-
Proposal.
Thus, <v, x> is not stable because P cannot have been successful for < v, x>. Look-
ing at DetermineProposal, initiator r propagates <v1, x> because it was submitted
by P1, the initiator with least rank among those mentioned in Submissionsr(x).
b) If P # mgr, it is successful for <v, x> if and only if Acks(p, R-sub(< v, x >)) is a
majority subset of Gpviewz?i. Both P and P1 were able to make proposals so their first
response sets were majority subsets. Let A be their intersection:
A = Acks(p,R-int(x --H 1))flAcks(P1,R-int(x --H 1)).
The gossip property and rank(p) < rank(p1) mean that a majority of P'? local view
believe it faulty upon receiving R-int(x --H 1) from p1. Disconnect means that
29
recv??(?,R--Hint(x--H1)) recva(P1,R?int(x?1)) ?faultya(P) Va?A.
The question is whether any a E A receives R-sub(< v, x>) from p, which could only
happen before it receives R-int(x --H 1) from p':
recva(P,R?sub(< v,x>)) H recv??(?1,R-int(x --H 1)) H faultya(p).
Now, any such a would have forwarded <v, x> as part of NextUpdatea to p' when
responding to R-int(x --H 1), in which case either
1. Submissions?#(x) is bivalent, violating the assumption that r is the first bivalent
reconfigurer, or
2. every process in Acks(p', R-int(x --H 1)) reported <v, x > as its next pending up-
date. But in this case, Submissions??(x) would again be the singleton f< v, x
which p' would have propagated in DetermineProposal.
Thus, no a E A received R-sub(< v, x>) from p, meaning the only processes that
could have are those in Acks(p, R-int(x --H 1)) --H A. This cannot be a majority subset of
Gpviewz?i since it is disjoint from Acks(p', R-int(x --H 1)) which is a majority subset.
We have just proven the base case for the proposition --H when r is the first bivalent re-
configurer (it is not hard to see that `first' is meaningful and well-defined in the context
of successful initiators for a given group view). If r's proposal reaches a majority subset
of GpView??1 then the value it propagated will be chosen by the next reconfigurer to
get a majority response to its Reconfiguration Phase I interrogate message as r would
be the submitter with least rank. If r's proposal does not llreach a majority subset,
the next bivalent reconfigurer will nonetheless choose as r did and so propagate the
correct value.
Corollary 5.2 If Gpviewx?l is defined, there is at most one stably-defined proposal for
group version x.
Proof Proposition 5.7 proves that &etStablePrn?osaJ correctly chooses the only proposal
for a given group view that could have been committed invisibly to a reconfiguration initiator
when its Phase I response set is bivalent. When the set is univalent or empty, it is not hard
30
to see that Determ?nePropos&i is safe. If this initiator reaches its commit stage, its proposal
is stably-defined and identical to the other stably defined proposals for version x.
Theorem 5.1 (Identical Local Views) If Gpviewx?i is defined, then all members that
survive to define local version x have identical local x views.
Proof The result follows from Corollary 5.2; no process commits a local view for version
x that differs from any other processes' version x because all proposals that can possibly
reach the commit stage are identical.
Note that Theorem 5.1 implies no temporal constraints on local views, merely that if p
ever defines an xth local view, and if q, too, ever defines an xth local view, then these two
are identical. It does not require LocaIviewz? and Localviewxq to exist together in some global
state. Thus, to prove S-GMP satisfies GMP-3 requires slightly more work.
5.2 Message Complexity
[16] proves S-GMP (with two minor modifications) is message minimal for Strong GMP.
Moreover, s-GMP is also phase-minimal. The message-minimality proof gives the required
direction of information flow as well as the content of each message. In s-GMP the pattern
of required communication is arranged to minimize the length of the message-path from the
beginning of the update algorithm to the end. For example, Figure 11 shows two ways to
organize the distributed event "send a message to every process in S and collect responses
or time-out".
Observe that the Phase I submit message is unnecessary if mgr knows a majority of the
non-faulty outer processes already believe a process, say q, faulty. In this light, a contingent
update, piggy-backed on a commit message, can serve as the submit message for the next
view change. We can thus compress successive instances of Simple s-GMP if mgr makes
known when it multicasts the commit message, exactly how it plans to change the group
view next. In Figure 12, process q' crashes before responding to M-sub(--Hq), causing mgr to
suspect q' faulty. By appending M-sub(--Hq') to M-com(--Hq), mgr indicates that it wants to
remove q from the just-formed group view. Outer processes respond to the piggy-backed
as they would respond to a plain submit message. The correctness
and 5.7 in the previous section require only slight modifications to
commit-submit message
proofs (Propositions 5.5
handle this optimization).
31
p
5
p
5
is the Timeout interval p is willing to wait for any response.
Figure 11: Two possible communication patterns accomplishing "send a message to every
process in 5 and collect responses or time-out."
M-sub(--Hq)
mgr
p
fauttyrngr (q')
M-com(--Hq):M-sub(--Hq')
_______			M-com?--Hq')
crash
Phase I
Compressed Phase II and Phase I
Figure 12: Compressing Successive Instances of Simple S-GMP.
r
32
def
When we can take advantage of compressing phases we gain substantially. Define TLx =
GpView? , and let a? be the number of processes added to GpViewx and r? be the number
of processes removed from Gpviewx. Then Y successive compressed updates (with no re-
configuring) requires and initial nx submit messages, flx --H r? acknowledgement messages,
a handshake of 2a? State-Xfer and ack(State-Xfer) messages, n? --H r? + a? commit
messages. To update the new Gpviewz+i, there are n? --H r? + a? --H r?+i messages to ac-
knowledge M-sub(< Vx+i, x + 1 >), followed again by 2a?+i messages for the State-Xfer-
ack(State-Xfer) handshake, and n? --H r? + a? --H r?+i + a?+i commit messages...
k
n? + ?
+ 2ak + (nz --H k
+
x+Y
(2Y+ 1)nx+ ? ak+2 Z(x+Y--H k)?
where ? = ak --H rk. When we cannot take advantage of piggy-backing, there are
additional messages.
6 Conclusion
x+Y
Ynx + zy+Y--Hk)?
k=x
We have described an approach to the asynchronous system membership problem which
provides very strong distributed consistency guarantees, and yet is inexpensive in comparison
even to less powerful membership services. Current distributed systems lack membership
services, forcing application designers to solve this problem repeatedly through ad-hoc, and
often inconsistent, mechanisms. As technology such as GMP becomes more widely available,
we believe that a major obstacle to reliable distributed software development will have been
removed.
References
[1] A. El Abbadi and 5. Toueg. Maintaining Availability in Partitioned Replicated
Databases. ACM Transactions on Database Systems, 14(2):264--H29O, June 1989.
[2]
Y. Amir, D. Dolev, 5. Kramer, and D. Malki. Membership Algorithms in Broadcast
Domains. In A. Segall and 5. Zaks, editors, Proceedings of the Sixth Workshop on
Distributed Algorithms and Graphs; Israel. Springer-Verlag, 1992. Lecture Notes in
Computer Science.
33
[3]
Y. Amir, D. Dolev, 5. Kramer, and D. Miaki. Transis: A Communication Sub-System
for High Availability. In 22nd Annual International Symposium on Fault-Tolerant Corn-
puting (FTCS), pages 76--H84. IEEE, July 1992.
[4] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery
?n Database Systems. Addison-Wesley, 1987.
[5] K. Birman and T. Joseph. Exploiting Virtual Synchrony in Distributed systems. In
Proceedings of the 11th Symposium on Operating System Principles, November 1987.
[6] K. P. Birman and T. A. Joseph. Reliable Communication in the Presence of Failures.
ACM Transactions on Computer Systems, 5(1):47--H76, February 1987.
[7] M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of
Distributed Systems. A.C.M. Transactions on Computer Systems, 3(1):63--H75, 1985.
[8] B. A. Coan and G. Thomas. Agreeing on a Leader in Real-Time. In Proceedings of the
11th Real-Time Systems Symposium, pages 166--H172, December 1990.
[9]
F. Cristian. Reaching Agreement on Processor Group Membership in Synchronous
Distributed Systems. Technical Report RJ 5964, IBM Almaden Research Center, August
1990. Revised from March, 1988.
[10j M. J. Fischer, N. A. Lynch, and M. 5. Paterson. Impossibility of Distributed Consen-
sus with One Faulty Process. Journal of the Association for Computing Machinery,
32(2):374--H382, April 1985.
[11] L. Lamport. Time, Clocks and the Ordering of Events in a Distributed System. Com-
munications of the A.C.M., 21(7):558--H565, 1978.
[12] K. Marzullo, K. Birman, R. Cooper, and M. Wood. Tools for Distributed Application
Management. Technical Report TR 90-1136, Cornell University, June 1990.
[13] P. M. Melliar-Smith, L. E. Moser, and V. Agrawala. Membership Algorithms for Asyn-
chronous Distributed Systems. In Proceedings of the IEEE 11th International Confer-
ence On Distributed Computing Systems, May 1991.
[14]
5. Mishra, L. L. Peterson, and R. D. Schlichting. A Membership Protocol Based on
Partial Order. In Proceedings of the IEEE International Working Conf on Dependable
Computing for Critical Applications, February 1991.
[15] R. F. Rashid. Threads of a New System. Unix Review, 4:37--H49, August 1986.
[16]
A. M. Ricciardi. Practical Utility of Knowledge-Based Analyses : Optimizations and
Optimality for and Implementation of Asynchronous Fail-Stop Processes. In Fourth
Conference on the Theoretical Aspects of Reasoning About Knowlege. Morgan Kauf-
mann, March 22-25 1992.
[17] A. M. Ricciardi. The Asynchronous Membership Problem. PhD thesis, Cornell Univer-
sity, January 1993.
34
[18]
A. M. Ricciardi, K. P. Birman, and P. Stephenson. The Cost of Order in Asynchronous
Systems. In A. Segall and 5. Zaks, editors, Proceedings of the Sixth Workshop on
Distributed Algorithms and Graphs; Israel. Springer-Verlag, 1992. Lecture Notes in
Computer Science.
[19] R. D. Schlichting and F. B. Schneider. Fail-Stop Processors: An Approach to Designing
Fault-Tolerant Computing Systems. ACM TOCS, 1(3):222--H238, August 1983.
[20] F. B. Schneider. Byzantine Generals in Action: Implementing Fail-Stop Processors.
ACM TOCS, 2(2):145--H154, May 1984.
[21] M. D. Skeen. Crash Recovery in a Distributed Database System. PhD thesis, University
of California at Berkeley, May 1982.
35
7 Appendix: The S-GMP Algorithm
We abbreviate "either add or remove Q" with ?Q. If Q is a set or process identifiers,
Mcas4(Q,m) denotes the compound action vq E Q : (sen?(q,m)). Mcas4(Q,m) is an
indivisible action only in the sense that p does not execute any other events until all messages
are sent; it is not failure-atomic. The message ack(rn) acknowledges receipt of message
m. We do not explicitly show gossiping, or channel-disconnect, but assume these are done
transparently.
Task: mgr
while (true)
repeat
GetUpdate(vl);
until (vi $ nil-id);
Mcastmgr (LocalViewmgr ,M- sub(+vl));
while (vi $ nil-id) /* Compressed algorithm loop. */
forall p ? LocalViewmgr
await either recvmgr (p, ack(M-sub(+vl))) or faultymgr (p);
if (majority of LocalViewmgr didn't respond)
crashmyr;
DoCommit(v1, I); /* Update LocalViewmgr according to +. ?/
GetUpdate(v2);
if (Joining new members)
Mcastmgr (vi, Join: State-Xfer);
forall p' C vi
await either recvmgr (p', ack(Join) : NextUpdate?') or fattltyrngr (pi);
if (NextUpdatev1 $ I)
v2			Nextupdatevi;
Mcastmgr (LocalViewmgr , M-com(Ivl) : M-sub(+v2));
vi H v2;
/? end mgr Task ?/
36
Task: Outer Processes, p
recv?(mgr ,M-sub(+vl));
DoPreCommit(v1, I); /* Mark vi faulty or operational. */
repeat
send?(mgr , ack(M-sub(+vl)));
await either recv?(mgr , M-com(+vl) : M-sub(+v2)) or fau1ty?(mgr);
if (!FAULTYp(mgr))
DoPreCommit(v2);
DoCommit(vl, I);
vi & v2;
else Wait-Reconflguration();
until (vi = nil-id);
/* end Outer Process Task */
Reconfiguration
Let p have local version x --H ?. For Reconfiguring, we use the following variables:
o+ NextUpdate? is a tuple of the form [? v, x >, i]p, where < v, x > is the value p is waiting
to commit to form Locaiviewx?, and rank(?) is the rank (in LocaIView??--H1) of the initiator
that submitted <v, x>. When p receives a submission it changes NextUpdate? to
reflect the value proposed and the initiator proposing it.
o+ LastCommit? is value p committed to form LocaIviewx??l.
o+ state(p) is the triple [ver(p), NextUpdate?, LastCommit?].
o+ Aheadr is the set values reported committed for versions numbered greater than ver(r).
Initiator r receives these values in response to its R-int(state(r)) message. In actuality,
the only reported version in Aheadrcan be ver(r) t 1.
o+ SubCurrentr is the set of NextUpdate values r receives with proposed versions equal to
ver(r) + 1; SubAheadr is the set with proposed versions greater than ver(r) t 1.
37
Task:ReconfigurationInitiator,r,withver(r)=x
Mcas4(LocalViewr, R-int(state(r)));
forall p E LocalViewr
await either recvr(p, state(p)) or faultyr(p);
if (majority of LocalViewr didn't respond) crash?;
/* Determine the value and version to submit from the responses received. ?/
DetermineProposal(vl, ver, v2);
DoPreCommit(v1);
Mcas?(LocaIViewr,R-sub(< ::::vl,ver>));
forall p ? LocaIView?
await either recvr(p, ack(R-sub(< Ivi, ver>))) or faullyr(p);
if (majority of LocalViewr didn't respond) crasI4;
DoCommit(v?);
if (Joining new members)
Mcas4(vl, Join: State-Xfer);
forall p' E vi
await either recvr(vl, ack(Join) : NextUpdate?') or faultyr(p');
if (Nextupdatev1 # I)
v2 & NextUpdate?i;
Mcas4(LocalViewr, R-com(< +vl, ver>) . R-sub(+v2));
mgr, vi & r, v2;
Begin myr Task;
38
Task:OuterReconfiguration,p
recv?(r, R--Hint(state())r);
if (rank(p) > rank(r))
crash?
/* Catch up to r if necessary
if (ver(p) <ver(r))
DoCommit(LastCommitr);
state(p)			state(r);
*1
sen?(r, state(p));
await either recv?(r, R-sub(< Ivi, ver>)) or fauUy?(r);
if (not FAULTYp(r))
DoPreCommit(vl);
sen?(r, ?ck(R-sub(< +vl, ver
await either recv?(r,R-com(? Ivl,ver >) : M-sub(1v2)) or fanfly?(r);
if (not FAULTYp(r))
DoCommit(vl);
mgr, vi & r, v2;
else Wait-Reconfiguration();
else Wait-Reconfiguration();
39
/* Sets parameters proposal, version, and invisible. Let ver(r) = x, */
Procedure: DetermineProposal(OUT <proposal, version >, OUT invisible);
Aheadr & f[Ids,ver(p)]? ver(p) = (x + 1));
SubAheadr & ([ids,ver(p) + 1,rank(init)]p ver(p) = (x + 1));
SubCurrentr & f[Ids,ver(p) + 1,rank(init)]? I ver(p) = xl
if (Aheadr$ $)
/* Partially committed version x + 1. */
proposal & Aheadr;
GetStableProposal(invisible, Su bAheadr);
verswn & x + 1;
return();
/* All respondents report the same local version. */
verston & x + 1;
if (SubCurrentr is empty)
proposal & <--Hmgr,x+1>;
GetUpdate(invisible);
return();
if (SubCurrent? is a singleton)
proposal & SubCurrentr;
Getupdate(invisible);
return();
/* SubCurrentr has two elements. */
GetStableProposal(proposal, SubCurrentr);
Getupdate(invisible);
return();
1* update-set has no more than two elements. */
Procedure: GetStableProposal(OUT <val, ver>, IN update-set)
< val, ver> & the element of update-set with the lowest ranked initiator.
return();
40
