BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1339
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Virtually-Synchronous Communication Based on a Weak Failure Suspector
AUTHOR:: Schiper, Andre 
AUTHOR:: Ricciardi, Aleta M.
DATE:: April 1993
PAGES:: 22
ABSTRACT::
Failure detectors (or, more accurately, Failure Suspectors - FS) appear to be 
a fundamental service upon which to build fault-tolerant, distributed 
applications. This paper shows that a FS with very weak semantics (i.e. that 
delivers failure and recovery information in no specific order) suffices to 
implement virtually-synchronous communication (VSC) in an asynchronous system 
subject to process crash failures and network partitions. The VSC paradigm is 
particularly useful in asynchronous systems and greatly simplifies building 
fault-tolerant applications that mask failures by replicating processes. We 
suggest a three-component architecture to implement virtually-synchronous 
communication : 1) at the lowest level, the FS component; on top of it, 
2a) a component that defines new views, and 2b) a component that reliably 
multicasts messages within a view. The issues covered in this paper also lead 
to a better understanding of the various membership service semantics 
proposed in recent literature.
END:: CORNELLCS//TR93-1339
BODY::
Virtually-Synchronous Communication
Based on a Weak Failure Suspector
Andre Schiper
Aleta Ricciardi*
TR 93-1339
April 1993
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
*The first author is on leave from Ecole Polytechnique Federale de Lausanne,
Switzerland. His research is supported by the "Fonds national suisse" under contract
number 21-32210.91, as part of the European ESPRlT Basic Research Project
Number 6360 (BROADCAST). The second is supported by DARPA/NASA Ames
Grant NAG 2-593, and by grants from IBM and Siemens Corporation.
Virtually-Syncliroiious Commuiiication
Based on a Weak Failure Suspector
Andre' Scliiper, Aleta Ricciardi*
Department of Computer Science, Upson Hall
Cornell University
Itliaca, NY 14853-7501
Apn'l 28,1993
Abstract
Failure detectors (or, more accurately Pailure Suspectors FS) ap-
pear to be a fundamental service upon which to build fault-tolerant,
distributed applications. This paper shows that a FS with very weak
semantics (i.e. that delivers failure and recovery information in no
specific order) suffices to implement t?irtua1iy-synchronous cornrnvnz-
cation (VSC) in an asynclironous system subject to process crash fail-
ures and network partitions. The VSC paradigm is particularly useful
in asynchronous systems and greatly simplifies building fault-tolerant
applications that mask failures by replicating processes. We suggest
a three-component architecture to implement virtually-synchronous
communication : 1) at the lowest level, the FS component; on top of
it, 2a) a component that defines new views, and 2b) a component that
reliably multicasts messages within a view. The issues covered iii this
paper also lead to a better understanding of the various membership
service semantics proposed in recent literature,
The first author is on leave from Ecole Polytechnique Fe'de'rale de Lausanne Switzer-
land. His research is supported by the 4Fonds national suisse" under contract number
21-32210.91, as part of the European ESPRIT Basic Research Project Number 6360
(BROADCAST).The second is supported by DARPA/NASA Ames Grant NAG 2-593,
and by grants from IBM and Siemens Corporation.
1 Introduction
There have recently been several papers about membership services in asynchronous sys-
tems [2, 12, 13,17, 18, 19, 20]. A membership service is responsible for giving each process
(consistent) information about the operational processes in the system. A process calls this
information its view of the system processes. A membership service typically reacts to process
crashes or recoveries, leading it to define a set of views. The membership services mentioned
vary according to the underlying failure model considered, as well as the properties they
provide with respect to the set of views delivered to each process: (e.g. whether another
view may exist simultaneously, the degree of agreement among members):
o+ [17, 18] consider processes with crash failure semantics, excluding network partitions.
o+ [19, 20] consider systems in which processes may crash and the network may partition.
llowever, despite network partitions, this membership service defines only majority
v'ews --H a unique, totally-ordered sequence of views. Such a membership service is said
to have linear semantics.
o+ The membership services described in [1, 2, 13] consider the same failure scenario as
above, but only define a partial order on the views. That is, if the system is partitioned
in two (or more) subuetworks then two (or more) views, one in each subnetwork, may
exist concurrently.
Concurrent views offer an interesting extension to membership services, and force us to
consider a further semantic distinction based on whether concurrent views are permitted to
intersect. If two concurrent views may overlap, we say the membership service semantics
are weak-partial, if they may not we say the semantics are strong-partial. Among those
that permit concurrent views, [2] appears to be a strong-partial membership service. [13]
considers both strong-partial and weak-partial membership services, and [1] and [12] consider
only weak-partial membership service. These variants raise a new, pertinent question: when
is a strong-partial service required, and when does a weak-partial membership service suffice.
The objective of this paper is to suggest an answer to this question, by showing that a strong-
partial membership service is intimately related to virtually-synchronous communication. We
do not discuss when a linear membership service is required.
The idea of virtually-synchronous communication (VSC) was first introduced by Isis [3, 4].
VSC can be understood as rule for ordering message deliveries (reliable multicasts) with
respect to view changes (received from the membership service). We give a precise definition
2
for VSC in Section 5.4. VSC defines a powerful model for building fault-tolerant processes
that mask failures by replication. It has also been argued [5] that ordering message deliveries
consistently around process failures and recoveries is a fundamental part of any distributed
computation; thus VSC is a vital primitive for inherently-distributed programming. Relat-
edly, many common distributed applications are more easily understood and solved if they
can make use of VSC [21]. Finally, if the VSC abstraction we define in this paper is aug-
mented with a majority requirement, [22] shows it is a powerful modd in which transaction
commit is easily (albeit probabilistically) implemented. Understanding that the VSC ab-
straction is more basic than the transaction abstraction gives broader insight to the problem
of building fault-tolerant applications. However, we note that solving VSC is not equivalent
to solving consensus [10].
Traditionally virtually-synchronous communication has been implemented with a two com-
ponent architecture: a membership service, and on top of it, multicast component. However,
understanding the relationship between a membership service and virtually-synchronous
communication has lead us to consider a three-component architecture, with (1) a Fail-
ure Suspector component FS delivering information about the communication topology, (2)
a View Component VC defining views, and (3) a Multicasi Component MC implementing
virtually-synchronous communication. We divide the functionality of the traditional mem-
bership service between our FS and VC components.
In addition to increasing our understanding of the relationship between any membership
service and virtually-synchronous communication, this architecture allowed us to specify
precisely the FS semantics needed to guarantee VC and MC liveness. One weakness of
previous work in this area has been a lack of precise semantics for the FS part of the system.
Explicitly, the paper shows:
o+ that virtually-synchronous communication satisfying the definition given in Section 5.4
can be implemented with a modular, three-component architecture for system models
with both process crash failures and network partitions (i.e. link failures). We start
with a very simple model, and from it construct a useful communication primitive for
fault-tolerant, distributed applications.
o+ how to define concurrent views that have empty intersections. That is, how to imple-
ment strong-partial membership semantics in a system that may partition. The basic
idea is to define a view as a set of pairs (proc id, proc sequence number).
3
that if we remove the MC component from the architecture (e.g. if virtually-synchronous
communication is not needed), then the view component defines views that do not
satisfy the empty intersection condition (i.e. giving a membership service with a weak-
partial semantics).
Section 2 describes our low-level system model and the interaction of the three components.
Section 3 gives a precise semantics for the failure suspector. Sections 4 and 5 sketch how to
implement the VCp and MCp components, and Section 6 completes the VCp and MCp protocols.
We conclude in Section 7.
2 System Model
Our low-level system model consists of an infinite name space of process identifiers, Proc --H
[?i, .2...., ?. The name space is infinite to model infinite executions in which processes
continually fail and recover. At any point in time, however, there are only a finite number of
executing processes under consideration and we restrict our attention to these. For this finite
set of executing processes, we assume a completely-connected network of FIFO channels.
Processes communicate by passing messages over these channels, though they too may fail.
The system has no global clock, and message transmission delays are unbounded. Processes
fail by crashing, which we model by the local event crash?. We model the recovery of a
process with a new identifier. A process p may (1) send a message to another process, (2)
deliver a message sent by another process q, and (3) perform local computation.
A history, hp, for process p is a sequence of events beginning with the event star4 and
terminating, if at all, with the event crash?: hp = star4 e?1 e?k, for 0 < k. A cut is an
n-tuple of process histories, one for each p E Proc. We assume familiarity with inter-event
causality [15] and with consistent cuts [8].
Crash failures are surprisingly difficult to handle in an asynchronous system. Fischer,
et.al [10] show that, because it is impossible to distinguish a crashed process from one
that is just very slow, any problem requiring "all correct processes" to agree on some value
cannot be solved deterministically; that is, no deterministic protocol can make progress if it
must also make accurate process failure detections. One way around this is for asynchronous
systems to incorporate some mechanism for suspecting failures, as well as a means of han-
dling failure suspicions consistently (e.g. p may suspect q faulty while r may not; perhaps
r and/or q even suspect p). Our system model assumes a failure suspector that eventually
4
new View?()
VCp			_______________			MCp
view termination
not-comm(q)
FSp			not-comm(q)
Figure 1: FSp, MCp, and VOp interaction for virtually-synchronous communication.
suspects a crashed process,1 which suffices to ensure our protocols make progress. We do
not require anything more of the failure suspector.
Each process has three components that interact to implement the virtually-synchronous
communication primitive for application-layer processes (Figure 1). The Failure Suspeclor
(FSp) is at the lowest level and notifies both the Multicast Component (MCp), and the View
Component (VCp) about suspected changes in the communication topology. Such changes
arise from actual process and link failures, as well as high processor loads and heavy net-
work traffic (indistinguishable from true failures). VCp defines p's current view, View?(), an
approximation of the set of processes with which p can communicate, and sends View?() to
MCp. MOp is responsible for reliably multicasting application-layer messages until it receives
an accessibility-change notification from FSp. These notifications signal a suspected change
in the communication topology and the attendant need to alter View?(). However, neither
MCp nor VOp can do this naively since virtually-synchronous communication requires that
members of View?() that also accompany p to its next view receive the same set of messages
that were multicast within View?() (We make this definition precise in Section 5). To ensure
this, MOp delivers all outstanding multicasts, and does not issue new multicasts except to
forward those that have been only partially delivered. View?() is safely terminated when all
multicast in it are delivered at all sites that MC believes non-faulty. When MOp
messages			p
detects this condition (Section 4) it informs VOp, which then determines a new view for MOp
from the accessibility notifications it received from FSp.
Section 3 describes the properties our Failure Suspector components must satisfy. These are
weak yet reasonable requirements, and are easily implemented in any asynchronous system.
Section 4 discusses VCp, and Section 5 discusses MOp. These components execute protocols
1This can easily be implemented with time-outs
5
to detect global properties [8,16].
3 The Failure Suspector
Given process P, FSp emits a sequence of not-comm(q) and comm(r) suspicion messages to MCp
and VCp. Since the system is asynchronous we cannot guarantee the accuracy or timeliness of
these suspicions; the most we can require is that FSp eventually suspects true crashes and re-
coveries. This is not unreasonable. It is known that fault-tolerant protocols in asynchronous
systems cannot make progress if they are required to make accurate failure determinations.
Our approach introduces an inaccurate failure suspector to gain liveness. On the other hand,
we cannot require FSp to suspect all periods of transient inaccessibility --H a network partition
may repair before it is noticed.
Since, in theory, FSp may suspect processes arbitrarily, we have divorced FSp implementation
from the problem at hand. In a real system, FSp might take cues from the underlying
communication layer, the operating system, response delays, and so forth.2
On every consistent cut c, FSp maintains two non-intersecting sets, CommSet?(c) and NotCommSet?(c).
When FSp suspects q E CommSet?(c), q is removed from CommSet?(c) and is thereafter a
member of NotCommSet?(c). Whenever these sets change, FSp notifies VCp and MCp by
emitting the appropriate comm() or not-comm() messages.
We have a reciprocity condition for (perceived) partitions, as well. To model the nature of
network partitions, we require eventual reciprocity of inaccessibility suspicions. That is, if
FSp suspects q then eventually either FSq suspects p or q fails.
A logical formula holds on a consistent cut. The membership of an indexical set of processes
depends on when it is considered. In our model, `when' translates to consistent cuts, the only
physically-realizable instances. We use the following formulas and indexical sets to specify
the behavior of FSp.
o+ NoTCoMM?(q) holds on c if q E NotCommSet?(c)
o+ CoMM?(q) holds along c if q ? CommSet?(c)
o+ DOWNq holds along c = (h1,. . . , hq,. . . , hn) if crashq is the last event in hq
2For example, to detect failures FSp could query a process, deeming it inaccessible if it does not repond
in a timely fashion (inaccurate, but satisfying the requirement). We might put the onus on a process to
announce its recovery.
6
o+ UPqholdsalongc=(hi,.. . , hq,. . . , hn) if crashq is not an event in hq.
Non-triviality Conditions for FSp
Crashes If q crashes, then eventually either p crashes or FSp suspects q is unreachable:
DOWNq ? ?(NoTCoMMp(q) V DOWNp)
Recoveries If q begins executing and is reachable, then eventually either p crashes or FSp
suspects q is reachable:
UPq ? ?(co?Mp(q) V DOWNp)
Reciprocity If FSp suspects qis inaccessible, then, if qdoes not crash, it eventually suspects
p is inaccessible:
NOTCOMMp(q) ? ?(DowNq v NOTCOMMq(P))
This is an artifact of p suspecting q: since pceases communicating with q,pis, in fact,
inaccessible to q.
Propagation Conditions for FSp
Finally, we require failure suspectors to gossip among themselves.
Inaccessibility Propagation If FS? believes, on cut c, it cannot communicate with q then
it tries to propagate this belief to every FSr for r E CommSet?(c):
NoTCoMM?(q) ? ?(NoTCoMMr(q) v NOTCoMMr(P))
Accessibility Propagation If FSp believes, along c, it can communicate with q then it
tries to propagate this belief to every FSr for r E CommSet?(c):
CoMM?(q) ? ?(CoMMr(q) v NOTCOMMr(P))
7
3.1 Related work
Before discussing the other components, we discuss the relation between this and other work.
In [7], Chandra and Toueg solve Distributed Consensus in an asynchronous system using a
Failure Suspector, W, that satisfies certain (weak) requirements. [6] further shows that W
is the weakest suspector that can be used to solve Distributed Consensus. While we do not
consider consensus in this paper we said in the Introduction that adding a majority require-
ment to the VSC abstraction, gives a simple, probabilistic solution to transaction commit.
Since there are no fundamental differences between solving consensus and atomic commit
problem, how are both approaches related (we will not, hereafter, distinguish consensus from
atomic commit)?
First it should be clear that our Failure Suspector is not weaker than W. More important,
[7] also places a majority requirement on processes before W can be used to solve consensus.
To relate the two approaches, consider a generalization of consensus:
o+ suppose consensus is to be solved more than once, and let consensus(?), for i > 0, be
the jth instance of the consensus problem;
o+ let Proc be the initial set of processes that solve consensus(1);
o+ consensus(i t 1) begins only after consensus(?) has been solved;
o+ for consensus(i), i > 1, the processes chose their initial state randomly from the set
In [7], consensus(i) (for each i) would be solved by the same static set of processes Proc. The
majority requirement to solve consensus(?) is thus similar to a static voting scheme in the
context of handling replicated data [11]. This is because [7] consider that failure suspicions
are never stable: a process p believing faited(q) can always change its mind.
In contrast, in the VSC model, failure beliefs are stable each time a new view is defined.
Thus for i ? j, consensus(i) and consensus(j) need not be solved by the same set of
processes. Continuing the replicated data analogy, the majority requirement in the VSC
model is similar to the dynamic voting scheme [9], which has been shown to lead to higher
data availability than the static voting scheme.
8
4 The View Component
The view component operates whenever a link failure repairs, a process begins executing,
recovers after a crash, and whenever the multicast component informs it that the current
view has terminated (Section 5). VCp defines p's current view by interaction with other vc
components, and by using FSp information.
VCp defines a new view when it detects (or learns about through some other vc component)
agreement on CommSet?() among the members of CommSet?(). The new view will be the
largest subset of processes (containing p) satisfying this agreement.
4.1 The View Component Algorithm
In this section we outline how VCp detects or learns about CommSet,,() agreement.
When VCp is activated, it knows a near approximation of CommSet?() from FSp.3 Whenever
VCp receives an comm(r) message from FSp, it updates this approximation. Along cut c, VCp
uses a deterministic function, vc-Coord(p),4 on the set CommSet?(c) which returns a unique
process identifier, and satisfies
(Commsetp(c) = Commsetq(c)) ? (vc-Goord(p) = vc-Coord(q))
For example, vc-Coord(p) might be "choose the `smallest' identifier from CommSet?(c)."
Each process also maintains a local counter, seq?, which is incremented every time VCp con-
siders vc-Coord(p) to have changed (this is not necessarily every time CommSet?(c) changes.
For liveness, however, vc-Coord(p) must change when VCp receives not-comm(vc-Coord(p))
from FSp). The counter 8eq? is initially zero and is essential in allowing us to define non-
intersecting, concurrent views. The tuple (p, seq?) fully describes p on any consistent cut.
Finally, the formula CoMMSETEQ(S) holds on c if and only if all p Ei S have identical
CommSet() sets at c. That is,
CoMMSETEQ(S) %ef
A (CommSetp() = Commsetq())
p,qES
3There may be notifications from Fsp that have not yet reached vc?.
4Technically, we should name some cut explicitly since the function's value depends p's indexical can-
communicate-with set. We omit the cut reference, but with the understanding that vc-Coord(p) has a
temporal dependence. In fact p never knows which particular cut it is on, but at any point in its execution
vc? has some set of process identifiers that satisfy a certain condition. It determines a coordinator by
applying some rule to this set. The presence of c would only clarify matters for the omniscient reasoner.
9
In our protocol, each p sends its current CommSet?() and current seq? number to vc-Coord(p)
every time CommSet?() changes.
4.2 Defining the New View
Let `c = vc-Coord(p), and S = CommSet?(c) for some cut c. Then VCN receives CommSet?()
for p ? S. Whenever it receives a different CommSet?() from some p, VC? discards the
previous one and checks whether CoMMSETEQ(CommSet?()) holds. If it does, VCh sets the
new view, View?(), to
View?() = V = ?(p, seq?) p E CommSet?()?
(t)
The coordinator tc then sends the new view to each VCp (for p ? V) which then delivers the
view to MCp. MCp regains execution control and begins multicasting again. Unfortunately,
as CoMMSETEQ(CommSet?()) is not a stable property (i.e. once true, forever true) we must
take care in announcing the new view. We return to this issue in Section 6.
4.3 The Partial Order
Correctuess of vc means that the coordinator successfully sends the new view to the vc
components of all reachable members in the new view. We will henceforth use V to denote
the (local) view that is agreed-upon by all the members of V
Since process histories are linear, it makes sense to talk about the x?h version of a process's
(local) view - we denote this by view?x.
Definition Given two agreement views V and V' V?1V' if and only if there is some p in
V n V' such that V = vIew?x, and V' = View??+1. The transitive closure of ?i is denoted
?. I
It is not hard to see that the views defined by the collection of vc? components are partially
ordered by ?. We say V and V' are concu?ent if and only if they are not ?-related.
Proposition 4.1 trivially follows from the definition of views (Equation 1) and the increment
rule for seq?.
Proposition 4.1 Let V and V' be concur??nt mews. Then Vfl V'
10
5 The Multicast Component
The Multicast Component of process p, MCp, is responsible for implementing virtually-
synchronous communication. MCp operates in two modes. In one mode it multicasts messages
to the members of its current view View?(). In the other mode, it flushes outstanding multi-
casts to ensure they satisfy virtually-synchronous communication semantics, then terminates
the current view. The transition from multicast mode to termination mode is triggered by
any FSp not-comm() or comm() message. In this section, we define VSC semantics and the
protocols MCp uses.
5.1 Definitions
Informally, virtually-synchronous communication is such that, for any view V, the processes
of view V that mutually believe each other alive deliver the same set of multicasts.5 To make
the definition of VSC precise we need to define formally the set of messages considered to
have been multicast in V, as well as the subset of processes that deliver them.
Definition Given a view V, message m is a V-multicast if it was sent by some p along a
cut c such that View?(c) = V. 1
Definition (VSC) Let V?1V'. Then communication in a system is virtually-synchronous
if and only if all processes in V and in V' delivered the same set of V-multicasts. Moreover
no message is delivered in more than one view. 1
It is important to notice that process sequence numbers are not used in the definition.
These are low-level pieces of information; the application layer should only be concerned
with process identifiers. For an application-layer process, VSC ensures two processes that
if they progress together from one view to another, then they delivered the same set of
messages in the first view. As a result, if process state is determined by an initial state and
the set of multicasts delivered to the process, VSC means that if processes begin executing
in view V in the same state, then switch together to view V', they will begin executing in
V' in the same state.
5For simplicity, we omit other forms of communication. Non-multicast communications do not introduce
new problems.
11
5.2 Two Modes of Operation
The component MCp operates in two modes:
1. in normal mode MCp reliably multicasts messages issued by the application layer of p,
and delivers to the application layer multicasts it receives from other Mcs;
2. in view-termination mode MOp does not multicast new messages; instead it attempts
to flush outstanding multicasts to ensure the VSC semantics.
After receiving a view from VOp, MOp is in normal mode. It enters view-termination mode as
soon as it receives any (in)accessibility notification from FSp. When view-termination mode
ends, MOp gives control back to VOp. MOp is inactive until it receives a new view from VOp,
whereupon MOp begins normal mode again.
5.3 MCp Normal Mode
Suppose VCp defines a view V = View?() and delivers this to MOp. Recall that views are sets
of tuples, which we call process signatures:
View?() = fcrq = (q,seqq)?.
Upon receiving View?(), MOp enters normal mode, in which it multicasts and delivers mes-
sages. Each message m issued by the application layer of process p is multicast by MOp to
all q Ei V Before issuing the message, MCp adds a? to m. Let sender(m) be the signature
of the process from which m originated.
When MOp receives a message the following sequence of events occurs:
1. MOp delivers ni (to the application layer) if sender(m) E V, and discards m otherwise;
2. MOp also buffers any message it receives and delivers in V until it knows all other
processes in V have received ??6 When m is received by all processes in V we say it
is stable.
By delivering only V-multicasts, the normal mode ensures that no multicast can be delivered
in more than one view (see the VSC definition).
6There are many standard ways of achieving this --H e.g. piggybacking information on messages.
12
5.4 MCp View-Termination Mode
Consider a view V = View?(). Component MOp switches from normal mode to view-
termination mode after receiving from FSp either 1) not-comm(q) for q e View?(), or 2)
comm(r) for r ? View?(). This is because whenever a change in the communication topology
is detected a new view must be defined reflecting that change. liowever, before defining a
new view, MC in view-termination mode must ensure the VSC definition is satisfied.
Once MOp enters view-termination mode, it need only consider relevant not-comm() events
from FSp to terminate V. Thus, while executing in view-termination mode, MOp builds its
own approximation of NotCommSet?(). This means failure notifications have a permanent
effect until view-termination mode ends: comm(q) received by MOp in view-termination mode
after not-comm(q) (for example due to a partition) cannot undo the not-comm(q) information.
Just as a new view for p is defined according to agreement on CommSet()s, successfully
terminating V involves partitioning V according to NotCommSet() agreement.
Definition The indexical set Survives?(V) is V minus the set of processes MOp believes failed
in V
Survives?(V) = V --H f(q, seqq) I NOTCOMMp(q)?
Before we can explain how to ensure VSC, we need the following data structures.
Definition Consider V = View?() and consistent cut c. The vector msg?(V c) (of size I V I)
is defined such that:
o+ its p?h component, msg?(V c)[p], is the number of V-multicasts that originated from p
(up to c);
o+ for q ? V, q # p, its qth component, msg?(V c)[q], is the number of V-multicasts MCp
delivered up to c that originated from q. I
Definition (View Terminated) Consider view V and 5 such that ? ? 5 C Ids(V) (where
Ids(V) is the set of process identifiers appearing in V). Then vT(V, 5) holds along cut c if
and only if
?,)5((ms9?(Vc) = msgq(Vc)) A (Survivesp(Vc) = Survivesq(Vc)))
It is not hard to see 5 = Ids(5urvives?(V)). I
13
In other words vT(V 5) is true exactly when the processes in 5 agree on both the messages
multicast in V and on their respective Survives(V) sets. For MCp, detecting termination of
V = View?() is thus reduced to detecting vT(V, 5) (for p Ei 5 C Ids(V)).
Having detected vT(V 5), whether 5 = Ids(V) or 5 C Ids(V) is important in determining
the new view. In the first case, whatever view, V', VCp later defines, VSC is satisfied with
respect to the pair (V V'). In the second case MCp must pass 5urvives?(V) to vc?; we will
want the new view to be a subset of 5urvives?(V).
To guarantee that every non-crashed process in V eventually detects vT(V 5) for some 5,
MCp behaves as follows in view-termination mode:
o+ it stops multicasting new messages;7
o+ it rejects any message m such that sender(m) ? 5urvives?(V).
o+ upon receiving not-comm(q) from MCp (for q E V), MCp signs and forwards any V
multicasts originating from q that are still in p's buffer (Section 5.3). MCp then removes
these messages from its buffer. MCq rejects the re-issued message if NOTCOMMq(P)
holds (i.e. if MCq has received not-comm(p) from FSq).8
Proposition 5.1 Consider view-termination mode as described above. Then for eachp Ei V,
there exists a set, 5? such that p Ei 5? and vT(V 5?) holds.
PROOF (sketch) We introduce the following notation:
o+ vT1(V,5) def Ap,q?5m5gp(V) = msgq(V)
o+ vT2(V5) %ef Ap,q?55UrVVeSp(V) = Survivesq(V))
Consider p Ei V. We build a sequence 5pO, ,5?, . 5pfl? where Vi, p E 5p? and 5p? C Ids(V),
such that finally vT(V 5pfl) holds. Initially take = Ids(V). The proof ends as soon as
vT(V 5p?) holds, for some i. If not, then vT1(V 5p?) or vT2(V 5p?) does not hold. We obtain
5p?+i from 5p? by removing a process (if necessary). Because (1) S?0 is finite, (2) the number of
messages sent in a view is finite once view-terminaton mode is started (processes do not issue
7If the network were a broadcast domain, MCp could continue multicasting using a new signature (p, seq? +
1). The problem for less general environments is that the new multicast view (destination set) is not yet
known.
8Duplicate messages are recognized and discarded as usual.
14
new multicasts in this mode), and (3) vT(V, fpj) is trivially true, the construction finally
ends with 5pfl such that vT(V? 5pfl) holds. We briefly discuss the proof reasoning for the case
when either vT1(V? ?t) or vT2(V, 5p?) does not hold.
(a) If vT1(V, 5p?) does not hold, then the message set of some q in 5p? differs from p's, in some
component:aq, r Ei 5p? : (msg?(V)[r] # msgq(V)[r]). If eventually these sets become equal,
then take 5p?+1 = Spt If not (i.e., msg?(V)fr] never equals msgq(V)[r]), then either DOWNr,
or NOTCOMMr(P), or NOTCOMMr(q) holds. So suppose NOTCOMMr(p) holds (analogous
arguments hold for NOTCOMMr(q) and DOWNr). Then eventually NoTCoMM?(r) holds
(from FSp Reciprocity). The Reissuing rule in view-terminaion mode means that p will
forward to q all messages it received from r that q did not. llowever, since the message
sets never agree this transfer will not succeed completely before NOTCOMMq(p) eventually
holds. Reciprocity ensures that NOTCOMMp(q) holds, and at this point we define 5p?+1 to be
--H ?qJ.
(b) If vT2(V, S\)) does not hold, then there is some q in 5p? such that Survives?(V) #
Survivesq(V). Without loss of generality let r Ei Survivesg(V) --H Survives?(V). Then In-
accessibility Propagation and Reciprocity mean that eventually either NOTCOMMq(r), or
NoTCoMM?(q) holds. In the first case 5p?+1 to be 5p? --H ?q); in the second case, take
5i+i = St. I
p p
5.5 An Algorithm to Detect vT(V, S)
Like the VCp algorithm detecting CoMMSETEQ(), the MCp algorithm detecting vT(V, S?)
relies on a coordinator process. MCp determines its view-termination coordinator with a
deterministic function, mc-Coord(p), on the set Survives?(V? c). We require that for p and q
in V, with identical Survives(V) sets, mc-Coord(p) = mc-Coord(q).
Let x = mc-Coord(p). Then x attempts to detect vT(V? Survives?(V)). MCp also increments
the sequence number counter, seq?, whenever MCp considers mc-Coord(p) to have changed (for
liveness, the function mc-Coord(p) must change whenever MCp receives not-comm(mc-Coor?i(p))
from FSp).
Process p sends msg?(V), Survives?(V), and seq? to mc-Coord(p) when MCp first considers
mc-Coord(p) to be its coordinator, and whenever msg?(V) and Survives?(V) are modified.
Ifx = mc-Coord(p), then:
A (ms?x(V) = msg?(V) A Survivesx(V) = Survivesp(V))
pESU rvives?(v)
15
vT(VSurvives(V)) ?
Proposition 5.2 Consider a view V, with p E V and the view-termination protocol de-
scribed above. Then eventually, either p crashes or it detects vT(V Survivesx(V)).
PROOF (sketch) The proof is similar to that of Proposition 5.1. Here, we consider the
perspective of ? = mc-Coord(p). The problem is that, due to transmission delays, x may
not detect vT(V? Survives?(V)) as soon as it holds (transmission of msg?(V) and Survives?(V)
from p to x).
There are two cases: eventually ? receives the messages enabling it to detect vT(x, Survivesx(V)),
or failures prevent ? from detecting it. In the second case, if both COMMp(X) and COMM?(p)
hold, we can use the iterative construction, from the perspective of ?, in the proof of Propo-
sition 5.1 Otherwise we must consider the iterative construction with respect to ?`. the
coordinator replacing ? once it is no longer a member of Survives?(V). 1
Finally, the fact that vT(V? S) is not stable poses the same problems as those posed by
COMMSETEQ()'s instability. We consider both in the next section.
6 Instability of CoMMSETEQ() and vT(V, 5)
As described in the previous sections, once VCp learns COMMSETEQ(CommSetp()) it switches
control to Mc?; switching control from MCp to VCp is based on detecting VT(VIewp(), S).
In both cases, the relevant property is not stable --H it may become false after holding
along some cut. Let switch(vC, V') be the message announcing the new view, V', and
switch(MC, Survives()) be the message announcing termination of view V
Since neither COMMSETEQ(S) nor vT(V? S) are stable properties, we can arrive at the fol-
lowing situation9
o+ Take p, q Ei V such that p and q believe each other accessible, and let `cbe their mutual
vc coordinator (? = vc-Coord(p) = vc-Coord(q)). Suppose vc? determines the new
view, V' (?, p, q E V'), sends switch(vC, V') to p only, and then crashes. VCp, upon
receiving switch(vC, V'), adopts View?() = V' and switches control to MCp in normal
mode.
o+ Now suppose that in addition to VCq not getting switch(Vc, V'), FSq notifies VCq that
Ic is inaccessible; q continues executing in VCq waiting for some new coordinator ?` to
9While we illustrate instability with CoMMSETEQ() and the switch from vc? to MOp, a similar situation
arises for vT(V? S) as well.
16
inform it of the new view. In particular, suppose Ic =
o+ Since p and q continue to believe each other accessible, FSq gossips not-coinii(?) to FSp.
At this point, MCp enters view-termination mode for view View?() = V', and q is still
executing in VCq waiting to receive the successor view to V. Observe that unless one
of the processes crashes or a network partition splits them, p and q need never believe
each other inaccessible.
o+ For VCq to make progress, its coordinator VCp must tell it some new view. Unfor-
tunately, VOp cannot begin executing until MCp leaves view-termination mode. MCp
cannot leave view-termination mode until it receives Survivesq() from MCq (after all,
q E V' and q E CommSet?()). In other words, p and q are deadlocked because their
execution controls are out of phase. The control discrepancy prevents either one (VCq
or MOp) from making progress until one of them believes the other inaccessible --H q is
stuck in VOq, and p is stuck in MOp.
While processes being out of phase is not always destructive, and in fact is quite natural
whenever partitions occur, it is destructive in this case since it induces deadlock. The
following precludes deadlock.
6.1 Component-Switch Protocol
Let `: be shorthand for vc-Coord(p) when VCp is executing. We describe the protocol only
for the switch from VOp to MO?; the situation is analogous for the reverse switch. Let
V = View?(). We define the following concepts as depicted in Figure 2:
o+ From Section 4, each accessibility notification from FSp forces VOp to inform its coor-
dinator VO? of the change to CommSet?(). Let VC-alert?() denote the message VCp
sends to VO? to inform VO? of the change to CommSet?().
o+ Let FS-VCNotifyp(V') be the set of not-comm(q) and comm(r) accessibilitynotifications
VOp received from FSp after sending its first CommSet?() to any coordinator and before
receiving switch(VC, V') from VO?;
So given V' and FS- VC-Notifyp(V'), VCp can infer which VC-alerI?() messages reached VO?
before it detected CoMMSETEQ(CommSet?()) and which did not. Let FS-VC-Late? be the
subset of FS- VC-Notifyp(V') for which the corresponding VC-alert?() message did not reach
VO?.
17
FSp
not-comm(r)
comm(s)
not-comm(q)
VCp
vc?
CommSet?()
I VC-aThrt?(--Hr)
VC-a1er1?(+s)
COMMSETEQ(CommSet?())
switch(vc, V')
VC-aler4(--Hq)
Figure 2: FS- VC?Notifyp(v') (tightly-shaded rectangle), VC-atert?(), FS- VC-Late? (darkly-
shaded rectangle)
18
The Component Switch protocol for VCp is:
1. The coordinator `: sends the switch(vc, V') message using a best effort reliable multi-
cast [14] (a process receiving the message reissues it to all the destination processes).
2. Upon receiving switch(VC, V'), vc?:
(a) logically reorders it to be before VCp sent any of the messages in FS- VO-Late,, (this
will be clearer after 3);
(b) installs V' as View?() and switches control to MCp, in normal mode;
3. MCp handles messages in FS- VC-Late? as if the corresponding notifications from FSp had
just arrived (i.e. while MCp is executing, and not while VCp was executing). Specifically,
MCp simulates receiving these accessibility notifications in View?() = V'.
Proposition 6.1 The Component-Switch Protocol prevents deadlock.
PROOF (sketch) We restrict this discussion to a process p, in view V, switching from its VCp
to MCp component, and suppose q E V. Suppose q never switches from VCq to MCp in view
V We show this does not prevent p from later switching from MCp back to VCp.
Because p switches to MOp in view V, p has received the switch(vC, V) message. By the
Component-Switch Protocol, p has reissued switch(vc, V) to q. Then either:
1. q never receives switch(VC, V), or
2. q receives switch(VC, V) after having already switched to MCq in view V', with V' $ V.
In the first case, NOTCOMMq(p) holds eventually. In the second, p e V' contradicts p ?
V. Thus, NOTCOMMq(p) holds, and FSp Reciprocity means eventually either p crashes or
NcTCOMM?(q) holds. Once NOTCOMM?(q) holds, p's progress (i.e. switching back to VOp)
is decoupled from q's progress; q cannot be responsible for blocking p. I
7 Concluding Remarks
This paper has shown how to implement virtually-synchronous communication using a three-
component architecture for systems that experiences process crash failures and network par-
titions. The three-component architecture lead us to define a clear semantics for a Failure
19
Suspector (a necessary part of any live, asynchronous system) that guarantees liveness of the
VC and MC components. Clearly defining these semantics allows one to implement the Fail-
ure Suspector as a modular tool --H distinct from all other components --H whose implementation
can take advantage of the characteristics of the underlying network.
Considering a membership service in relation to virtually-synchronous communication also
lead us to better understand the need for a strong-partial compared to a weak-partial member-
ship service. Specifically, a strong-partial membership service (non-intersecting concurrent
views) is naturally related to virtually-synchronous communication. We can understand this
in the following way. The MC component must identify the sender of a message by its signa-
ture ? to ensure that no multicast is delivered in more than one view. This led us to define
a view as a set of process signatures. Considering the increment conditions of seq?, two dif-
ferent views V and V' trivially have a non-empty intersection. In other words, by requiring
that no multicast be delivered in more than one view, we were led to the partial-strong mem-
bership service. llowever if we remove the MC component, (i.e. if the membership service
is only defined by FS and vc, without any reference to communication), then the sequence
number seq? has no clear justification. In that case, a view is just a set of process identifiers
(or a set of identifiers and an incarnation number). With this definition, the same VC proto-
col we described would define concurrent views that overlap, providing only a weak-partial
membership service.
References
[1]
[2]
[3]
A. El Abbadi, D. Skeen, and F. Cristian. An Efficient, Fault-Tolerant Algorithm for
Replicated Data Management. In Proceedings of the 5th ACM SIGACT-SIGMOD Sym-
posium on the Principles of Database Systems, pages 215--H229. A.C.M., 1985.
Y. Amir, D. Dolev, 5. Kramer, and D. Malki. Membership Algorithms in Broadcast
Domains. In A. Segall and 5. Zaks, editors, Proceedings of the Sixth WDAC; Israel,
pages 292--H312. Springer-Verlag, 1992. LNCS 647.
K. Birman and T. Joseph. Exploiting Virtual Synchrony in Distributed systems. In
Proceedings of the 11th Symposium on Operating System Principles, pages 123--H138,
November 1987.
[4] K. Birman, A. Schiper, and P. Stephenson. Lightweight Causal and Atomic Group
Multicast. ACM Transactions on Computer Systems, 9(3):272--H314, 1991.
[5] K.P. Birman. The Process Group Approach to Reliable Distributed Computing. Tech-
nical Report TR-91-1216, Cornell University, July 1991.
20
[6]
[7]
T. D. Chandra and V. Hadzilacos andS. Toueg. T Weakest Failure Detector for Solving
Consensus. In Proceedings of the 11th Annual A.C.M. Symposium on Principles of
Distributed Computing, pages 147--H158. ACM, August 1992.
T. D. Chandra and 5. Toueg. Unreliable Failure Detectors for Asynchronous Systems.
In Proceedings of the Tenth Annual A.C.M. Symposium on Principles of Distributed
Computing, pages 325--H340. ACM, August 1991.
[8] M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of
Distributed Systems. A.C.M. Transactions on Computer Systems, 3(1):63--H75, 1985.
[9] D. Davcev and W. A. Burkhard. Consistency and Recovery Control for Replicated
Files. In Proceedings of the 10th Symposium on Operating System Principles, pages
87--H96, 1985.
[10] M. J. Fischer, N. A. Lynch, and M. 5. Paterson. Impossibility of Distributed Consen-
sus with One Faulty Process. Journal of the Association for Computing Machinery,
32(2):374--H382, April 1985.
[11] D. K. Gifford. Weighted Voting for Replicated Data. In Proceedings of the 7th Sympo-
s?um on Operating System Principles, pages 150--H159, December 1979.
[12] R. A. Golding. Weak consistency group communication for wide-area systems. In
Proceedings of the 2nd IEEE Workshop on the Management of Replicated Data, pages
13--H16, November 1992.
[13]
F. Jahanian and W. M. Moran. Strong, Weak and Hybrid Group Membership. In
Proceedings of the 2nd lEFE Workshop on the Management of Replicated Data, pages
34--H38, November 1992.
[14] T. Joseph and K. Birman. Distributed Systems, chapter Reliable Broadcast Protocols,
pages 293--H317. Addison-Wesley, 1989.
[15] L. Lamport. Time, Clocks and the Ordering of Events in a Distributed System. Com-
munications of the A.C.M., 21(7):558--H565, 1978.
[16]
K. Marzullo and G. Neiger. Detection of Global State Predicates. In Proceedings fo the
Fifth International WDAC, pages 254--H272. Springer-Verlag (LNCS 579), 1991. Delphi,
Greece.
[17] P. M. Melliar-Smith, L. E. Moser, and V. Agrawala. Membership Algorithms for Asyn-
chronous Distributed Systems. In Proceedings of the IEEE 11th ICDCS, pages 480--H488,
May 1991.
5. Mishra, L. L. Peterson, and R. D. Schlichting. A Membership Protocol Based on
Partial Order. In Proceedings of the lEFE International Working Conf on Dependable
Computing for Critical Applications, pages 137--H145, February 1991.
[18]
21
[19] A. Ricciardi and K. Birman. Using Process Groups to Implement F'ailure Detection in
Asynchronous Environments. In Procedings of the Tenth Annual A.C.M. Symposium
on Principles of Distributed Computing, pages 341--H351. A.C.M., August 1991.
[20] A. M. Ricciardi. The Asynchronous Membership Problem. PhD thesis, Cornell Univer-
sity, January 1993.
[21]
A. M. Ricciardi, K. P. Birman, and P. Stephenson. The Cost of Order in Asynchronous
Systems. In A. Segall and 5. Zaks, editors, Proceedings of the Sixth WDAC; Israel,
pages 341--H352. Springer-Verlag, 1992. LNCS 647
[22] A. Schiper and A. Sandoz. Uniform Reliable Multicast in a Virtually Synchronous
Environment. In Proceedings of the IEEE 13th ICDCS, May 1993.
22
