BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR92-1293
ENTRY:: 1994-06-28
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
NOTES:: Replaced by 94-1426
END:: CORNELLCS//TR92-1293
BODY::
The Weakest Failure Detector
tor Solving Consensus*
Tushar Deepak Chandra**+
Vassos Hadzilacos***
Sam Toueg+
TR 92-1293
July1992
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
*Research supported by NSF grants CCR-8901780 and CCR-9102231,
DARPA/NASA Ames grant NAG 2-593, grants from the IBM Endicott Programming
Laboratory and Siemens Corp., and a grant from the Natural Sciences and
Engineering Research Council of Canada.
**Also supported by an IBM graduate fellowship.
***Computer Systems Research Institute, University of Toronto, 6 King's Colege
Road, Toronto, Ontario M5S 1 Al.
+Dept. of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853.
The Weakest Failure Detector for Solving
Consensus*
Tushar Deepak Chandrat
Vassos Hadzilacost
July 18,1992
Abstract
Sam Toueg
We determine what information about failures is necessary and sufficient to
solve Consensus in asynchronous distributed systems subject to crash failures. In
[CT9l], we proved that ?W, a failure detector that provides surprisingly little
information about which processes have crashed, is sufficient to solve Consensus
in asynchronous systems with a majority of correct processes. Iii this paper, we
prove that to solve Consensus, any failure detector has to provide at least as much
information as ?W. Thus, ?W is indeed the weakest failure detector for solving
Consensus in asynchronous systems with a majority of correct processes.
1 Introduction
1.1 Background
The asynchronous model of distributed computing has been extensively studied. Infor-
mally, an asynchronous distributed system is one in which message transmission times
and relative processor speeds are both unbounded. Thus an algorithm designed for
an asynchronous system does not rely on such bounds for its correctness. In practice,
asynchrony is introduced by unpredictable loads on the system.
Although the asynchronous model of computation is attractive for the reasons out-
lined above, it is well-known that many fundamental problems of fault-tolerant dis-
tributed computing that are solvable in synchronous systems, are unsolvable in asyn-
chronous systems. In particular, it is well-known that Consensus, and several forms
*Research supported by NSF grants CCR-8901780 and CCR-9102231, DARPA/NASA Ames grant
NAG-2-593, grants from the IBM Endicott Programming Laboratory and Siemens Corp, and a grant
from the Natural Sciences and Engineering Research Council of Canada.
tAlso supported by an IBM graduate fellowship.
tComputer Systems Research Institute, University of Toronto, 6 King's College Road, Toronto,
Ontario M5S lAl
Dept. of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853
2
of reliable broadcast, including Atomic Broadcast, cannot be solved deterministically in
an asynchronous system that is subject to even a single crash failure [FLP85,DDS87].
Essentially, these impossibility results stem from the inherent difficulty of determining
whether a process has actually crashed or is only "very slow
To circumvent these impossibility results, previous research focused on the use of ran-
domization techniques [CD89],the definition of some weaker problems and their solutions
[DLP+86,ABD+87,BW87,BMZ88], or the study of several models of partial synchrony
[DD587,DLS88]. However, the impossibility of deterministic solutions to many agreement
problems (such as Consensus and Atomic Broadcast) remains a major obstacle to the
use of the asynchronous model of computation for fault-tolerant distributed computing.
An alternative approach to circumvent such impossibility results is to augment the
asynchronous model of computation with a failure detector. Informally, a failure detector
is a distributed oracle that gives (possibly incorrect) hints about which processes may
have crashed so far: Each process has access to a local failure detector module that
monitors other processes in the system, and maintains a list of those that it currently
suspects to have crashed. Each process periodically consults its failure detector module,
and uses the list of suspects returned in solving Consensus.
A failure detector module can make mistakes by erroneously adding processes to its
list of suspects: i.e., it can suspect that a process p has crashed even though p is still
running. If it later believes that suspecting p was a mistake, it can remove p from its list.
Thus, each module may repeatedly add and remove processes from its list of suspects.
Furthermore, at any given time the failure detector modules at two different processes
may have different lists of suspects.
It is important to note that the mistakes made by a failure detector should not prevent
any correct process from behaving according to specification. For example, consider an
algorithm that uses a failure detector to solve Atomic Broadcast in an asynchronous
system. Suppose all the failure detector modules wrongly (and permanently) suspect
that a correct process p has crashed. The Atomic Broadcast algorithm must still ensure
that p delivers the same set of messages, in the same order, as all the other correct
processes. Furthermore, if p broadcasts a message m, all correct processes must deliver
In [CT91], we showed that a surprisingly weak failure detector is sufficient to solve
Consensus and Atomic Broadcast in asynchronous systems with a majority of correct
processes. This failure detector, called the eventually weak failure detector and denoted
W here, satisfies only the following two properties:2
1. There is a time after which every process that crashes is always suspected by some
correct process.
`A different approach was taken in [RB91]: a correct process that is wrongly suspected to have
crashed, voluntarily leaves the system. It may later rejoin the system by assuming a new identity.
2? [CT9l], this was denoted ?W.
3
2. There is a time after which some correct process is never suspected by any correct
process.
Note that, at any given time t, processes cannot use W to determine the identity of a
correct process. Furthermore, they cannot determine whether there is a correct process
that will not be suspected after time t.
The failure detector )`V can make an infinite number of mistakes. In fact, it can forever
add and then remove some correct processes from the lists of suspects (this reflects
the inherent difficulty of determining whether a process is just slow or has crashed).
Moreover, some correct processes may be erroneously suspected to have crashed by all
the other processes throughout the entire execution.
The two properties of W state that eventually something must hold forever; this
may appear too strong a requirement to implement in practice. However, when solving a
problem that "terminates", such as Consensus, it is not really required that the properties
hold forever, but merely that they hold for a sufficiently long time, i.e., long enough for
the algorithm that uses the failure detector to achieve its goal. For instance, in practice
the algorithm of [CT9lj that solves Consensus using W only needs the two properties of
W to hold for a relatively short period of time.3 However, in an asynchronous system it
is not possible to quantify "sufficiently long", since even a single process step or a single
message transmission is allowed to take an arbitrarily long amount of time. Thus it is
convenient to state the properties of W in the stronger form given above.
1.2 The problem
The failure detection properties of)'V are sufficient to solve Consensus in asynchronous
systems. But are they necessary? For example, consider failure detector A that satisfies
Property 1 of W and the following weakening of Property 2:
There is a time after which some correct process is never suspected by at
least 99% of the correct processes.
A is clearly weaker than W. Is it possible to solve Consensus using A? Indeed what
is the weakest failure detector sufficient to solve Consensus in asynchronous systems?
In trying to answer this fundamental question we run into a problem. Consider failure
detector B that satisfies the following two properties:
1. There is a time after which every process that crashes is always suspected by all
correct processes.
2. There is a time after which some correct process is never suspected by a majority
of the processes.
3'n that algorithm processes are cyclically elected as "coordinators". Consensus is achieved as soon
as a correct coordinator is reached, and no process suspects it to have crashed while this coordinator is
trying to enforce consensus.
4
It seems that 13 and W are incomparable: 13's first property is stronger than )?V's, and 13's
second property is weaker than W's. Is it possible to solve Consensus in an asynchronous
system using 13? The answer turns out to be "yes" (provided that this asynchronous
system has a majority of correct processes, as W also requires). Since W and 13 appear
to be incomparable, one may be tempted to conclude that W cannot be the "weakest"
failure detector with which Consensus is solvable. Even worse, it raises the possibility
that no such "weakest" failure detector exists.
However, a closer examination reveals that 13 and W are indeed comparable in a
natural way: There is a distributed algorithm Ts?w that can transform 13 into a failure
detector with the Properties 1 and 2 of W. TB?w works for any asynchronous system
that has a majority of correct processes. We say that W is reducible to 13 in such a
system. Since TB?w is able to transform 13 into ? in an asynchronous system, 13 must
provide at least as much information about process failures as W does. Intuitively, 13 is
at least as strong as
1.3 The result
In [CT91j, we showed that W is sufficient to solve Consensus in asynchronous systems
if and only if n> 2f (where n is the total number of processes, and f is the maximum
number of processes that may crash). In this paper, we prove that W is reducible
to any failure detector D that can be used to solve Consensus (this result holds for
any asynchronous system). We show this reduction by giving a distributed algorithm
TD?w that transforms any such D into W. Therefore, Wis indeed the weakest failure
detector that can be used to solve Consensus in asynchronous systems with n > 2f.
Furthermore if n ? 2f, any failure detector that can be used to solve Consensus must
be strictly stronger than W.
The task of transforming any given fallure detector D (that can be used to solve
Consensus) into )`V runs into a serious technical difficulty for the following reasons:
To strengthen our result, we do not restrict the output of D to lists of suspects.
Instead, this output can be any value that encodes some information about failures.
For example, a failure detector V should be allowed to output any boolean formula,
such as "(not p) and (q or r)" (i.e., p is up and either q or r has crashed)--Hor any
encoding of such a formula. Indeed, the output of V could be an arbitrarily complex
(and unknown) encoding of failure information. Our transformation from V into
W must be able to decode this information.
Even if the failure information provided by V is not encoded, it is not clear how to
extract from it the failure detection properties of )`V. Consequently, if V is given
in isolation, the task of transforming it into W may not be possible.
Fortunately, since V can be used to solve Consensus, there is a corresponding algorithm,
Consensusp, that is somehow able to "decode" the information about failures provided
5
by D, and knows how to use it to solve Consensus. Our reduction algorithm, TD?w uses
Consensu? to extract this information from D and transforms it into the properties of
A,.
2 The model
We describe a model of asynchronous computation with failure detection patterned after
the one in [FLP85J.
2.1 Failure Detectors
We assume the existence of a discrete global clock to simplify the presentation. This is
merely a fictional device: the processes do not have access to it. We take the range T of
the clock's ticks to be the set of natural numbers.
The system consists of a set of n processes, II = fpi,p2,... ,pnl, that may fail by
crashing. A failure pattern F is a function from T to 2?, where F(t) denotes the set
of processes that have crashed through time 1. Once a process crashes, it does not
"recover", i.e., Vt : F(t) C F(t + 1). We define crashed(F) = UtET F(t) and correct(F) =
II --H crashed(F). If p ? crashed(F) we say p crashes in F and if p ? correct(F) we say p
ts correct in F.
Associated with each failure detector is a range ? of values output by that failure
detector. A failure detector history H with range 1? is a function from fix T to 1?. H(p, t)
is the value of the failure detector module of process p at time t. A failure detector D
is a function that maps each failure pattern F to a set of failure detector histories with
range ?? (where 1?D denotes the range of failure detector outputs of D). D(F) denotes
the set of possible failure detector histories permitted by D for the failure pattern F.
For example, consider the failure detector A, mentioned in the introduction. Each
failure detector module of A, outputs a set of processes that are suspected to have
crashed: in this case ?? = 211. For each failure pattern F, ?(F) is the set of all failure
detector histories Hw with range l?w that satisfy the following properties:
1 There is a time after which every process that crashes in F is always suspected by
some process that is correct in F:
E T, Vp E crashed(F), ?q ? correct(F), Vt' > t : p E Hw(q, t')
2. There is a time after which some process that is correct in F is never suspected by
any process that is correct in F:
T, ?p ? correct(F), Vq ? correct(F), Vt' > t : p ? Hw(q, t')
6
Note that we specify a failure detector D as a function of the failure pattern F of
an execution. However, this does not preclude an implementation of D from using other
aspects of the execution such as when messages are received. Thus, executions with the
same failure pattern F may still have different failure detector histories. It is for this
reason that we allow D(F) to be a set of failure detector histories from which the actual
failure detector history for a particular execution is selected non-deterministically.
2.2 Algorithms
We model the asynchronous communication channels as a message buffer which contains
messages of the form (p, data, q) indicating that process p has sent data addressed to
process q and q has not yet received that message. An algorithm A is a collection of
n (possibly infinite state) deterministic automata, one for each of the processes. A(p)
denotes the automaton running on process p. Computation proceeds in steps of the given
algorithm A. In each step of A, process p performs atomically the following three phases:
Receive phase: p receives a single message of the form (q, data, p) from the message
buffer, or a "null" message, denoted A, meaning that no message is received by p
during this step.
Failure detector query phase: p queries and receives a value from its failure detector
module. We say that p sees a value d when the value returned by p's failure detector
module is d.
Send phase: p changes its state and sends a message to all the processes according to
the automaton A(p), based on its state at the beginning of the step, the message
received in the receive phase, and the value that p sees in the failure detector query
phase.4
The message actually received by the process p in the receive phase is chosen non-
deterministically from amongst the messages in the message buffer destined to p, and
the null message A. The null message may be received even if there are messages in the
message buffer that are destined to p: the fact that m is in the message buffer merely
indicates that m was sent to p. Since ours will be a model of asynchronous systems,
where messages may experience arbitrary (but finite) delays, the amount of time m
may remaill ill the message buffer before it is received is unbounded. Indeed, our
model will allow a message sent later than another to be received earlier than the other.
Though message delays are arbitrary, we also want them to be finite. We model this
by introducing a liveness assumption: every message sent will eventually be received,
4'n the send phase, p sends a message to all the processes atomically. As was shown in [FLP85],
the ability to do so is not sufficient for solving Consensus. An alter?ative formulation of a step could
restrict a process to sending a message to a single process in the send phase. We can show that both
formulations are equivalent for our purposes.
7
provided its recipient makes "sufficiently many" attempts to receive messages. All this
will be made more precise later.
We also remark that the non-determinism arising from the choice of the message to be
received reflects the asynchrony of the message buffer --H it is not due to non-deterministic
choices made by the process. The automaton A(p) is deterministic in the sense that the
message that p sends in a step and p's new state are uniquely determined from the
present state of p, the message p received during the step and the failure detector value
seen by p during the step.
To keep things simple we assume that a process p sends a message m to q at most
once. This allows us to speak of the contents of the message buffer as a set, rather than
a multiset. We can easily enforce this by adding a counter to each message sent by p to
q --H so this assumption does not damage generality.
2.3 Configurations, Runs and Environments
A configuration is a pair (s, M), where $ is a function mapping each process p to its local
state, and M is a set of triples of the form (q, data, p) representing the messages presently
in the message buffer. An initial configuration of an algorithm A is a configuration (s, M),
where s(p) is an initial state of A(p) and M = Q). A step of a given algorithm A transforms
one configuration to another. A step of A is uniquely determined by the identity of the
process p that takes the step, the message m received by p during that step, and the
failure detector value d seen by p during the step. Thus, we identify a step of A with a
tuple (p, m, d, A). If the message received in that step is the null message, then m = A,
otherwise m is of the type (--H, --H,p). We say that a step e = (p,m,d,A) is applicable to
a configuration C = (s, M) if and only if m E M u fAl. We write e(C) to denote the
unique configuration that results when e is applied to C.
A schedule 5 of algorithm A is a finite or infinite sequence of steps of A. 5j denotes
the empty schedule. We say that a schedule 5 of an algorithm A is applicable to a
configuration C if and only if (a) 5 = 5j, or (b) 5[1] is applicable to C, 5[2] is applicable
to 5[1](C), etc.5 If 5 is a finite schedule applicable to C, 5(C) denotes the unique
configuration that results from applying 5 to C. Note 51(C) = C for all configurations
C. We say that C' is a configuration of (5, C) if there is a prefix 5' of 5 such that
C1 =
A partial run of algorithm A using a failure detector D is a tuple R = (F, HD, 1, 5, T?
where F is a failure pattern, HD ? D(F) is a failure detector history, 1 is an initial
configuration of A, 5 is a finite schedule of A, and T is a finite list of increasing time
values (indicating when each step in 5 occurred) such that 151 = IT, 5 is applicable to
1, and for all i < 151, if 5[ij is of the form (p,m,d,A) then:
o+ p has not crashed by time T[ij, i.e., p ? F(T[iJ)
o+ d is the value of the failure detector module of p at time T[i], i.e., d = Hv(p, T[i])
5We denote by v(i] the ith element of a sequence v.
8
Informally, a partial run of A using D represents a finite point of some execution of A
using D.
A run of an algorithm A using a failure detector D is a tuple R = (F, HD, 1, 5, T?
where F is a failure pattern, Hv ? D(F) is a failure detector history, 1 is an initial
configuration of A, 5 is an infinite schedule of A, and T is an infinite list of increasing
time values indicating when each step in 5 occurred. In addition to satisfying the above
properties of a partial run, a run must also satisfy the following properties:
o+ Every correct process takes an infinite number of steps in 5. Formally:
Vp ? correct(F), Vi, ?j > i : S[j] is of the type(p, --H, --H, A)
o+ Every message sent to a correct process is eventually received. Formally:
Vp ? correct(F),VC = (s,M) of (5, I) : m = (q,data,p) ? M ?
: 5[i] is of the type (p, m,
In [CT9l], we proved that any algorithm that uses W to solve Consensus requires
n > 2f. With other failure detectors the requirements may be different. For example,
there is a failure detector that can be used to solve Consensus only ifP1 and P2 do not both
crash. In general whether a given failure detector can be used to solve Consensus depends
upon assumptions about the underlying "environment". Formally, an environment ? (of
an asynchronous system) is set of possible failure patterns.6
3 The Consensus problem
In the Consensus problem, each process p has an initial value, 0 or 1, and must reach an
irrevocable decision on one of these values. Thus, the algorithm of process p, A(p), has
two distinct initial states a0? and a?1 signifying that p's initial value is 0 or 1. A(p) also
has two disjoint sets of decision states ??0 and ???. If p enters a state in ??k, we require
that it remain in states in ??k, and we say that p has decided k.
We say that algorithm A uses failure detector V to solve Consensus in environment
? if every run R = (F, Hv, I, 5, T? of A using V where F ? C satisfies:
Termination: Each correct process eventually decides. Formally:
Vp ? correct(F), ?C = (s, M) of (5,1): s(p) ? ?P0 ?
a synchronous system, assumptions about the underlying environment may also include other
characteristics such as the relative process speeds, the maximum message delay, the degree of clock
synchronization, etc. In such a system, a more elaborate definition of an environment would be required.
9
Validity: Each correct process decides on the initial value of some process. Formally,
let I = (so, Mo):
Vp ? correct(F), Vk ? ?O, 11: (?C = (s, M) of (S, I):
s(p) ? ??k) ? (?q ? II : so(q) =
Agreement: No two correct processes decide differently. Formally:
Vp,p' E correct(F),VC = (s,M) of(?,I),vk,k' c fo,l?
??k A s(p') ? "??k') ? k =
4 Reducibility
We now define what it means for an algorithm Tv?vt to transform a failure detector
D into another failure detector D' in an environment ?. Algorithm TD?v' uses D to
maintain a variable output? at every process p. This variable, reflected in the local state
of p, emulates the output of D' at p. Let 0R be the history of all the output variables
in run R, i.e., OR(p, t) is the value of output? at time tin run R. Algorithm TD?D'
transforms D into D' in ? if and only if for every run R = (F, HD, I, 5, T? of TD?D
using D, where F ? ?, 0R ?
Given TD?D', anything that can be done using D' in S, can be done using D instead.
To see this, suppose a given algorithm B requires failure detector D' (when it executes in
?), but only D is available. We can still execute B as follows. Concurrently with B, we
run TD?D to transform D into D'. We now modify the failure detector query phase of
each step of B at process p: p reads the current value of output? (which is concurrently
maintained by TD?DI) instead of querying its failure detector module. This is illustrated
in Fig. 1.
intuitively, since TD?D is able to use D to emulate D', D provides at least as much
information about process failures in ? as D' does. Thus, if there is an algorithm TD?Df
that transforms D into D' in ?, we write D ?? D' and say that D' is reducible to D in
we aiso say that D' is weaker than D in ?
5 An outline of the result
in [CT91] we showed that W can be used to solve Consensus in any environment in which
n > 2f. We now show that W is weaker than any failure detector that can be used to
solve Consensus. This result holds for any environment ?. Together with [CT91J, this
implies that W is indeed the weakest failure detector that can be used to solve Consensus
in any environment ill which n> 2f.
10
TD?D
Algorithm B uses
Figure 1: Transforming D into D'
To prove our result, we first define a new failure detector, denoted ?, that is at least
as strong as W. We then show that any failure detector D that can be used to solve
Consensus is at least as strong as ?. Thus, D is at least as strong as
The output of the failure detector module of ? at a process p is a single process, q,
that p currently considers to be correct; we say that p trusts q. In this case, lZ? = II.
For each failure pattern F, ?(F) is the set of all failure detector histories H? with range
?n that satisfy the following property:
o+ There is a time after which all the correct processes always trust the same correct
process:
T,?q E correct(F),Vp ? correct(F),Vt' > t : H?(p,t') = q
As with W, the output of the failure detector module of ? at a process p may change
with time, i.e., p may trust different processes at different times, Furthermore, at any
given time t, processes p and q may trust different processes.
Theorem 1: For all environments ?, ? ?6
PROOF: [Sketch] The reduction algorithm Tn?w that transforms ? into W is as follows.
Each process p periodically sets 0UtPUtp H II --H [q?, where q is the process that p currently
trusts according to ?. It is easy to see that (in any environment S) this output satisfies
the two properties of W.
Theorem 2: For all environments s, if a failure detector D can be used to solve
11
Consensus in ?, then D ??
PROOF: The reduction algorithm Tv?n is shown in Section 6. It is the core of our result.
Corollary 3: For all environments ?, if a failure detector D can be used to solve Con-
sensus in ?, then D ??
PROOF: If D can be used to solve Consensus in ?, then, by Theorem 2 D ?? ?. From
Theorem 1, ? ?E W By transitivity, D ??
In [CT91J we proved that, for all environments ? in which n > 2f, W can be used to
solve Consensus. Together with Corollary 3, this shows that:
Theorem 4: For all environments ? in which n> 2f, W is the weakest failure detector
that can be used to solve Consensus in ?.
6 The reduction algorithm
Let ? be an environment, D be a failure detector that can be used to solve Consensus in
?, and Consensusp be the Consensus algorithm that uses D. We describe an algorithm
Tv?n that transforms D into ? in ?. Intuitively, this algorithm works as follows. Fix
an arbitrary run of TD?n using D in S, with failure pattern F e S, and failure detector
history HD E D(F). We shall first construct an infinite directed acyclic graph, denoted
G, whose vertices are some of the failure detector values that occur in HD, and whose
edges are consistent with the time at which these values occur. We then show that G
induces a simulation forest T that encodes an infinite set of possible runs of Consensusp.
Finally, we show how to extract from T the identity of a process p* that is correct in F.
The induced simulation forest is infinite and thus cannot be computed by any process.
However, the information needed to extract p* is present in a finite subgraph of the
forest. It will be sufficient for each correct process p to construct ever increasing finite
approximations of the simulation forest T that will eventually include this crucial finite
subgraph. At all times, p uses its present approximation of T to select the identity
of some process: once p's approximation of T includes the crucial finite subgraph, the
selected process will be p* (forever). Thus, there is a time after which all correct processes
trust the same correct process, p*--Hwhich is exactly what ? requires.
We say that a process is correct(crashes) if it is correct (crashes) in F. For simplicity,
we assume that a process psees a value d at most once (this can be enforced by tagging
a counter to each value seen). For the rest of this paper, whenever we refer to a run
of Consensu?, we mean a run of Consensu? using D. Furthermore, we only consider
schedules of Consensu?,and therefore we write (p,m,d)instead of (p1m,d, Consensusp)
to denote a step.
12
6.1 A DAG and a forest
Given the failure pattern F and the corresponding failure detector history HD E D(F)
that were fixed above, let G be any infinite directed acyclic graph with the following
properties:
1. The vertices of G are of the form [p,d] where p ? fl and d c ?D If Wdl is a
vertex of G, then there is a time t such that p ? F(t) and d = HD(p,t) (i.e., at
time t, p has not crashed and the value of ps failure detector module is d).
2. If [qi, d1] [q?, d2] is an edge of G and d1 = H?(qi, t1) and d2 = Hv(q2, t2) then
ti ? t2.
3. G is transitively closed.
4.
Let p be any correct process and V be a finite subset of vertices of G. There is a
failure detector value d such that for all vertices [p',d'j in V, [p',d'j [p,d] is an
edgeofG.
Note that such a DAG represents only a "sampling" of the failure detector values that
occur in Hv. In particular, we do not require that it contain all the values that occur in
HD or that it relate (with an edge) all the values according to the time at which they
occur. However, Property 4 implies that the DAG contains infinitely many "samplings"
of the failure detector module of each correct process.
Lemma 5: Let V be any finite subset of vertices in G. G has an infinite path g such
that:
o+ There is an edge from every vertex of V to the first vertex of g.
o+ If [p,--H] is a vertex of g then p is correct; for each correct p, there are infinitely
many vertices [p,--H] in g.
PROOF: By repeated application of Property 4.
Let g = [q1,d1],[q2,d2],... be any (finite or infinite) path of G. A schedule S is
compatible with g if it has the same length as g, and S = (q1,m1,di),(q2,m2,d2),...,
for some (possibly null) messages m1, m2,. We say that S is compatible with G if it is
compatible with some path of G.
Let 1 be any initial configuration of Consensusp. We define the simulation tree Yk
induced by G and I as follows. The vertices of 1G1 are the finite schedules 5 that are
compatible with G and are applicable to I. The root of ?1G is the empty schedule S?.
There is an edge from vertex 5 to vertex 5' if and only if 5' = 5. e for a step e;7 this edge
is labeled e. With each (finite or infinite) path in `1G' we associate the unique schedule
7ff u,w are sequences and v is finite then v w denotes the concatenation of the two sequences.
13
5 = e1, ..... . , ...... consisting of the sequence of labels of the edges on that path. Note
that if a path starts from the root of ?G1 and it is finite, the schedule 5 associated with
it is also the last vertex of that path.
Lemma 6: 5 is a schedule associated with a path of ?G1 that starts from the root if
and only if 5 is a schedule compatible with & and applicable to I.
PROOF: The lemma obviously holds if 5 is a finite schedule (this is immediate from the
definitions). Now let 5 = e1,e2,... ....... be an infinite schedule, where e? = [q?, m?, di].
We define So = Sj, Si = e1, 52 = e2, and in general Si = S?-i e? for all i =1,2,...
Assume that 5 is compatible with & and applicable to I. We must show that 5 is a
schedule associated with a path of ?G1 that starts from the root. To see this, note that
for all i > 0, S? is a finite schedule that is also compatible with & and applicable to I.
Thus, all the schedules 5o, Si,5.... , 5??i,S.?... are vertices of T'0. Since S? = 5i--Hi
the edge from 5i--Hi to S? is labeled e?, for all i > 1. Thus, 5 = e1,e2,... ,e?,... is the
schedule associated with the infinite path 5o			5i			52			...			5i-i			S?			... of
T01; this path starts from the root So = 5j.
Assume that 5 is a schedule associated with an infinite path of ?1G that starts from
the root. We must show that 5 is compatible with & and is applicable to I. First
note that for all i, S? is a vertex in T?,thus S? is compatible with & and is applicable
to I. Since S? = [q1,m1,d1],[q2,m2,d2],... , [qi, ?i, di] is compatible with &, & must
contain the path r? = [qi, d1], [q?, d2], . . . , [qi, di] (for all i). Note that, for all i, 7risi =
iri [?i+i, di+i] is an extension of the path r? in &. Therefore, & contains the infinite path
[qi,di],[q2,d2],... ,[qi,di],... So 5 is compatible with &. Furthermore, since all S?'s are
applicable to I, by definition of applicability, the infinite schedule 5 is also applicable to
I.			Thus, 5 is compatible with & and applicable to I.			[z
The following two lemmata show that the finite and infinite paths of ?G1 correspond
to partial runs and runs of Consensusp with initial configuration I.
Lemma 7: Let 5 be a schedule associated with a finite path of ?`c that starts from
the root. There is a sequence of times T such that (F, HD, I, 5, T) is a partial run of
&onsensusp.
PROOF: By Lemma 6, 5 is applicable to I and compatible with &. Thus 5 is compatible
with some finite path g = [q1,di], [q2,d2],. .. , [q?,di],. . ., [q?,d?] of &. From Property 1
of & (applied to every vertex of the path g), there is a sequence T = ti,t2,. . . ,ti,... ,tk
of times such that for all i, 1 < i < k, dj = Hv(q?,t?) and q? ? F(tj). From Property
2 of & (applied to every edge of the path g), for all i, 1 < i ? k, ti < ti+i. Thus T
is a sequence of increasing times, and, by definition, (F, Hv, I, 5, T? is a partial run of
Consensu?.
Lemma 8: Let 5 be a schedule associated with an infinite path of ?1G that starts from
the root. if in 5 every correct process takes an infinite number of steps and every mes-
sage sent to a correct process is eventually received, there is a sequence of times T such
14
that (F, HD, I, 5' T? is a run of Consensu?.
PROOF: Similar to Lemma 7.
The following lemmata show some "richness" properties of the simulation trees induced
by G.
Lemma 9: For any two initial configurations I and I', if 5 is a vertex of ?`c and is
applicable to I' then 5 is also a vertex of ?G1,
PROOF: Follows directly from the definitions.
Lemma 10: Let 5 be any vertex of T01 and p be any correct process. Let m be a
message in the message buffer of 5(1) addressed to p or the null message. For some
5 has a child 5 (p,rn,d) in T?1
PROOF: From the definition of ?c1, 5 is compatible with some finite path g of G and
applicable to I. Let v denote the last vertex of g. By Property 4, there is a d such
that v [p,d] is an edge of G. Therefore, g [p,d] is a path of G, and 5 (p, m, d) is
compatible with G.
It remains to show that 5 (p, m, d) is applicable to I. Since 5 is applicable to I, it
suffices to show that (p, m, d) is applicable to 5(1). But this is true since, by hypothesis,
m is in the message buffer of 5(1) and addressed to p, or the null message. E]
Lemma 11: Let 5 be any vertex of Tk and p be any process. Let m be a message in
the message buffer of 5(1) addressed to p or the null message. Let 5' be a descendent
of 5 such that, for some d, 5'. (p, m, d) is in T10. For each vertex 5" on the path from
5 to 5' (inclusive), 5" (p, m, d) is also in ?G1
PROOF: Since they are vertices of Tc', 5, 5" and 5' (p, m, d) are compatible with some
finite paths g, g g" and g g" g' [p,d] of G, respectively. From Property 3 (transitive
closure) of G, g g" [p,d] is also a path of G. So 5" (p, m, d) is compatible with this
path of G. We now show that 5". (p, m, d) is also applicable to I, and therefore it is a
vertex of ?c1'
Since 5" is a vertex of ?1a, 5" is applicable to I. If m = A, then (p, md) is obviously
applicable to 5"(I). Now suppose m # A. Since 5'. (pm, d) is a vertex of ?c1, (pm, d)
is applicable to 5'(I), and thus m is in the message buffer of S'(I). Since each message
is sent at most once and m is in the message buffers of 5(1) and 5'(I), there is no edge
of the type (p, m, --H) on the path from 5 to 5'. So m is also in the message buffer of
5"(I), and (p, m, d) is applicable to 5"(I).
Lemma 12: Let 5, 5o, and Si be any vertices of ?G1 There is a finite schedule E
containing only steps of correct processes such that:
1. 5 E is a vertex of Tc1 and all correct processes have decided in 5
2. Fori = 0,1, ifE is applicable to S?(I) then S? E is avertex ofYG1.
15
j&O
5 ?5? is compatibTh with g and applicable to I)
repeat forever
j&j+1
Let [q?, d1] be the j?th vertex of path g00
Let m? be the oldest message addressed to Q in the message buffer of Si--H'(I)
(if no such message exists, m? A)
5j--H1 e? f5? is compatible with g [q1, di? ... , [q?, dil and applicable to I)
Figure 2: Generating schedule 5. E00, compatible with path g g00, in Tk
PROOF: Since 5 is a vertex of Tk. 5 is compatible with some finite path g of G and
is applicable to I. Similarly, 5o and Si are compatible with some finite path 9o and g1,
respectively, of G. From Lemma 5 (applied to the last vertices of g, go and g?), G has
an infinite path g? = [q1,d1],?2,d2J,.. . , [q?, dj],... with the following two properties:
1. There is an edge from the last vertex of g,go and g1 to the first vertex of g?. (Thus,
g g?. 9o :900, and :9i :9oc are infinite paths in G.)
2. If [p,--H] is a vertex of :900 then p is correct; for each correct p, there are infinitely
many vertices [p,--H] in :900
We now show how to construct the required schedule E. Consider the infinite sequence
of schedules 50,51, 52,???, ...... constructed by the algorithm in Figure 2. An easy
induction shows that for all j > 0, 5? is applicable to 1 and is compatible with g [q1, d1]
a prefix of the path :9 :900 in G. So, for all j > 0, S? is a vertex of
Consider the infinite path of ?1G that starts from the root of Yk then goes to 50 = 5,
and then to 51, .2...., 5',... The infinite schedule associated with that path is 500 =
5. e1 e2 .... .... . Note that schedule E00 = e1 e2 ... .?... is compatible with path
:900 of G. By Property (2) of path g00, every correct process p takes an infinite number
of steps in E00 (and thus also in 500 = 5. E00). Since in each one of these steps p
receives the oldest message that is addressed to it, every message sent to p (in 500) is
eventually received. By Lemma 8, there is a T such that R = (F, HD, 1,500, T) is a run
of Consensu?.
From the termination requirement of Consensus, 500 has a finite prefix 5d such that
all correct processes have decided in 5d(1). There are two cases:
o+ 5d is a prefix of 5. Since decisions are irrevocable, all correct processes remain
decided in 5(1). Thus Si, the empty schedule, is the required E.
16
5 is a prefix of 5d Thus, gd = 5 E where E is a finite prefix of E?. Since E?
is compatible with g?, E is compatible with a prefix of g?. Now consider 5o (the
following argument also applies to 51) Since 5o is compatible with g0, So E is
compatible with a prefix of g0 g?, a path in G. So, So E is compatible with G.
If So E is also applicable to I, then, by the definition of ?G1 it is a vertex of ?`c
The same argument holds for Si It remains to show that E contains only steps
of correct processes. This is immediate from Property (2) of g? and from the fact
that E is compatible with a prefix of g? El
Let P, 0 ? i < n denote the initial configuration of Consensusp in which the initial
values of Pi ... are 1, and the initial values of ....... p? are 0. The simulation forest
induced by G is the set f?c? ..... . , Tk ? of simulation trees induced by G and initial
configurations 10,11)... , Ifl
6.2 Tagging the simulation forest
We assign a set of tags to each vertex of every tree in the simulation forest induced by
G. Vertex 5 of tree ?`c gets tag k if and only if it has a descendent 5' such that some
correct process has decided kin S'(I). Hereafter, T? denotes the tagged tree ?`c? and T
denotes the tagged simulation forest ?T0, ..... .
Lemma 13: Every vertex of T? has at least one tag.
PROOF: From Lemma 12, every vertex 5 of T? has a descendent 5' = 5. E (for some
E)			such that all correct processes have decided in 5'(P).			El
A vertex of T? is monovalent if it has only one tag, and bivalent if it has both tags, 0 and
1. A vertex is 0-valent if it is monovalent and is tagged O; 1-valent is similarly defined.
Lemma 14: Every vertex of T? is either O-valent, 1-valent, or bivalent.
PROOF: Immediate from Lemma 13.			El
Lemma 15: The ancestors of a bivalent vertex are bivalent. The descendents of a
valent vertex are k-valent.
PROOF: Immediate from the definitions.			El
Lemma 16: If vertex 5 of T' has tag k, then no correct process has decided 1 --H k in
5(P).
PROOF: Since 5 has tag k, it has a descendent 5' such that a correct process p has
decided k in 5'(Ii). From Lemma 7, there is a T such that R = (F,HD,Ii,5',T?
is a partial run of Consensusp. Since p has decided k in 5'(Ii), from the agreement
17
requirement of Consensus, no correct process has decided I --H k in S'(I?). Since 5' is a
descendent of 5, no correct process could have decided 1 --H k in 5(P).
Lemma 17: If vertex 5 of T? is bivalent, then no correct process has decided in 5(P).
PROOF: Immediate from Lemma 16.			El
Recall that in 10 all processes have initial value 0, while in I? they all have initial value
1.
Lemma 18: The root of T0 is 0-valent; the root of T? is 1-valent.
PROOF: We first show that the root of T0 is 0-valent. Suppose, for contradiction, that
the root of T0 has tag 1. There must be a vertex 5 of T0 such that some correct process
has decided 1 in 5(10). From Lemma 7, there is aT such that R (F,HD,10,5,T)
is a partial run of Consensu?. R violates the validity requirement of Consensus--Ha
contradiction. Thus the root of T0 cannot have a tag of 1. From Lemma 13, the root of
T0 has at least one tag: thus it is 0-valent.
By a symmetric argument, the root of T? is 1-valent. El
Index i is critical if the root of T' is bivalent, or if the root of yi--H1 is 0-valent while the
root of T? is 1-valent. In the first case, we say that index i is bivalent critical; in the
second case, we say that i is monovalent critical.
Lemma 19: There is a critical index i 0 ? i ?
PROOF: Apply Lemmata 14 and 18 to the roots of T0, ..... . ,
El
The critical index i is the key to extracting the identity of a correct process. In fact, if i
is monovalent critical, we shall prove that p? must be correct (Lemma 21). If i is bivalent
critical, the correct process will be found by focusing on the tree T?, as explained in the
following section.
6.3 Of hooks and forks
We describe two types of finite subtrees of T? referred to as decision gadgets of T1?. Each
type of decision gadget is rooted at the root 5j of T? and has exactly two leaves: one
0-valent and one 1-valent. The least common ancestor of these leaves is called the pivot.
The pivot is clearly bivalent.
The first type of decision gadget is called a fork, and is shown in Figure 3. The two
leaves are children of the pivot, obtained by applying different steps of the same process
p. Process p is the deciding process of the fork, because its step after the pivot determines
the decision of correct processes.
18
0
5 (p,m,d)
Root
0
5'
5
Figure 3: A fork--Hp is the deciding process
Root
0
5'
to'
e
(lsvot
5' = 5.(p,m,d)
Figure 4: A hook--Hp is the deciding process
5			si
repeat
19
?S? is the bivalent root of T?t
forever
Let p be the next correct process in round-robin order
Let m be the oldest message addressed to p in the message buffer of 5(li)
(if no such message exists, m = A)
if 5 has a descendent 5' such that, for some d, 5' (p, m, d) is a bivalent vertex in T%o+
then 5			5'. (p,m,d)			?5 is bivalentt
else exit
Figure 5: Generating path ir in T?
The second type of decision gadget is called a hook, and is shown in Figure 4. Let 5
be the pivot of the hook. There is a step e such that 5. e is one leaf, and the other leaf is
5 (p, m, d) e for some p, m, d. Process p is the deciding process of the hook, because the
decision of correct processes is determined by whether p takes the step (p, m, d) before
We shall prove that the deciding process p of a decision gadget must be correct
(Lemma 23). Intuitively, this is because if p crashes no process can figure out whether
p has taken the step that determines the decision value. The existence of such a critical
"hidden" step is also at the core of many impossibility proofs starting with [FLP85].
In our case, the "hiding" is more difficult because now processes have recourse to the
failure detector V. Despite this, the hiding of the step of the deciding process of a
decision gadget is still possible.
Lemma 20: If index i is bivalent critical then T' has at least one decision gadget (and
hence a deciding process).
PROOF: Starting from the bivalent root of Tt, we generate a path ir in T', all the
vertices of which are bivalent, as follows. We consider all correct processes in round-
robin fashion. Suppose we have generated path 5 so far, and it is the turn of process p.
Let m be the the oldest message destined to p that is in the message buffer of 5(1i).8 (If
no such message exists, we take m to be the null message.) We try to extend the path 5
so that the last edge in the extension corresponds to p receiving m and the target of that
edge is a bivalent vertex. The path construction ends if and when such an extension is
no longer possible. This construction is shown in Figure 5. Each iteration of the loop
extends the path by at least one edge. Let It be the path generated by these iterations;
Ir is finite or infinite depending on whether the loop terminates.
Claim 1: ir is finite.
PROOF: Suppose, for contradiction, that 7r is infinite. Let 5 be the schedule associated
8By a slight abuse of notation we identify a finite path from the root and its associated schedule.
20
0-valent
Fork (for case 1)
___________ S (bivalent)
D
5o (pivot of hook)
,d' Q5i
1-valent			`?Q s'
1-valent
Hook (for case 2)
Figure 6: The decision gadgets in T? ifi is bivalent critical
with r. By construction, in S every correct process takes an infinite number of steps and
every message sent to a correct process is eventually received. By Lemma 8, there is a
T such that R = (F, HD, I?, 5, T) is a run of Consensn?. By construction, all vertices
in Ir are bivalent. By Lemma 17, no correct process decides in R, thus violating the
termination requirement of Consensus--Ha contradiction. claim 1
Let 5 be the last vertex of Ir (clearly, 5 is bivalent). Let p be the next correct process
in round-robin order when the loop in Figure 5 terminates. Let m be the oldest message
addressed to p in the message buffer of 5(li) (if no such message exists, m is the null
message). The loop exit condition and Lemma 14 imply that
All descendents 5'. (p, m, --H) of 5 are monovalent.
From Lemma 10, for some d, 5 has a child 5 (p,m,d) in T?. By (*), 5. (p,m,d) is
monovalent. Without loss of generality, assume it is 0-valent.
Claim 2: For some d' there is adescendent 5' of5 such that 5' (p,m,d') is a 1-valent
vertex ofIi, and the path from 5 to 5' contains no edge labeled (p,m,
21
PROOF: Since 5 is bivalent, it has a descendent 5* such that some correct process has
decided 1 in 5*(1i). From Lemmata 13 and 16, 5* is 1-valent. There are two cases:
1. The path from 5 to 5* does not have an edge labeled (pm,--H). Suppose m # A.
Since m is in the message buffer of 5(li) and p does not receive m in the path
from 5 to 5*, m is still in the message buffer of 5*(1i), From Lemma 10, for some
d', 5*. (pm, d') is in T?. Since 5* is 1-valent, by Lemma 15, 5*. (p, md') is also
1-valent. In this case, the required 5' is 5??
2.
The path from 5 to 5* has an edge labeled (p,m, --H). Let (p,m,d') be the first
such edge on that path. Let 5' be the source of this edge. By (*), 5'. (p, m,
is monovalent. Since 5'. (p, m, d') has a 1-valent descendent 5*, by Lemma 15,
5' (p, m, d') is 1-valent.			?c1aim 2
Consider the vertex 5' and edge (p, m, d') of Claim 2. By Lemma 11, for each vertex 5"
on the path from 5 to 5' (inclusive), 5". (p, m, d') is also in T?. By (*), all such vertices
5" (p, m, d') are monovalent. In particular, 5. (p, m, d') is monovalent. There are two
cases (see Figure 6):
1. 5. (p, m, d') is 1-valent. Since 5. (p, m, d) is 0-valent, T' has a fork with pivot 5.
2. 5. (p, m, d') is 0-valent. Recall that 5'. (p, m, d') is 1-valent and for each vertex
5" between 5 and 5', 5". (p, m, d') is monovalent. Thus, the path from 5 to 5'
must have two vertices 5o and Si such that 5o is the parent of 5i, 5o (p, m, d') is
0-valent and 5i (p, m, d') is 1-valent. Hence, T? has a hook with pivot 5o.
6.4 Extracting the correct process
By Lemma 19, there is a critical index i. If i is monovalent critical, Lemma 21 below
shows how to extract a correct process. If i is bivalent critical, a correct process can be
found by applying Lemmata 20 and 23.
Lemma 21: If index i is monovalent critical then p? is correct.
PROOF: Suppose, for contradiction, that p? crashes. By Lemma 12(1) (applied to the
root 5 = Sj of T1), there is a finite schedule E that contains only steps of correct
processes (and hence no step of p?) such that all correct processes have decided in E(I?).
Since index i is monovalent critical, the root Si of Ti is 1-valent. Hence all correct
processes must have decided 1 in E(P).
Ii and 1i-i only differ in the state of p?. Since 5 is applicable to Ii and does not
contain any steps of p?, an easy induction on the number of steps in 5 shows that: (a) 5
is also applicable to 1i--Hi, and (b) the state of all processes other than p? are the same in
5(P) and 5(P-'). Using Lemma 9, (a) implies that 5 is also a vertex of yi-1 By (b),
22
AS'
Si (1?valent)ffik
4 Si E
(bivalent)
?Q So (0-valent)
5 E (Correct processes assumed to have
decided 0)
Figure 7: Lemma 22
all correct processes have decided 1 in 5(1i?i) Thus the root of Ti--H1 has tag 1. Since i
is monovalent critical, the root of Ti--H1 is O-valent?a contradiction.
Lemma 22: Let 5 be any bivalent vertex of T?, and So, Si be any 0-valent and 1-valent
descendents of 5. If the paths from 5 to So and from 5 to S? contain only steps of the
form (p, --H, --H), then p is correct.
PROOF: Suppose,
E containing only
1. 5. E is a vertex of Ti and all correct processes have decided in 5 E(Ii).
2. Fork = 0,1, ifSk  is applicable to 1? then 5k E is avertex 0fy?
for contradiction, that p crashes. From Lemma 12, there is a schedule
steps of correct processes (and hence no step of p) such that:
Without loss of generality assume that all correct processes decided 0 in 5 E(Ii). ?efer
to Figure 7. Since all steps in the path from 5 to Si are steps of p, the state of every
process other than p is the same in 5(li) and in 51(1i) Furthermore, any message
addressed to a process other than p that is in the message buffer in 5(li) is still in the
message buffer in 51(1i), Since E is applicable to 5(li) and does not contain any steps
of p, an easy induction on the number of steps in E shows that: (a) E is also applicable
to 51(1i), and (b) the state of every process other than p is the same in 5 (li) and
Si E(Ii). By (ii), (a) implies that S? E(Ii) is a vertex in Ti. By (b), all correct
processes decide 0 in S1 E(Ii). Thus Si, has tag 0. But Si is 1-valent--Ha contradiction.
E)
Lemma 23: The deciding process of a decision gadget is correct.
PROOF: Let ? be any decision gadget of Ti. There are two cases to consider:
23
PS'
So = 5 (p', m', d') (0-valent)
$ 5 (bivalent)
7			DS1=S?(?,m,d)
7 Si = 5, (p',m',d') (1-valent)
El
So
$ Si E
Figure 8: Lemma 23
1. ? is a fork. By Lemma 22, the deciding process of ? is correct.
2.
is a hook. Assume (without loss of generality) that 5 is the pivot of ?, So
S.(p',m',d') is the 0-valent leafof? and S? = S.(p,m,d).(p',m',d') is the 1-valent
leaf of ? (see Figure 8). There are two cases:
(a) p = p'. By Lemma 22, p is correct.
(b) P $ P' Suppose, for contradiction, that p crashes. By Lemma 12, there is a
schedule E containing only steps of correct processes (and hence no step of
P) such that:
i. So E is a vertex of T? and all correct processes have decided in So
Since 5o is 0-valent, all correct processes must have decided 0 in So E(1i).
ii. If E is applicable to 51(P) then Si E is a vertex of T?
Let 5' = 5 (p, m, d) be the parent of Si. The state of every process other than
p is the same in 5(P) and 5'(P). Furthermore, any message addressed to a
process other than p that is in the message buffer in 5(P) is still in the message
buffer in S'(P). Therefore, since So = 5 (p', m', d') and Si = 5' (p', m', d'),
the state of every process other than p is the same in 50(P) and 51(P). In
addition, any message addressed to a process other than p that is in the
24
?Build and tag simulation forest T induced by G?
fori?O,1,...,n:
Ti			simulation tree induced by G and I?
for every vertex 5 of T?
if 5 has a descendent 5' such that a correct process has decided k in 5?(1i)
then add tag k to 5
Select a process from tagged simulation forest T ?
Z & smallest critical index
if i is monovalent critical then return p?
else return deciding process of the smallest decision gadget in Y?
Figure 9: Selecting a correct process
(1)
(2)
(3)
message buffer in 50(li) is also in the message buffer in 51(li). Since E is
applicable to 50(P) and does not contain any steps of p, an easy induction
on the number of steps in E shows that: (a) E is also applicable to 51(P),
and (b) the state of every process other than p is the same in 5o E(P) and
Si E(P). By (ii), (a) implies that Si E is a vertex of T? By (b), all
correct processes decide 0 in Si E(P). Thus Si, receives a tag of 0. But 5i
is 1-valent--Ha contradiction.			z
There may be several critical indices and several decision gadgets in the simulation forest.
Thus, Lemmata 21 and 23 may identify many correct processes. Our selection rule will
choose one of these, as the failure detector ? requires, as follows. It first determines
the smallest critical index i. If i is monovalent critical, it selects p?. If, on the other
hand, i is bivalent critical, it chooses the "smallest" decision gadget in T? according to
some encoding of gadgets, and selects the corresponding deciding process. It is easy to
encode finite graphs as natural numbers. Since a decision gadget is just a finite graph,
the selection rule can use any such encoding. The whole method of selecting a correct
process is shown in Figure 9.
Theorem 24: The algorithm in Figure 9 selects a correct process.
PROOF: By Lemma 19, there is a critical index i 0 < i < n. If i is monovalent critical,
Line 2 returns p? which, by Lemma 21, is correct. If i is bivalent critical, by Lemma 20,
T? contains at least one decision gadget. Let ? be the decision gadget in T? with the
smallest encoding. By Lemma 23, the deciding process of ? is correct in F. Thus, Line
3 returns the identity of a process that is correct.
6.5 The reduction algorithm TDH?
25
The selection of a correct process described in Figure 9 is not yet the distributed algo-
rithm TD?n that we are seeking: it involves an infinite simulation forest and is "central-
ized". To turn it into a distributed algorithm, we will modify it as follows. Each process
will cooperate with other processes to construct ever Increasing finite approximations of
the simulation forest. Such approximations will eventually contain the decision gadget
and the other tagging information necessary to extract the identity of the same correct
process chosen by the selection method in Figure 9.
Note that the selection method in Figure 9 involves three stages: The construction of
a graph representing samples of failure detector values and their temporal relationship;
the construction and tagging of the simulation forest induced by G; and, finally, the
selection of a correct process using this forest.
Algorithm T??? consists of two components. in the first component, each process
repeatedly queries its failure detector module and sends the failure detector values it
sees to the other processes. This component enables correct processes to construct ever
increasing finite approximations of the same G. Since all inter-process communication
occurs in this component, we call it the communication component of Tv??.
in the second component, each process repeatedly (a) constructs and tags the simu-
lation forest induced by its current approximation of G, and (b) selects the identity of
a process using its current simulation forest. Since this component does not require any
communication, we call it the computation component of TD?n.
6.5.1 The communication component
in this component processes cooperate to construct ever increasing approximations of
the same G. Let Gp denote p's current approximation of G. Roughly speaking, each
process p periodically performs the following two tasks: (i) if p receives Gq for some q, it
incorporates this information by replacing Gp with the union of Gp and Gq. (ii) Process
p queries its own failure detector module. Let d be the value that it sees and [p',d'] be
any vertex currently in Gp. Clearly, p saw d after p' saw d'. Thus p adds \J), dj to Gp,
with edges from all other vertices of Gp to [p,dl. Process p then sends its updated Gp to
all other processes. The communication component of Tv?? for p is shown in Figure 10.
Let Gp(t) denote the value of Gp at time t. if p takes a step at time t, Gp(t) denotes
the value of Gp at the end of that step. The next two lemmata establish certain useful
properties of the graphs constructed by the communication component. in reading the
proofs of these results it will be useful to keep in mind that in our model the three phases
of a step receive, failure detection query, and send --H occur atomically at a single time
t.9
9As mentioned in footnote 4, our results would be valid in a model where process steps have a finer
granularity. In such a model the proofs of Lemmata 25 and 26 below would be the same in essence,
although some of the details would be different to account for the fact that the three phases of what is
now considered an atomic step would not necessarily take place at the same time.
26
fBuild the directed acyclic graph Gp?
Gp H empty graph
repeat forever
RECEIVE PHASE:
p receives m
FAILURE DETECTOR QUERY PHASE:
dp			query failure detector D
SEND PHASE:
if m is of the form (q, Gq,p) then
Gp &GpUGq
add [p,dp] to Gp and edges from all other vertices of Gp to [p,dpl
output & Computation component ?Figure 11)
p sends (p,G?,q) to all q ? II
Figure 10: Process p's communication component
(1)
(2)
(3)
(4)
Lemma 25: Let v be a vertex contained in some local graph during the execution of
the communication component. Let Gp(t) be the first graph that contains v. (That is,
v is in Gp(t), but not in Gq(t'), for any process q and time t' ? t.) Then
1. v = [p,dl, and p saw d at time t.
2. If u v is an edge contained in some local graph during the execution of the
communication component then u v is contained in Gp(t).
3. Gp(t) is a subgraph of any graph that contains v.
PROOF: 1. Process p adds v into Gp(t) in Line (1) or (2). In the latter case, the result
follows immediately. In the former case, p must have received a message at time t with
a graph that contains v. The process that sent that message must have therefore had v
in its graph before time t, contradicting the choice of Gp(t) as the first graph to contain
v.
2. Consider the earliest time t' when the edge u v was added to some graph, say of
process q. By definition of t, t' > t. If t' > t, at time t' process p' receives a message that
contains a graph with the edge ?			v. The sender of that message had a graph that
contained the edge u			v at some time before t', contrary to the choice of t'. Therefore
it must be that t' = t. Then, by Part (1), q = p and so tt v is in Gp(t), as wanted.
3. Suppose, for contradiction, that some graph contains v but is not a supergraph of
Gp(t). Choose the first such graph, say, Gq(t'). By definition of t, t' > t. Clearly, q # p
because p never removes any vertices or edges from its own graph. Therefore, at time
t' process q receives a message with a graph that contains v but is not a supergraph of
27
Gp(t). The sender of that message must have had a graph that contains v but is not a
supergraph of Gp(t) before time t', contrary to the choice of Gq(t').
Recall that we are considering a fixed run of Tv?n, with failure pattern F, and fail-
ure detector history Hv ? D(F). We now prove that the graphs constructed by the
communication component of TD?n satisfy certain properties. The reader should note
the similarity between the first four and the four properties of the graphs defined in
Section 6.1.
Lemma 26: For any correct process p and time t:
1. The vertices of Gp(t) are of the form [pt,d?] where p' c II and d' ? 1ZD If ?`, d'] is
a vertex of Gp(t), then there is a time t' such that p' ? F(t') and d' = HD(p', t').
2. If [qi,di] H [q2,d2] is an edge of Gp(t) and d1 = H?(q1,t1) and d2 = Hv(q2,t2)
then t1 ? t2.
3. Gp(t) is transitively closed.
4. There is a time t' > t and a failure detector value d such that for all vertices ?`, d']
of Gp(t), [p',d'] [p,d] is an edge of Gp(t').
5. Gp(t) is a subgraph of Gp(t'), for all t' > t.
6. For all correct q, there is a time t' > t such that Gp(t) is a subgraph of Gq(t').
PROOF:
Property 1 : Consider the first graph that contains the vertex W, d']. By Lemma 25(1),
this graph is Gpi(t') for some time t', and p' saw d' at time t'. This means that
? F(t') (otherwise p' would not have taken a step at time t' and would not have
seen d'), and d' = Hv(p',t'), as wanted.
Property 2 : By Lemma 25(2), [q1,d1] [q2,d2j is an edge of Cq2(t2). Let t' be the
time when q? inserted vertex [q1, d1] into Gq2. Of course, t' < t2. There are two
cases:
1. t' < t2. By Lemma 25(1), [q1,d1] was not in any graph before time t1. Thus,
t1 < t' and from the hypothesis of this case, t1 < t2.
2. t' = t. Then q? received a graph containing [q?, d1] at t2. Let t" be the time
when this graph was sent. Of course, ? t2. By Lemma 25(1), [qi, d1] was
not in any graph before t1, and therefore t1 < t". Thus, t1 ? t2.
28
Property 3 : Let [q1,d1] o+., H [q?,d?] be a path in Gp(t). We must show that there
is an edge [qi,di] H [q?,d?] ill Gp(t).
Let tj be the time when q? inserted [q?, dj] in Gq?,for 1< i ? k. By induction on i we
show that [q1, d1] . . . [q?, dj] is a path in Gqj (tj). The basis, i = 1, is trivial. For
the induction step, suppose that [q1, d1] ... [q??1, di?1] is a path in Gqj?1 (tj?1).
Since [q??i,d??1J [q?,d?] is an edge in Gp(t), by Lemma 25(2), it is also an edge
in Gqj(ti). Since [Q?1,dj?1j is a vertex in Gqj(ti), by Lemma 25(3), Gqj?1(ti?i) is
a subgraph of Gqj(ti). In particular, Cqj(ti) contains the path [q?, d1?			...
[q??i,d??i]. Thus, [q1,d1]			H ki,di] is a path in Gqj(ti), as wanted.
Therefore, the vertices [q1,d1],... , [qk, dk] are all in Gq?(t?) At time tk, qk adds
an edge from every other vertex to [q?, dk]. Thus, the edge [q1, d1] [qk, dk] is in
Gq?(tk). By Lemma 25(3), Gq?(t?) is a subgraph of Gp(t) (since the latter contains
[q?,d?]). Therefore, [q1,di] H [q?,d?j is in Gp(t), as wanted.
Property 5 : Once a vertex or edge is added to Gp it is not removed.
Property 4 : Since p is correct, it takes a step at some time t' after t. In the failure
detector query phase of this step, p queries its failure detector module and obtains
a value, say d. In Line 2 of this step, p adds the vertex [p,dj to Gp and an edge
from all other vertices of Gp(t') to [p,d]. From Property 5, Gp(t) is a subgraph of
Gp(t'), hence the result follows.
Property 6 : Since p is correct, it eventually sends Gp(t) to all processes, including
q (this occurs in p's first execution of Line 4 after time t). Since q is correct, it
eventually receives Gp(t), and then replaces Gq with Gq U Gp(t), say at time t'. So,
Gp(t) is a subgraph of Gq(t').			El
Property 5 of the above lemma allows us to define Gp? = UtET Gp(t). From Property 6,
we get:
Lemma 27: For any correct processes p and q, Gp? = Gq?
PROOF: Let 0 be any vertex or edge of Gp? i.e.,there is a time t at which 0 is in Gp(t).
From Lemma 26 (6), there is a time t' such that Gp(t) is a subgraph of Gq(t'). Thus 0
is in Cq? Thus Cp? is a subgraph of Gq? By a symmetric argument, Gq? is a subgraph
of			hence			= Gq?			El
Lemma 27 allows us to define the limit graph G to be Gp? for any correct process p. The
first four properties of Lemma 26 immediately imply:
Lemma 28: The limit graph G satisfies the four properties of the DAG defined in
Section 6.1.
As before, T? denotes the tagged simulation tree induced by G and initial configuration
P, and T denotes the tagged simulation forest f10,T',... ,
29
6.5.2 The computation component
Since the limit graph & has the four properties of the DAG, we can apply the "central-
ized" selection method of Figure 9 to identify a correct process. This method involved:
o+ Constructing and tagging the infinite simulation forest T induced by G.
o+ Applying a rule to T to select a particular correct process p*.
In the computation component of TD??, each p approximates the above method by
repeatedly:
o+ Constructing and tagging the finitesimulation forest Tp induced by Gp, its present
finite approximation of &
o+ Applying the same rule to Tp to select a particular process.
Since the limit of Yp over time is T, and the information necessary to select p* is in a
finite subgraph ofT, we can show that eventuallypwill keep selecting the correct process
p*, forever.
Actually, p cannot quite use the tagging method of Figure 9: that method requires
knowing which processes are correct! Instead, passigns tag k to a vertex 5 in Tp? if and
only if 5 has a descendent 5' such that p itself has decided k in 5'(P). If p is correct,
this is eventually equivalent to the tagging method of Figure 9. If p crashes, we do not
care how it tags its forest. Also, pcannot use exactly the same selection method as that
of Figure 9: its current simulation forest Tp may not yet have a critical index or contain
any decision gadget (although it eventually will!). In that case, p temporizes by just
selecting itself. The computation component of Tv?n is shown in Figure 11 (compare it
with the selection method of Figure 9).
We first show that Tp? the simulation forest that p constructs, is indeed an increas-
ingly accurate approximation of T (Lemma 29). We then show that the tags that pgives
to any vertex 5 in Tp are eventually the same ones that the tagging rule of Figure 9 gives
to 5 in T (Lemma 30). Let Tp(t) denote Tp at time t, i.e., Tp(t) is the finite simulation
forest induced by Gp(t).
Lemma 29: For any correct p and any time t:
1. Tp(t) is a subgraphiO of T
2. Tp(t) is a subgraph of Tp(t') for all t' > t.
3. UTp(t)?T
tET
`0The subgraph and graph equality relations ignore the tags.
30
fBuild and tag simulation forest Tp induced by CpJ
for i			0,1,... , n:
Tp? H simulation tree induced by Up and I'
for every vertex 5 of Yp?
if 5 has a descendent 5' such that p has decided k in 5?(1i)
then add tag k to 5
f5elect a process from tagged simulation forest ?p1
if there is no critical index then return p
else
Z H smallest critical index
if i is monovalent critical then return p?
else if Tp? has no decision gadgets then return p
else return deciding process of the smallest decision gadget in Yp?
Figure 11: Process p's computation component
(1)
(2)
(3)
PROOF:
Property 1 Let 5 be any vertex of tree Tp?(t) (for some i 0 < i ? n). From the
definition of Yp?(t) 5 is compatible with some path g of Gp(t) and applicable to
I?. Since Gp(t) is a subgraph of C, g is also a path of C. Thus, 5 is compatible
with G; since it is also applicable to Pit is a vertex of T?.
Similarly, let 5 5' be an edge e of Yp?(t) Since 5 and 5' are also vertices of T?
and 5' = 5. e, 5 H 5' is also an edge of T?.
Property 2 : Follows from Lemma 26 (5).
Property 3 : We first show that T is a subgraph of UtET Tp(t). Let 5 be any vertex of
any tree T? of T. From the definition of T?, 5 is compatible with some finite path
g of G and is applicable to P. Since C = UtET Cp(t) and g is a finite path of G,
there is a time t such that g is also a path of Gp(t). Since 5 is compatible with g
of Gp(t) and is applicable to ii, 5 is a vertex of Yp?(t)
Let 5 5' be any edge e of T?. By the argument above, there is a time t after
which both 5 and 5' are vertices of T? Since 5' = 5 e, after time t the edge e is
also in T? Thus, every vertex and
every edge of T is also in UtET Tp(t), i.e., T is
a subgraph of UtET Tp(t).
By Property 1, UtET Tp(t) = T.
31
Lemma 30: Let p be any correct process, and 5 be any vertex of Tp. There is a time
after which the tags of 5 in Tp are the same as the tags of 5 in T
PROOF: Suppose that at some time t, p assigns tag k to vertex 5 of tree Yp? This means
that 5 has a descendent 5' in Tp?(t) such that p has decided k in 5'(I'). By Lemma 29(1),
5' is also a descendent of 5 in T?, and since p is correct, 5 has tag k in T' as well.
Conversely, suppose a vertex 5 of a tree T? of T has tag k. We show that, eventually,
p also assigns tag k to 5 in Tp' Since 5 has tag k in T?, 5 has a descendent 5' in T'
such that some correct process has decided k in 5t(1i) (cf. tagging rule in Figure 9).
By Lemma 12(1), there is a descendent 5" of 5' in T', such that all correct processes,
including p, have decided in 5t1(1i). By Lemma 7, 511(P) is a configuration of a partial
run of Consensn?. By the Agreement property of Consensus, p must have decided k in
5"(1i). Consider the path that starts from the root of T' and goes to vertex 5 and then
to 5". By Lemma 29(3), there is a time t after which this path is also in Yp? Therefore,
when p executes the tagging rule of Figure 11 after time t, p assigns tag k to 5 in Tp?
(because p has decided k in 5"(I'), and 5" is a descendent of 5 in
Recall that p* is the correct process obtained by applying the selection rule of Figure 9
to the infinite simulation forest T. We now show that there is a time after which any
correct p always selects p* when it applies the corresponding selection rule of Figure 11 to
its own finite approximation of the simulation forest Yp. Roughly speaking, the reason
is as follows. By Lemma 30, there is a time t after which the tags of all the roots in
p's forest Yp are the same as in the infinite forest T. Since these tags determine the
sets of monovalent and bivalent critical indices, after time t these sets according to p are
the same as in T. Let i be the minimum critical index in these sets, and consider the
situation after time t. If i is monovalent critical, the selection rule of Figure 11 selects p?,
which is what p* is in this case. If i is bivalent critical, then p selects the deciding process
of its current minimum decision gadget of Tp? (if it has one). This case is examined below.
Let ?? be the minimum decision gadget of T' (so, p* is the deciding process of ?*)
For a while, ?? may not be the minimum decision gadget of T?. T
?? (and its tags) is not yet in T?. However, by Lemmata 29(3) ? his may be because
p			and 30, ?? (including its
tags) will eventually be in Tp? Alternatively, it may be because Tp? contains a subgraph
? whose encoding is smaller than ?*??, and for a while ? looks like a decision gadget
according to its present tags. However, by Lemma 30, p will eventually determine all
the tags of ?, and discover that ? is not really a decision gadget. Since there are only
finitely many graphs whose encoding is smaller than ?*??? p will eventually discard all
the "fake" decision gadgets (like ?) that are smaller than ??, and then select ?? as its
minimum decision gadget. After that time, p always selects the deciding process of ??
--H which is precisely p*, in this case.
Theorem 31: For all correct processes p, there is a time after which OUtPUtp =
forever.
PROOF: Let i* denote the critical index selected by Line 1 of Figure 9 applied to T. By
32
Lemma 30, there is a time tinit after which every root of Tp has the same tags as the
corresponding root ofT. Thus after time tinit, P always sets z' = i* in Line 1 of Figure 11.
We now show that there is a time after which the computation component of p (Figure
11) always return p*. There are two cases:
1.
2.
is monovalent critical, In this case, p* is process pi (by Line 2 of the selection
rule Figure 9). Similarly, after time t???t: (a) p always sets i to i* (Line 1 of
Figure 11); (b) p always returns pi (Line 2 of Figure 11).
is bivalent critical, Let ?? denote the smallest decision gadget of T??. In this case
p* is the deciding process of?*. Since `t* is a finite subgraph of T??, by Lemma 29(3),
there is a time after which `t* is also a subgraph of Yp? By Lemma 30, there is a
time t? after which all the (finitely many) vertices of ?? receive the same tags in
T? and Tp?? Thus after time t??, ?? is a also decision gadget of
Since each graph is encoded as a unique natural number, there are finitely many
graphs with a smaller encoding than ??. Let ? denote the set of graphs with a
smaller encoding than ??, and't be any graph in ?. We show that there is a time
after which ? is not a decision gadget of Tp? There are two cases:
(a) ? is not a subgraph of T??. In this case, by Lemma 29(1), ? is never a subgraph
of
is a subgraph of T?* Since `t* is the smallest decision gadget of T? and ?
is smaller than `t*, `t is not a decision gadget of T??. By Lemma 30, there is
a time t? after which all the (finitely many) vertices of ? have the same tags
in T? and Tp?' Thus after time t?, `t is not a decision gadget of Tp?
Since ? is finite, there is a time t? after which no graph in g is a decision gadget
of
Consider the process that is returned by the computation component of p (Fig-
ure 11) at any time t > max(ti?it,t?.,t?). Since t > tinit, P always sets i to i* in
Line 1. Since t > t??, ?? is a decision gadget of Tp?(t) Finally, since t > t?, `t* is
the smallest decision gadget of Tp?(t) Thus, since i* is bivalent, at any time after
max(t????, t?., tg), Line 3 of Figure 11 returns the deciding process of ??. Therefore,
after time max(t???t, t??, t?), the computation component of p always returns p*,
From the above, there is a time after which p sets output? & p*, forever, in Line 3 of
Figure 10.
We now have all the pieces needed to prove our main result, Theorem 2 in Section 5:
Theorem 2: For all environments ?, if a failure detector D can be used to solve Con-
sensus in S, then D ??
PROOF: Consider the execution of algorithm Tv?n in any environment ?. By Theo-
rem 31, there is a time after which all correct processes set OfltpUtp = p*, forever. By
33
Theorem 24, p* is a correct process. Thus, Tv?n is a reduction algoritlim that transforms
D into ?. In other words, ? is reducible to D.
7 Discussion
7.1 Granularity of atomic actions
Our model incorporates very strong assumptions about the atomicity of steps. First,
the three phases of each step are assumed to occur indivisibly, and at a single time. In
particular, the failure of a process cannot happen in the "middle of a step". This allows
us to associate a single time t with a step and think of the step as occuring at that
time. Second, in the send phase of a step a message is sent to all processes. Given that
the entire step is indivisible, this means that either all or none of the correct processes
eventually receive the message sent in a step. Finally, no two steps can occur at the
same time.11 These assumptions are convenient because they make the formal model
simpler to describe. Also, they are consistent with those made in the model of [FLP85J
that provided the impetus for this work.
On the other hand, in [CT9l] a model with weaker properties is used. There, the
three phases of a step need not occur indivisibly, and may occur at different times. Even
within the send phase, the messages sent to the different processes may be sent at different
times. Thus, a failure may occur in the middle of the send phase, resulting in some correct
processes eventually receiving the messages sent to them in that step while others never
do. Also, actions of different processes may take place simultaneously, subject to the
restriction that a message can only be received strictly ajter it was sent. Since [CT9lJ is
mainly concerned with showing how to use various types of failure detectors to achieve
Consensus, the use of a weaker model strengthens the results. (In fact, the negative
results of [CT91] hold even in the model of this paper, with the stronger atomicity
assumptions.)
The question naturally arises whether our result also applies to this weaker model.
In other words, if a failure detector D can be used to solve Consensus in the weak model,
is it true that we can transform D to )`V in that model? It turns out that the answer is
affirmative. To see this, first note that if D solves Consensus in the weak model then it
surely solves Consensus in the strong model. By our result, D can be transformed to VV
in the strong model. It remalns to show that D can be transformed to W in the weak
model. This is not obvious, since it is conceivable that the extra properties of the strong
model are crucial in the transformation of D to W. Fortunately, the transformation
presented in this paper actually works even in the weak model!
To see this, it is sufficient to make sure that the communication component of the
transformation (cf. Figure 10 in Section 6.5.1) constructs graphs that satisfy the proper-
ties listed in Lemma 26, even if we run it in the weak model. It is not difficult to verify
11This is reflected in our formal model by the fact that the list of times in a run (which indicate when
the events in the run's schedule occur) is increasing.
34
that this is indeed so. The proof is virtually the same, except for the fact that we must
distinguish the time tin which a process p queries its failure detector and the time t' in
which p adds the value it saw into Gp. In our proof we assume that t = t'; in the weak
model we would have t ? t'. Similar comments apply to all actions within a step that are
no longer assumed to occur at the same instant of time. These changes make the proofs
slightly more cumbersome, since we must introduce notation for all the different times
in which relevant actions within a step take place, but the reasoning remains essentially
the same.12
Thus, our result is not merely a fortuitous consequence of some whimsical choice of
model. We view the robustness of the result across different models of asynchrony as
further testimony to the significance of the failure detector W.
7.2 Failure detection and partial synclirony
The fundamental reason why Consensus cannot be solved in completely asynchronous
systems is the fact that, in such systems, it is impossible to reliably distinguish a pro-
cess that has crashed from one that is merely very slow. In other words, Consensus
is unsolvable because accurate failure detection is impossible. On the other hand, it
is well-known that Consensus is solvable (deterministically) in completely synchronous
systems --H that is, systems where all processes take steps at the same rate and each
message arrives at its destination a fixed and known amount of time after it is sent. In
such a system we can use timeouts to implement a "perfect" failure detector --H i.e., one
in which no process is ever wrongly suspected, and every faulty process is eventually
suspected. Thus the ability to solve Consensus in a given system is intimately related to
the failure detection capabilities of that system. This realization led to the extension of
the asynchronous model of computation with failure detectors in [CT91]. In that paper
Consensus is shown to be solvable even with very weak failure detectors that could make
an infinite number of "mistakes
A different tack on circumventing the unsolvability of Consensus is pursued in [DDS87]
and [DLS88]. The approach of those papers is based on the observation that between
the completely synchronous and completely asynchronous models of distributed systems
there lie a variety of intermediate "partially synchronous" models. For instance, in one
model of partial synchrony, processes take steps at the same rate, but message delays
are unbounded (albeit finite). Alternatively, it may be known that message delays are
bounded, but the actual bound may be unknown. In yet another variation, the eventual
maximum message delay is known, but during some initial period of finite but unknown
duration some messages may experience longer delays. These and many other models of
`2Another problem that must be confronted is that in the proofs of Lemmata 25 and 26 we often refer
to the "first graph" in which a vertex or edge is present. In the strong model there is no difficulty with
this, since processes cannot execute steps simultaneously. In the weak model, we have to justify that it
makes sense to speak of the "first" graph to contain a vertex or edge, in spite of the fact that certain
actions can be executed at the same time. The fact that a message can be received only after it was
sent is needed here.
35
partial synchrony are studied in [DDS87] and [DLS88J, and the question of solvability of
Consensus in each of them is answered either positively or negatively.
In particular, [DDS87j defines a space of 32 models by considering five key parameters,
each of which admits a "favourable" and an "unfavourable" setting. For instance, one
of the parameters is whether the maximum message delay is known (favourable setting)
or not (unfavourable setting). Each of the 32 models corresponds to a particular setting
of the 5 parameters. [DDS87j identifies four "minimal" models in which Consensus is
solvable. These are minimal in the sense that the weakening of any parameter from
favourable to unfavourable would yield a model of partial synchrony where Consensus
is unsolvable. Thus, within the space of the models considered, [DDS87j and [DLS88]
delineate precisely the boundary between solvability and unsolvability of Consensus, and
provide an answer to the question "What is the least amount of synchrony sufficient to
solve Consensus?".
Failure detectors can be viewed as a more abstract and modular way of incorporating
partial synchrony assumptions into the model of computation. Instead of focusing on
the operational features of partial synchrony (such as the five parameters considered
in [DDS87J), we can consider the axiomatic properties that failure detectors must have
in order to solve Consensus. The problem of implementing a given failure detector in
a specific model of partial synchrony becomes a separate issue; this separation affords
greater modularity.
To see the connection between partial synchrony and failure detectors, it is useful to
examine how one might go ab?ut implementing a failure detector. By the impossibility
result of [FLP85], a failure detector that can be used to solve Consensus cannot be
implemented in a completely asynchronous system. Now consider partially synchronous
systems in which correct processes have accurate timers (i.e., they can measure elapsed
time). If in such a system message delays are bounded and the maximum delay is known,
we can use timeouts to implement the "perfect" failure detector described above. In a
weaker system where message delays are bounded but the maximum delay is not known,
we can implement a failure detector satisfying a weaker property: eventually no correct
process is suspected. This can be done by using timeouts of increasing length; once the
timeout period has been increased sufficiently to exceed the unknown maximum delay,
no correct process will be suspected. A failure detector with the same property can also
be implemented in a distributed system where the eventual maximum message delay
is known, but messages may be delayed for longer during some initial period of finite
but unknown duration. With these remarks we wish to illustrate two points: First,
that stronger failure detectors correspond to stronger models of partial synchrony; and
second, that the same failure detector can be implemented in different models of partial
synchrony.
Studying failure detectors rather than various models of partial synchrony has several
advantages. By determining whether Consensus is solvable using some specific failure
detector we thereby determine whether Consensus is solvable in all systems in which
that failure detector can be implemented. An algorithm that relies on the axiomatic
36
properties of a given failure detector is more general, more modular, and simpler to
understand than one that relies directly on some specific operational features of partial
synchrony (that can be used to implement the given failure detector).
From this more abstract point of view, the question "What is the least amount of
synchrony sufficient to solve Consensus?" translates to "What is the weakest failure
detector sufficient to solve Consensus?". In contrast to [DDS87J, which identified a set
of minimal models of partial synchrony in which Consensus is solvable, we are able to
exhibit a single minimum failure detector that can be used to solve Consensus. The
technical device that made this possible is the notion of reduction between failure de-
tectors. We suspect that a corresponding notion of reduction between models of partial
synchrony, although possible, would be more complex. This is because there are models
which are not comparable in general (in the sense that there are tasks that are possible
in one but not in the other and vice versa), although they are comparable as far as
failure detection is concerned --H which is all that matters for solving Consensus! In this
connection, it is useful to recall our earlier observation, that the same failure detector
can be implemented in different (indeed, incomparable) models of partial synchrony.
7.3 Failure detectors with infinite range of output values
The failure detectors in [RB9l, CT9l] only output lists of processes suspected to have
crashed. Since the set of processes is finite, the range of possible output values of these
failure detectors is also finite. In this paper our model allows for failure detectors with
arbitrary ranges of output values, including the possibility of infinite ranges! We illustrate
the significance of this generality by describing a natural class of failure detectors whose
range of output values is infinite (though each value output is finite).
Example: One apparent weakness with our formulation of failure detection is that a
brief change in the value output by a failure detector module may go unnoticed. For
example, process p`s module of the given failure detector, D, may output d1 at time
t1, d2 at a later time t2 and d1 again at time t3 after t2. If due to the asynchrony of
the system p does not take a step between time t1 and t3, p may never notice that its
failure detector module briefly output d2. A natural way of overcoming this problem
is to replace D with failure detector D' that has the following property: D' maintains
the same list of suspects as D but when queried, D' returns the entire history of its list
of suspects up to the present time. In this manner, correct processes are guaranteed
to notice every change in D"s list of suspects. As the system continues executing, the
values output by D' grow in size. This means that D' has an infinite range of output
values.
However, since D is a function of F, the failure pattern encountered, V' is also a
function of F, and can be described by our model. Thus, the result in this paper applies
to V', a natural failure detector with infinite range of output values.
37
Acknowledgements
We would like to thank Prasad Jayanti for his valuable comments on various versions
of this paper. We would also like to thank Cynthia Dwork, Dexter Kozen and the
distributed systems group at Cornell for putting up with half-baked ideas,
References
?ABD+87]
[BMZ88j
Hagit Attiya, Amotz Bar-Noy, Danny Dolev, Daphne Koller, David Peleg,
and Riidiger Reischuk. Achievable cases in an asynchronous environment. In
Proceedings of the Twenty-Eighth Symposium on Foundations of Computer
Science, pages 337--H346. IEEE Computer Society Press, October 1987.
Ofer Biran, Shlomo Moran, and Shmuel Zaks. A combinatorial characteriza-
tion of the distributed tasks that are solvable in the presence of one faulty
processor. In Proceedings of the Seventh ACM Symposium on Principles of
Distributed Computing, pages 263--H275, August 1988.
[BW87j M. Bridgiand and R. Watro. Fault-tolerant decision making in totally asyn-
chronous distributed systems. In Proceedings of the Sixth ACM Symposium
on Principles of Distributed Computing, pages 52--H63, August 1987.
[CD89] Benny Chor and Cynthia Dwork. Randomization in byzantine agreement.
Advances in Computer Research, 5:443--H497,1989.
[CT9l]
DDS87]
Thshar Chandra and Sam Toueg. Unreliable failure detectors for asyn-
chronous systems (preliminary version). In Proceedings of the Tenth ACM
Symposium on Principles of Distributed Computing, pages 325--H340. ACM
Press, August 1991.
Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. On the minimal syn-
chronism needed for distributed consensus. Journal of the ACM, 34(1):77--H97,
January 1987.
[DLP+86] Danny Dolev, Nancy A. Lynch, Shlomit 5. Pinter, Eugene W. Stark, and
William E. Weihl. Reaching approximate agreement in the presence of faults.
Journal of the ACM, 33(3):499--H516, July 1986.
[DLS88] Cynthia Dwork, Nancy A. Lynch, and Larry Stockmeyer. Consensus in the
presence of partial synchrony. Journal of the ACM, 35(2):288--H323, April 1988.
Michael J. Fischer, Nancy A. Lynch, and Michael 5. Paterson. Impossibility of
distributed consensus with one faulty process. Journal of the ACM, 32(2):374--H
382, April 1985.
[FLP85]
38
Aleta Ricciardi and Ken Birman. Using process groups to implement failure
detection in asynchronous environments. In Proceedings of the Tenth ACM
Symposium on Principles of Distributed Computing, pages 341--H351. ACM
Press, August 1991.
[RB91]
