BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR94-1426 
ENTRY:: 1994-06-27
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: The Weakest Failure Detector for Solving Consensus
AUTHOR:: Chandra, Tushar Deepak
AUTHOR:: Hadzilacos, Vassos 
AUTHOR:: Toueg, Sam
DATE:: May 1994
PAGES:: 42
NOTES:: Replaces 92-1293
ABSTRACT:: 
We determine what information about failures is necessary and sufficient to 
solve Consensus in asynchronous distributed systems subject to crash failures. 
In [CT91], we proved that $\Diamond\cal W$, a failure detector that provides 
surprisingly little information about which processes have crashed, is 
sufficient to solve Consensus in asynchronous systems with a majority of 
correct processes. In this paper, we prove that to solve Consensus, any 
failure detector has to provide at least as much information as 
$\Diamond\cal W$. Thus, $\Diamond\cal W$ is indeed the weakest failure 
detector for solving Consensus in asynchronous systems with a majority of 
correct processes.
END:: CORNELLCS//TR94-1426 
BODY::
The Weakest Failure Detector for
Solving Consensus
Tushar Deepak Chandra*
Vassos Hadzilacos**
Sam Toueg
TR 94-1426
(replaces TR 92-1293)
May 1994
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
 Research supported by NSF grants CCR-8901 780 and CCR-91 02231,
DARPNNASA Ames grant NAG-2-593, grants from the IBM Endicoft Programming
Laboratory and Siemens Corp., and a grant from the Natural Science and
Engineering Research Council of Canada. A preliminary version of this paper
appeared in Proceedmgs of the Eleventh ACM Symposium on Principles of
Distributed Computing, pages 147-158. ACM press, August 1992.
* Also supported by an IBM graduate fellowship.
** Computer Systems Research Institute, University of Toronto, 6 King's College
Road, Toronto, Ontario M5S 1 Al.
Dept. of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853.
The Weakest Failure Detector for Solving
Consensus*
Tusliar Deepak Cliandrat
Vassos Hadzilacos?
May 9,1994
Abstract
Sam Toueg
We determine what information about failures is necessary and sufficient to
solve Consensus ill asynclironous distributed systems subject to crash failures. In
[CT91J, we proved that ?W, a failure detector that provides surprisingly little
information about which processes have crashed, is sufficient to solve Consensus
ill asynchronous systems with a majority of correct processes. In this paper, we
prove that to solve Consensus, any failure detector has to provide at least as much
information as ?W. Thus, ?YV is indeed the weakest failure detector for solving
Consensus ill asynchronous systems with a majority of correct processes.
1 Introduction
1.1 Background
The asynchronous model of distributed computing has been extensively studied. Infor-
mally, an asynchranous distributed system is one in which message transmission times
and relative processor speeds are both unbounded. Thus an algorithm designed for
an asynchronous system does not rely on such bounds for its correctness. In practice,
asynchrony is introduced by unpredictable loads on the system.
*Research supported by NSF grants CCR-8901780 and CCR-9102231, DARPA/NASA Ames grant
NAG-2-593, grants from the IBM Endicott Programming Laboratory and Siemens Corp, and a grant
from the Natural Sciences and Engineering Research Council of Canada. A preliminary version of this
paper appeared in Proceedings of?he Eleven?h ACM Symposium on Principles of Distributed Computing,
pages 147--H158. ACM press, August 1992.
?Al50 supported by an IBM graduate fellowship.
?Comput& Systems Research Institute, University of Toronto, 6 King's College Road, Toronto,
Ontario M5S lAl.
Dept. of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853.
2
Although the asynchronous model of computation is attractive for the reasons out-
lined above, it is well-known that many fundamental problems of fault-tolerant dis-
tributed computing that are solvable in synchronous systems, are unsolvable in asyn-
chronous systems. In particular, it is well-known that Consensus, and several forms
of reliable broadcast, including Atomic Broadcast, cannot be solved deterministically in
an asynchronous system that is subject to even a single crash failure [FLP85, DDS87J.
Essentially, these impossibility results stem from the inherent difficulty of determining
whether a process has actually crashed or is only "very slow".
To circumvent these impossibility results, previous research focused on the use of ran-
domization techniques [CD89J,the definition of some weaker problems and their solutions
[DLP+86, ABD+87, BW87, BMZ88J, or the study of several models of partial synchrony
[DDS87, DLS88J. liowever, the impossibility of deterministic solutions to many agree-
ment problems (such as Consensus and Atomic Broadcast) remains a major obstacle to
the use of the asynchronous model of computation for fault-tolerant distributed comput-
ing.
An alternative approach to circumvent such impossibility results is to augment the
asynchronous model of computation with a failure detector. Informally, a failure detector
is a distributed oracle that gives (possibly incorrect) hints about which processes may
have crashed so far: Each process has access to a local failure detector module that
monitors other processes in the system, and maintains a list of those that it currently
suspects to have crashed. Each process periodically consults its failure detector module,
and uses the list of suspects returned in solving Consensus.
A failure detector module can make mistakes by erroneously adding processes to its
list of suspects: i.e., it can suspect that a process p has crashed even though p is still
running. If it later believes that suspecting p was a mistake, it can remove p from its list.
Thus, each module may repeatedly add and remove processes from its list of suspects.
Furthermore, at any given time the failure detector modules at two different processes
may have different lists of suspects.
It is important to note that the mistakes made by a failure detector should not prevent
any correct process from behaving according to specification. For example, consider an
algorithm that uses a failure detector to solve Atomic Broadcast in an asynchronous
system. Suppose all the failure detector modules wrongly (and permanently) suspect
that a correct process p has crashed. The Atomic Broadcast algorithm must still ensure
that p delivers the same set of messages, in the same order, as all the other correct
processes. Furthermore, if p broadcasts a message m, all correct processes must deliver
m.1
In [CT9lJ, we showed that a surprisingly weak failure detector is sufficient to solve
Consensus and Atomic Broadcast in asynchronous systems with a majority of correct
processes. This failure detector, called the eventually weak failure detector and denoted
`A different approach was taken in [RB9lj: a correct process that is wrongly suspected to have
crashed, voluntarily leaves the system. It may later rejoin the system by assuming a new identity.
3
A, here, satisfies only the following two properties:2
1. There is a time after which every process that crashes is always suspected by some
correct process.
2. There is a time after which some correct process is never suspected by any correct
process.
Note that, at any given time 1, processes cannot use A, to determine the identity of a
correct process. Furthermore, they cannot determine whether there is a correct process
that will not be suspected after time t.
The failure detector A, can make an infinite number of mistakes. In fact, it can forever
add and then remove some correct processes from the lists of suspects (this reflects
the inherent difficulty of determining whether a process is just slow or has crashed).
Moreover, some correct processes may be erroneously suspected to have crashed by all
the other processes throughout the entire execution.
The two properties of A, state that eventually something must hold forever; this
may appear too strong a requirement to implement in practice. However, when solving a
problem that "terminates", such as Consensus, it is not really required that the properties
hold forever, but merely that they hold for a sufficiently long time, i.e., long enough for
the algorithm that uses the failure detector to achieve its goal. For instance, in practice
the algorithm of [CT9iJ that solves Consensus using A, only needs the two properties of
A, to hold for a relatively short period of time.3 However, in an asynchronous system it
is not possible to quantify "sufficiently long", since even a single process step or a single
message transmission is allowed to take an arbitrarily long amount of time. Thus it is
convenient to state the properties of A, in the stronger form given above.
1.2 The problem
The failure detection properties of A, are sufficient to solve Consensus in asynchronous
systems. But are they necessary? For example, consider failure detector A that satisfies
Property 1 of A, and the following weakening of Property 2:
There is a time after which some correct process is never suspected by at
least 99% of the correct processes.
A is clearly weaker than A,. Is it possible to solve Consensus using A? Indeed what
is the weakest failure detector sufficient to solve Consensus in asynchronous systems?
In trying to answer this fundamental question we run into a problem. Consider failure
detector 13 that satisfies the following two properties:
2In [CT9lj, this was denoted ?)`V.
3In that algorithm processes are cyclically elected as "coordinators". Consensus is achieved as soon
as a correct coordinator is reached, and no process suspects it to have crashed while this coordinator is
trying to enforce consensus.
4
1. There is a time after which every process that crashes is always suspected by all
correct processes.
2. There is a time after which some correct process is never suspected by a majority
of the processes.
It seems that 13 and A, are incomparable: 13's first property is stronger than A,'s, and 13's
second property is weaker than A,'s. Is it possible to solve Consensus in an asynchronous
system using 13? The answer turns out to be "yes" (provided that this asynchronous
system has a majority of correct processes, as A, also requires). Since A, and 13 appear
to be incomparable, one may be tempted to conclude that A, cannot be the "weakest"
failure detector with which Consensus is solvable. Even worse, it raises the possibility
that no such "weakest" failure detector exists.
However, a closer examination reveals that 13 and A, are indeed comparable in a
natural way: There is a distributed algorithm T??? that can transform 13 into a failure
detector with the Properties 1 and 2 of A, Ts?vv works for any asynchronous system
that has a majority of correct processes. We say that A, is reducible to 13 in such a
system. Since TB?vv is able to transform 13 into A, in an asynchronous system, 13 must
provide at least as much information about process failures as A, does. Intuitively, 13 is
at least as strong as A,.
1.3 The result
In [CT91J, we showed that A, is sufficient to solve Consensus in asynchronous systems
if and only if n> 2f (where n is the total number of processes, and f is the maximum
number of processes that may crash). In this paper, we prove that A, is reducible
to any failure detector D that can be used to solve Consensus (this result holds for
any asynchronous system). We show this reduction by giving a distributed algorithm
Tv???v that transforms any such D into A,. Therefore, A, is indeed the weakest failure
detector that can be used to solve Consensus in asynchronous systems with n > 2f.
Furthermore if n < 2f, any failure detector that can be used to solve Consensus must
be strictly stronger than A,.
The task of transforming any given failure detector D (that can be used to solve
Consensus) into A, runs into a serious technical difficulty for the following reasons:
To strengthen our result, we do not restrict the output of D to lists of suspects.
Instead, this output can be any value that encodes some information about failures.
For example, a failure detector D should be allowed to output any boolean formula,
such as "(not p) and (q or r)" (i.e., p is up and either q or r has crashed)--Hor any
encoding of such a formula. Indeed, the output of D could be an arbitrarily complex
(and unknown) encoding of failure information. Our transformation from D into
A, must be able to decode this information.
5
o+ Even if the failure information provided by D is not encoded, it is not clear how to
extract from it the failure detection properties of VV. Consequently, if D is given
in isolation, the task of transforming it into W may not be possible.
Fortunately, since D can be used to solve Consensus, there is a corresponding algorithm,
Consensusv, that is somehow able to "decode" the information about failures provided
by D, and knows how to use it to solve Consensus. Our reduction algorithm, Tv?w uses
Consensusv to extract this information from D and obtain VV.
2 The model
We consider asynchronous distributed systems in which there is no bound on message
delay, clock drift, or the time necessary to execute a step. Our model of asynchronous
computation with failure detection is patterned after the one in [FLP85j. The system
consists of a set of n processes, II = f?i ....... ,pn J. Every pair of processes is connected
by a reliable communication channel.
To simplify the presentation of our model, we assume the existence of a discrete global
clock. This is merely a fictional device: the processes do not have access to it. We take
the range T of the clock's ticks to be the set of natural numbers.
2.1 Failures and failure patterns
Processes can fail by crashing, i.e., by prematurely halting. A failure pattern F is a
function from T to 2?, where F(t) denotes the set of processes that have crashed through
time t. Once a process crashes, it does not "recover", i.e., Vt : F(t) C F(t + 1). We
define crashed(F) = UtET F(t) and correct(F) = II --H crashed(F). If p E crashed(F) we
say p crashes in F and if p ? correct(F) we say p is correct in F. We consider only
failure patterns F such that at least one process is correct, i.e., correct(F) # ?.
2.2 Failure detectors
Informally, a failure detector provides (possibly incorrect) information about the failure
pattern that occurs in an execution. Associated with each failure detector is a (possibly
infinite) range 1? of values output by that failure detector. A failure detector history H
with range ? is a function from II x T to 1?. H(p, t) is the value of the failure detector
module of process p at time t. A failure detector D is a function that maps each failure
pattern F to a set of failure detector histories with range 1zv (where 1zv denotes the
range of failure detector outputs of D). D(F) denotes the set of possible failure detector
histories permitted by D for the failure pattern F.
For example, consider the failure detector W mentioned in the introduction. Each
failure detector module of >V outputs a set of processes that are suspected to have
crashed: in this case 1z? = 2?. For each failure pattern F, W(F) is the set of all failure
detector histories H?? with range 1z??? that satisfy the following properties:
6
1. There is a time after which every process that crashes in F is always suspected by
some process that is correct in F:
?t E T, Vp E crashed(F), ?q E correct(F), Vt' > 1: p E H??(q, t')
2. There is a time after which some process that is correct in F is never suspected by
any process that is correct in F:
?t E T, ?p E correct(F), vq E correct(F), Vt'> t : p ? H?v(q, t')
Note that we speefy a failure detector D as a function of the failure pattern F of
an execution. However, this does not preclude an implementation of D from using other
aspects of the execution such as when messages are received. Thus, executions with the
same failure pattern F may still have different failure detector histories. It is for this
reason that we allow D(F) to be a set of failure detector histories from which the actual
failure detector history for a particular execution is selected non-deterministically.
2.3 Algorithms
We model the asynchronous communication channels as a message buffer which contains
messages of the form (p, data, q) indicating that process p has sent data addressed to
process q and q has not yet received that message. An algorithm A is a collection of
n (possibly infinite state) deterministic automata, one for each of the processes. A(p)
denotes the automaton running on process p. Computation proceeds in steps of the given
algorithm A. In each step of A, process p performs atomically the following three phases:
Receive phase: p receives a single message of the form (q, data, p) from the message
buffer, or a "null" message, denoted A, meaning that no message is received by p
during this step.
Failure detector query phase: p queries and receives a value from its failure detector
module. We say that p sees a value d when the value returned by p's failure detector
module is d.
Send phase: p changes its state and sends a message to all the processes according to
the automaton A(p), based on its state at the beginning of the step, the message
received in the receive phase, and the value that p sees in the failure detector query
phase.4
4In the send phase, p sends a message to all the processes atomically. As was shown in [FLP85J,
the ability to do so is not sufficient for solving Consensus. An alternative formulation of a step could
restrict a process to sending a message to a single process in the send phase. We can show that both
formulations are equivalent for our purposes (see Section 7.1).
7
The message actually received by the process p in the receive phase is chosen non-
deterministically from amongst the messages in the message buffer destined to p, and
the null message A. The null message may be received even f there are messages in the
message buffer that are destined to p: the fact that m is in the message buffer merely
indicates that rn was sent to p. Since ours will be a model of asynchronous systems,
where messages may experience arbitrary (but finite) delays, the amount of time m
may remain in the message buffer before it is received is unbounded. Indeed, our
model will allow a message sent later than another to be received earlier than the other.
Though message delays are arbitrary, we also want them to be finite. We model this
by introducing a liveness assumption: every message sent will eventually be received,
provided its recipient makes "sufficiently many" attempts to receive messages. All this
will be made more precise later.
We also remark that the non-determinism arising from the choice of the message to be
received reflects the asynchrony of the message buffer --H it is not due to non-deterministic
choices made by the process. The automaton A(p) is deterministic in the sense that the
message that p sends in a step and p's new state are uniquely determined from the
present state of p, the message p received during the step and the failure detector value
seen by p during the step.
To keep things simple we assume that a process p sends a message m to q at most
once. This allows us to speak of the contents of the message buffer as a set, rather than
a multiset. We can easily enforce this by adding a counter to each message sent by p to
q so this assumption does not damage generality.
2.4 Configurations, runs and environments
A configuration is a pair (s, M), where 8 is a function mapping each process p to its local
state, and M is a set of triples of the form (q, data, p) representing the messages presently
in the message buffer. An initial configuration of an algorithm A is a configuration (s, M),
where s(p) is an initial state of A(p) and M = ?. A step of a given algorithm A transforms
one configuration to another. A step of A is uniquely determined by the identity of the
process p that takes the step, the message m received by p during that step, and the
failure detector value d seen by p during the step. Thus, we identify a step of A with a
tuple (p, m, d, A). If the message received in that step is the null message, then m = A,
otherwise m is of the type (--H, --H,p). We say that a step e = (p, m, d, A) is applicable to
a configuration C = (s, M) if and only if m E M u fAJ. We write e(C) to denote the
unique configuration that results when e is applied to C.
A schedule S of algorithm A is a finite or infinite sequence of steps of A. S1 denotes
the empty schedule. We say that a schedule S of an algorithm A is applicable to a
configuration C if and only if (a) 5 = 5j, or (b) S[1J is applicable to C, S[2J is applicable
to S[lj(C), etc (we denote by v[i] the ith element of a sequence v). If 5 is a finite schedule
applicable to C, 5(C) denotes the unique configuration that results from applying 5 to
C. Note 51(C) = C for all configurations C. Let C be any configuration and 5 be any
8
schedule applicable to C. We say that C' is a configuration of the pair (5, C) if there is
a finite prefix 5' of 5 such that C' =
A partial run of algorithm A using a failure detector D is a tuple R = (F, Hv, I, 5, T)
where F is a failure pattern, Hv E D(F) is a failure detector history, I is an initial
configuration of A, 5 is a finite schedule of A, and T is a finite list of increasing time
values (indicating when each step in 5 occurred) such that 151 = ITI, 5 is applicable to
I, and for all i < 151, if 5[ij is of the form (p, m, d, A) then:
o+ p has not crashed by time T[i], i.e., p ? F(T[ij)
o+ d is the value of the failure detector module of p at time T[ij, i.e., d = Hv(p, T[i])
Informally, a partial run of A using D represents a point of an execution of A using D.
A run of an algorithm A using a failure detector D is a tuple R = (F, Hv, 1, 5, T)
where F is a failure pattern, Hv E D(F) is a failure detector history, 1 is an initial
configuration of A, 5 is an infinite schedule of A, and T is an infinite list of increasing
time values indicating when each step in 5 occurred. In addition to satisfying the above
properties of a partial run, a run must also satisfy the following properties:
0 Every correct process takes an infinite number of steps in 5. Formally:
VP Ei correct(F), Vi, aj > i : 5[j] is of the type(p, --H, --H, A)
o+ Every message sent to a correct process is eventually received. Formally:
VP E correct(F), VC = (s, M) of (5,1): m = (q, data, p) E M ?
: 5[i] is of the type (p, m, --H,A))
In [CT91], we proved that any algorithm that uses )`V to solve Consensus requires
n > 2f. With other failure detectors the requirements may be different. For example,
there is a failure detector that can be used to solve Consensus only if Pi and P2 do not both
crash. In general whether a given failure detector can be used to solve Consensus depends
upon assumptions about the underlying "environment". Formally, an environment S (of
an asynchronous system) is set of possible failure patterns.5
5In a synchronous system, assumptions about the underlying environment may also include other
characteristics such as the relative process speeds, the maximum message delay, the degree of clock
synchronization, etc. In such a system, a more elaborate definition of an environment would be required.
3 The Consensus problem
9
In the Consensus problem, each process p has an initial value, 0 or 1, and must reach an
irrevocable decision on one of these values. Thus, the algorithm of process p, A(p), has
two distinct initial states a0? and ?1? signifying that p's initial value is 0 or 1. A(p) also
has two disjoint sets of decision states ?0? and ??1.
We say that algorithm A uses failure detector D to solve Consensus in environment
C if every run R = (F, Hv, 1,5, T? of A using D where F E ? satisfies:
Termination: Every correct process eventually decides some value. Formally:
Vp E correct(F), ac = (s, M) of (5,1): s(p) E ??o U
Irrevocability: Once a correct process decides a value, it remains decided on that value.
Formally, let S[1..i] be the prefix of 5 consisting of the first i elements of 5:
Vp E correct(F),Vk E fO, lJ,Vi < i':
(5[1..iJ(I) = (s,M) A 5[1..i'](I) = (s',M') A s(p) E ??k) ? s'(p) E ??k
Agreement: No two correct processes decide differently. Formally:
Vp,p' E correct(F),VC= (s,M) of (5,I),Vk,k'Ei fo,lJ
E ??k A s(p') E ? k =
Validity: If a correct process decides v, then v was proposed by some process. Formally,
let I = (5o, Mo):
Vpc correct(F),Vk? fo,11 : (?C= (s,M) of (5,1):
s(p) E ??k) ? (?q E II: so(q) --H
4 Reducibility
We now define what it means for an algorithm Tv?vi to transform a failure detector
D into another failure detector V' in an environment E. Algorithm Tv?v' uses D to
maintain a variable output? at every process p. This variable, reflected in the local state
of p, emulates the output of V' at p. Let 0J? be the history of all the output variables
in run R, i.e., 0R(p, t) is the value of output, at time tin run R. Algorithm Tv?vi
transforms D into V' in S if and only if for every run R = (F, Hv, I, 5, T) of Tv?vi
using V, where F E g, 0R ?
10
Tv?vi
Algorithm B uses V'
V' emulated
Figure 1: Transforming V into V'
Given the reduction algorithm Tv?v', anything that can be done using failure de-
tector V' in environment S, can be done using V instead. To see this, suppose a given
algorithm B requires failure detector V' (when it executes in ?), but only V is avail-
able. We can still execute B as follows. Concurrently with B, processes run Tv?v' to
transform V into V'. We modify Algorithm B at process p as follows: whenever p is
required to query its failure detector module, p reads the current value of outpttt, (which
is concurrently maintained by Tv?v') instead. This is illustrated in Figure 1.
Intuitively, since Tv?vi is able to use V to emulate V', V provides at least as much
information about process failures in g as V' does. Thus, if there is an algorithm Tv?vs
that transforms V into V' in S, we write V ?? V' and say that V' is reducible to V in
we also say that V' is weaker than V in E. Clearly, the reducibility relation ?? is
transitive.
Note that, in general, Tv?i' need not emulate all the failure detector histories of
V' (in environment ?); what we do require is that all the failure detector histories it
emulates be histories of V' (in that environment).
5 An outline of the result
In [CT91J we showed that W can be used to solve Consensus in any environment in which
n > 2f. We now show that W is weaker than any failure detector that can be used to
solve Consensus. This result holds for any environment S. Together with [CT91J, this
implies that VV is indeed the weakest failure detector that can be used to solve Consensus
in any environment in which n > 2f.
11
To prove our result, we first define a new failure detector, denoted ?, that is at least
as strong as W. We then show that any failure detector D that can be used to solve
Consensus is at least as strong as ?. Thus, D is at least as strong as
The output of the failure detector module of ? at a process p is a single process, q,
that p currently considers to be correct; we say that p trnsts q. In this case, 1?? = II.
For each failure pattern F, ?(F) is the set of all failure detector histories H? with range
l?n that satisfy the following property:
o+ There is a time after which all the correct processes always trust the same correct
process:
Ei T, ?q E correci(F), VP E correct(F), Vt'> 1: H?(p, t') = q
As with W, the output of the failure detector module of ? at a process p may change
with time, i.e., p may trust different processes at different times. Furthermore, at any
given time t, processes p and q may trust different processes.
Theorem 1: For all environments ?, ? _
PROOF: [Sketch] The reduction algorithm T??? that transforms ? into )`V is as follows.
Each process p periodically sets output, II--H [qJ, where q is the process that p currently
trusts according to ?. It is easy to see that (in any environment S) this output satisfies
the two properties of >V.			0
Theorem 2: For all environments S, if a failure detector D can be used to solve
Consensus in ?, then D ??
PROOF: The reduction algorithm Tv?? is shown in Section 6. It is the core of our result.
0
Corollary 3: For all environments C, if a failure detector D can be used to solve Con-
sensus in C, then D ??
PROOF: If D can be used to solve Consensus in C, then, by Theorem 2, D ?? ?. From
Theorem 1, ? ?8 W By transitivity, D ?? ?.			0
In [CT9i] we proved that, for all environments E in which n > 2f, W can be used to
solve Consensus. Together with Corollary 3, this shows that:
Theorem 4: For all environments ? in which n > 2f, W is the weakest failure detector
that can be used to solve Consensus in C.
6 The reduction algorithm
6.1 Overview
12
Let g be an environment, D be a failure detector that can be used to solve Consensus in
?, and Consensusv be the Consensus algorithm that uses D. We describe an algorithm
Tv?? that transforms D into ? in S. Intuitively, this algorithm works as follows.
Fix an arbitrary run of Tv?n using D, with failure pattern F E S and failure detector
history Hv E D(F). Processes periodically query their failure detector V and exchange
information about the values of Hv that they see in this run. Using this information,
processes construct a directed acyclic graph (DAG) that represents a "sampling" of fail-
ure detector values in Hv and some temporal relationships between the values sampled.
To illustrate this, suppose process qi queries V, sees value di, and then sends the message
[qi, d1j to all processes. When a process q? receives [qi, di] it can add vertex [qi, dij to
its (current) version of the DAG: This vertex indicates that q? saw di. When q? later
queries V and sees the value d2, it can add vertex [q?, d2j and edge [qi, di] [q?, d2J to
its DAG: This edge indicates that qi saw di before before q? saw d2. By periodically
sending its current version of the DAG to all processes, and incorporating all the DAGs
that it receives into its own DAG, every correct process can construct ever increasing
finite approximations of the same (infinite) limit DAG G.
It turns out that DAG G can be used to simulate runs of Consensusv with failure
pattern F and failure detector history Hv. These are runs that could have occurred if
processes were running Consensusv instead of Tv??.
To illustrate this, consider a path of C, say [qi, dij, [q?, d2], [q?, ....... We can use this
path to simulate schedules of runs of Consensusv in which qi takes the first step and
sees failure detector value di, q? takes the second step and sees d2, q3 takes the third
step and sees d3, etc. Note that many such schedules of Consensusv are possible: For
each step, we have a choice of which message to receive either one of the messages
contained in the simulated buffer (i.e., a message previously sent but not yet received)
or the empty message.
Now consider any initial configuration 1 of Consensusv. The set of simulated sched-
ules of Consensusv that are "compatible with" some path of G (as explained above) and
are also applicable to 1 can be organized as a tree: paths represent simulated runs of
Consensusv with initial configuration 1, and branching occurs at the points where simu-
lated runs diverge. By considering several initial configurations of Consensusv, we obtain
a forest of simulated runs of Consensusv: a tree for each different initial configuration.
Thus, the (infinite) DAG G induces an (infinite) simulation forest T of runs of
Consensusv with failure pattern F and failure detector history Hv. Using T, we show
that it is possible to extract the identity of a process p* that is correct in F, and we give
the extraction algorithm.
The simulation forest T, however, is infinite and cannot be computed by any process.
Fortunately, the information needed by the extraction algorithm to identify p* is present
13
in a "crucial" finite subgraph of T that processes are able to eventually compute. When
running Tv?n, each process p constructs ever increasing finite approximations of the
DAG &. Using these approximations, p also constructs ever increasing finite approxi-
mations of T that eventually include the crucial subgraph needed to extract p*. At all
times, p runs the extraction algorithm on its present finite approximation of T to select
some process that it considers to be correct: once p's approximation of T includes the
crucial subgraph, the extraction algorithm will select p* (forever). Thus, there is a time
after which all correct processes trust the same correct process, p* which is exactly
what ? requires.
Having given an overall account of how the transformation of D to ? works, we
now provide a roadmap for the rest of this section. We first define the DAGs G that
allow us to induce an infinite simulation forest T (Section 6.2), and to extract a correct
process from T (Sections 6.3--H6.5). We then show how processes compute ever increasing
approximations of such a G and corresponding T (Section 6.6.1) . Finally, we show
that by periodically extracting a process from their current finite approximation of T,
all correct processes will eventually keep extracting (forever) the same correct process
(Section 6.6.2).
We now state some conventions that simplify the discussion that follows. We say that
a process is correct (crashes) if it is correct (crashes) in F. For simplicity, we assume
that a process p sees a value d at most once (this can be enforced by tagging a counter to
each value seen). For the rest of this paper, whenever we refer to a run of Consensusv,
we mean a run of Consensusv using D. Furthermore, we only consider schedules of
Consensnsv, and so we write (p, m, d) instead of (p, m, d, Consensusv) to denote a step.
6.2 A DAG and a forest
Let D be a failure detector that can be used to solve Consensus in environment ?. Given
an arbitrary failure pattern F E g, and failure detector history Hv e D(F), let G be
any infinite DAG with the following properties:
1.
The vertices of G are of the form [p,(11 where p ? II and d ? ?v. If [p,d] is a
vertex of G, then there is a time t such that p ? F(t) and d = Hv(p,t) (i.e., at
time t, p has not crashed and the value of p's failure detector module is d).
2. If [qi,di] [q2,d2] is an edge of 0 and d1 = Hv(qi,ti) and d2 = Hv(q2,t2) then
ti <t2.
3. 0 is transitively closed.
4.
Let V be any finite subset of vertices of & and p be any correct process. There is
a failure detector value d such that for all vertices [p',d'J in V, [p',d'] H [p,d] is an
edge of 0.
14
Note that G contains only a "sampling" of the failure detector values that occur in Hv,
and only a subset of the temporal relationships that relate them. In other words, we do
not require that G contain all the values that occur in Hv, or that it relate (with an
edge) all its values according to the time at which they occur in Hv. However, Property 4
implies that G contains infinitely many "samplings" of the failure detector module of
each correct process.
Let g = [qi, d1], [q?, ....... be any (finite or infinite) path of 0. A schedule 5 is
compatible with g if it has the same length as g, and 5 = (q1,m1,d1),(q2,m2,d2),...,
for some (possibly null) messages m1, m2,. We say that 5 is compatible with 0 if it is
compatible with some path of 0.
Let 1 be any initial configuration of Consensusv. Consider a schedule 5 that is
compatible with 0 and applicable to 1. Intuitively, 5 is the schedule of a possible run of
Consensusv with initial configuration 1, failure pattern F, and failure detector history
Hv.
We can represent all the schedules that are compatible with 0 and applicable to 1
as a tree. This is called the simulation tree ?G1 induced by 0 and I and is defined as
follows. The vertices of ?G1 are the finite schedules 5 that are compatible with 0 and are
applicable to I. The root of Tk is the empty schedule 5j. There is an edge from vertex
5 to vertex 5' if and only if 5' = 5. e for a step e;6 this edge is labeled e. With each
(finite or infinite) path in TG1, we associate the unique schedule 5 = e1,e2,... .......
consisting of the sequence of labels of the edges on that path. Note that if a path starts
from the root of ?G1 and it is finite, the schedule 5 associated with it is also the last
vertex of that path.
The following two lemmata make precise the connection between paths of TG1 and
runs of Consensusv. The proofs, which follow directly from the definitions, are included
in the Appendix.
Lemma 5: Let 5 be a schedule associated with a finite path of TG1 that starts from
the root. There is a sequence of times T such that (F, Hi, I, 5, T? is a partial run of
Consensusv.
Lemma 6: Let 5 be a schedule associated with an infinite path of T1G that starts
from the root. If in 5 every correct process takes an infinite number of steps and every
message sent to a correct process is eventually received, there is a sequence of times T
such that (F,Hv,I,5,T') is a run of Consensusi.
The following lemmata state some "richness" properties of the simulation trees in-
duced by 0 (their proofs are in the Appendix).
Lemma 7: For any two initial configurations I and I', if 5 is a vertex of ?c1 and is
applicable to I' then 5 is also a vertex of TG".
6If u, w are sequences and u is finite then u w denotes the concatenation of the two sequences.
15
Lemma 8: Let 5 be any vertex of TG' and p be any correct process. Let m be a message
in the message buffer of 5(1) addressed to p or the null message. For some d, 5 has a
child 5 (p,m,d) in ?1G
Lemma 9: Let 5 be any vertex of ?G1 and p be any process. Let m be a message in
the message buffer of 5(1) addressed to p or the null message. Let 5' be a descendent
of 5 such that, for some d, 5' (p, m, d) is in T?1. For each vertex 5" on the path from
5 to 5' (inclusive), 5" (p, m, d) is also in IG'.
Lemma 10: Let 5, 5o, and Si be any vertices of TG'. There is a finite schedule E
containing only steps of correct processes such that:
1. 5 E is a vertex of ?c' and all correct processes have decided in 5
2. For i = 0,1, if E is applicable to S?(I) then 5? E is a vertex of TG'.
Let 1? 0 < i < n denote the initial configuration of Consensusv in which the initial
values of Pi.. p? are 1, and the initial values of ....... p? are 0. The simulation forest
induced by G is the set ?T?,T?'1,... , TG'?J of simulation trees induced by G and initial
configurations 10,11,... , In.
6.3 Tagging the simulation forest
We assign a set of tags to each vertex of every tree in the simulation forest induced by
C. Vertex 5 of tree ?G1 gets tag k if and only if it has a descendent 5' (possibly 5' = 5)
such that some correct process has decided k in 5'(I). Hereafter, T? denotes the tagged
tree T0?, and T denotes the tagged simulation forest (T0, ..... . ,
Lemma 11: Every vertex of T? has at least one tag.
PROOF: From Lemma 10, every vertex 5 of T? has a descendent 5' = 5 E (for some
E)			such that all correct processes have decided in 5'(P').			0
A vertex of T? is monovalent if it has only one tag, and bivalent if it has both tags, 0 and
1. A vertex is 0-valent if it is monovalent and is tagged 0; 1-valent is similarly defined.
Lemma 12: Every vertex of T? is either 0-valent, 1-valent, or bivalent.
PROOF: Immediate from Lemma 11.			0
Lemma 13: The ancestors of a bivalent vertex are bivalent. The descendents of a
valent vertex are k-valent.
PROOF: Immediate from the definitions.			0
16
Lemma 14: If vertex 5 of T? has tag k, then no correct process has decided 1 --H k in
5(Ii).
PROOF: Since 5 has tag k, it has a descendent 5' such that a correct process p has
decided k in 5I(1i). From Lemma 5, there is a T such that R = (F,Hv,P,5',T')
is a partial run of Consensnsv. Since p has decided k in 5t(1i), from the agreement
requirement of Consensus, no correct process has decided 1 --H k in 5t(1i) Since 5' is a
descendent of 5, from the irrevocability requirement of Consensus, no correct process
could have decided 1 --H k in 5(li)			0
Lemma 15: If vertex 5 of Ti is bivalent, then no correct process has decided in 5(1t).
PROOF: Immediate from Lemma 14.			0
Recall that in 10 all processes have initial value 0, and in I? they all have initial value 1.
Lemma 16: The root of T0 is 0-valent; the root of T? is 1-valent.
PROOF: We first show that the root of T0 is 0-valent. Suppose, for contradiction, that
the root of T0 has tag 1. There must be a vertex 5 of T0 such that some correct process
has decided 1 in 5(10). From Lemma 5, there is a T such that R = (F, Hv, 10,5, T?
is a partial run of Consensusv. R violates the validity requirement of Consensus a
contradiction. Thus the root of T0 cannot have a tag of 1. From Lemma 11, the root of
T0 has at least one tag: thus it is 0-valent.
By a symmetric argument, the root of T? is 1-valent. 0
Index i is critical if the root of T? is bivalent, or if the root of T?--H1 is 0-valent while the
root of T? is 1-valent. In the first case, we say that index i is bivalent critical; in the
second case, we say that i is monovalent critical.
Lemma 17: There is a critical index i, 0 < i < n.
PROOF: Apply Lemmata 12 and 16 to the roots of T0, ..... . , T? 0
The critical index i is the key to extracting the identity of a correct process. In fact, if i
is monovalent critical, we shall prove that p? must be correct (Lemma 19). If i is bivalent
critical, the correct process will be found by focusing on the tree T?, as explained in the
following section.
6.4 Of hooks and forks
17
We describe two types of finite subtrees of T? referred to as decision gadgets of T?. Each
type of decision gadget is rooted at the root Sj of T? and has exactly two leaves: one
0-valent and one 1-valent. The least common ancestor of these leaves is called the pivot.
The pivot is clearly bivalent.
f0J
S.(p,m,d)
Figure 2: A fork p is the deciding process
The first type of decision gadget, called a fork, is shown in Figure 2. The two leaves
are children of the pivot, obtained by applying different steps of the same process p.
Process p is the deciding process of the fork, because its step after the pivot determines
the decision of correct processes.
?0J
0S.e
Root
o
SI			(pS??0?			?1J
S'=S.(p,m,d)
Figure 3: A hook p is the deciding process
The second type of decision gadget, called a hook, is shown in Figure 3. Let 5 be
the pivot of the hook. There is a step e such that 5 e is one leaf, and the other leaf is
m, d).e for somep, m, d. Process p is the deciding process of the hook, because the de-
cision of correct processes is determined by whether p takes the step (p, m, d) before e.7
7A fork may be a subgraph of a hook.
18
5 `. 5j			?5j is the bivalent root of T1J
repeat forever
Let p be the next correct process in round-robin order
Let m be the oldest message addressed to p in the message buffer of 5(li)
(if no such message exists, m = A)
if 5 has a descendent 5' (possibly 5' = 5) such that
for some d, 5'. (p, ni, d) is a bivalent vertex in T?
then 5			5' (p,m,d)			?5 is biva1en?
else exit
Figure 4: Generating path ir in T?
We shall prove that the deciding process p of a decision gadget, whether a fork or
a hook, must be correct (Lemma 21). Intuitively, this is because if p crashes, then no
process can figure out whether p has taken the step that determines the decision value;
indeed, this is so even though processes can consult the failure detector D. Thus, if p
crashes, then no process can decide contradicting the correctness of Consensusv
Lemma 18: If index i is bivalent critical then T? has at least one decision gadget (and
hence a deciding process).
PROOF: Starting from the bivalent root of T?, we generate a path 7r in T?, all the
vertices of which are bivalent, as follows. We consider all correct processes in round-
robin fashion. Suppose we have generated path 5 so far, and it is the turn of process p.
Let m be the the oldest message destined to p that is in the message buffer of 5(1i).S (If
no such message exists, we take m to be the null message.) We try to extend the path 5
so that the last edge in the extension corresponds to p receiving m and the target of that
edge is a bivalent vertex. The path construction ends if and when such an extension is
no longer possible. This construction is shown in Figure 4. Each iteration of the loop
extends the path by at least one edge. Let 7r be the path generated by these iterations;
ir is finite or infinite depending on whether the loop terminates.
Claim 1: r is finite.
PROOF: Suppose, for contradiction, that ir is infinite. Let 5 be the schedule associated
with ir. By construction, in 5 every correct process takes an infinite number of steps and
every message sent to a correct process is eventually received. By Lemma 6, there is a
T such that R = (F, Hv, I?, 5, T) is a run of Consensusv. By construction, all vertices
in 7r are bivalent. By Lemma 15, no correct process decides in R, thus violating the
termination requirement of Consensus--Ha contradiction. 0claim 1
8By a slight abuse of notation we identify a finite path from the root and its associated schedule.
19
Let 5 be the last vertex of ? (clearly, 5 is bivalent). Let p be the next correct process
in round-robin order when the loop in Figure 4 terminates. Let m ?e the oldest message
addressed to p in the message buffer of 5(li) (if no such message exists, m is the null
message). The loop exit condition and Lemma 12 imply that:
For all descendents 5' of 5 (including 5' = 5), 5' (p, m, --H) is monovalent. (*)
From Lemma 8, for some d, 5 has a child 5 (p,m,d) in T?. By (*), 5. (p,rn,d) is
monovalent. Without loss of generality, assume it is 0-valent.
Claim 2: For some d' there is a descendent 5' of 5 such that 5' (p, m, d') is a 1-valent
vertex of T?, and the path from 5 to 5' contains no edge labeled (p, m,
PROOF: Since 5 is bivalent, it has a descendent 5? such that some correct process has
decided 1 in 5*(1i). From Lemmata 11 and 14, 5* is 1-valent. There are two cases:
1.
2.
The path from 5 to 5* does not have an edge labeled (p, m, --H). Suppose m ? A.
Since m is in the message buffer of 5(li) and p does not receive m in the path
from 5 to 5*, m is still in the message buffer of 5*(1i). From Lemma 8 (which
also applies if m = A), for some d', 5*. (p, m, d') is in T?. Since 5* is 1-valent, by
Lemma 13, 5*. (p, m, d') is also 1-valent. In this case, the required 5' is 5?.
The path from 5 to 5* has an edge labeled (p, m, --H). Let (p, m, d') be the first
such edge on that path. Let 5' be the source of this edge. By (*), 5'. (p, m,
is monovalent. Since 5,. (p, m, d') has a 1-valent descendent 5*, by Lemma 13,
5'. (p, m, d') is 1-valent.			0claim 2
Consider the vertex 5' and edge (p, m, d') of Claim 2. By Lemma 9, for each vertex 5"
on the path from 5 to 5' (inclusive), 5". (p, m, d') is also in T?. By (*), all such vertices
5". (p, m, d') are monovalent. In particular, 5. (p, m, d') is monovalent. There are two
cases (see Figure 5):
1. 5. (p, m, d') is 1-valent. Since 5. (p, m, cl) is 0-valent, T? has a fork with pivot 5.
2. 5 (p, m, cl') is 0-valent. Recall that 5'. (p, m, cl') is 1-valent and for each vertex
5" between 5 and 5', 5". (p, rn,d') is monovalent. Thus, the path from 5 to 5'
must have two vertices 5o and Si such that 5o is the parent of Si, 5o (p, m, cl') is
0-valent and Si (p, m, cl') is 1-valent. Hence, T? has a hook with pivot 5o 0
20
0-valent
Fork (for case 1)
g51
m,			? (bivalent)
0
`?Q 5o (pivot of hook)
,d') oSi
0-valent
(p, m,d')"
1-valent			`A
1-valent
Hook (for case 2)
Figure 5: The decision gadgets in T? if i is bivalent critical
6.5 Extracting the correct process
i. If i is monovalent critical, Lemma 19 below
If? is bivalent critical, a correct process can be
By Lemma 17, there is a critical index
shows how to extract a correct process.
found by applying Lemmata 18 and 21.
Lemma 19: If index i is monovalent critical then p? is correct.
PROOF: Suppose, for contradiction, that pj crashes. By Lemma 10(1) (applied to the
root 5 = Si of Ti), there is a finite schedule E that contains only steps of correct
processes (and hence no step of p?) such that all correct processes have decided in E(It).
Since index i is monovalent critical, the root Sj of T? is 1-valent. Hence all correct
processes must have decided 1 in E(I?).
I? and 1i-i only differ in the state of p?. Since 5 is applicable to I? and does not
contain any steps of p?, an easy induction on the number of steps in 5 shows that: (a) 5
is also applicable to 1i-i, and (b) the state of all processes other than p? are the same in
5(li) and 5(Ji?1) Using Lemma 7, (a) implies that 5 is also a vertex of yi-i By (b),
all correct proces?es have decided 1 in 5(P?i) Thus the root of Ti-1 has tag 1. Since z
is monovalent critical, the root of T?--H1 is 0-valent a contradiction. 0
21
ysi
si (1?valent)9k'
E
(bivalent)
5o (0-valent)
5 E (Correct processes assumed to have
decided 0)
Figure 6: Lemma 20
Lemma 20: Let 5 be any bivalent vertex of T?, and 5o, Si be any 0-valent and 1-valent
descendents of 5. If there is a process p such that the paths from 5 to 5o and from 5 to
Si contain only steps of the form (p, --H, --H), then p is correct.
PROOF: Suppose, for contradiction, that p crashes. From Lemma 10, there is a schedule
E containing only steps of correct processes (and hence no step of p) such that:
i. 5 E is a vertex of T? and all correct processes have decided in 5
ii. For k = 0,1, if 5k E is applicable to I? then 5k E is a vertex of T?.
Without loss of generality assume that all correct processes decided 0 in 5
Refer to Figure 6. Since all steps in the path from 5 to 5i are steps of p, the state of
every process other than p is the same in 5(li) and in 51(li). Furthermore, any message
addressed to a process other than p that is in the message buffer in 5(li) is still in the
message buffer in 51(li). Since E is applicable to 5(li) and does not contain any steps
of p, an easy induction on the number of steps in E shows that: (a) E is also applicable
to 51(li), and (b) the state of every process other than p is the same in 5. E(P) and
Si E(P). By (ii), (a) implies that Si E(P) is a vertex in T?. By (b), all correct
processes decide 0 in Si E(P). So Si, has tag 0. But Si is 1-valent a contradiction.O
Lemma 21: The deciding process of a decision gadget is correct.
PROOF: Let ? be any decision gadget of T?. There are two cases to consider:
1. ? is a fork. By Lemma 20, the deciding process of ? is correct.
22
So =5
(0-valent)
ysi
6' s (bivalent)
5' = 5 (p,m,d)
ys
(1-valent)
5o?E 6
6' 5?
Figure 7: Lemma 21
2.
is a hook. Assume (without loss of generality) that 5 is the pivot of ?, 5o --H
5 (p', m', d') is the 0-valent leaf of ? and Si = 5 (p, m, d) . (p', m', d') is the 1-valent
leaf of ? (see Figure 7). There are two cases:
(a) p = p'. By Lemma 20, p is correct.
p # p'. Suppose, for contradiction, that p crashes. By Lemma 10, there is a
schedule E containing only steps of correct processes (and hence no step of
p) such that:
1.
11.
5o E is a vertex of T? and all correct processes have decided in 5o
Since 5o is 0-valent, all correct processes must have decided 0 in 5o E(11).
If E is applicable to 51(Ji) then Si E is a vertex of T?.
Let 5' = 5 (p, m, d) be the parent of St. The state of every process other than
p is the same in 5(1i) and 51(1i). Furthermore, any message addressed to a
process other than p that is in the message buffer in 5(Ji) is still in the message
buffer in 51(Ji). Therefore, since 5o = 5. (p', m', d') and St = 5' (p', m', d'),
the state of every process other than p is the same in 50(Ji) and 51(li). In
addition, any message addressed to a process other than p that is in the
message buffer in 50(Ji) is also in the message buffer in Si (1i) Since E is
23
applicable to 50(li) and does not contain any steps of p, an easy induction
on the number of steps in E shows that: (a) E is also applicable to 51(li),
and (b) the state of every process other than p is the same in 5o E(Ii) and
Si E(P). By (ii), (a) implies that Si E is a vertex of T?. By (b), all
correct processes decide 0 in Si E(I?). Thus Si, receives a tag of 0. But Si
is 1-valent?a contradiction.			0
There may be several critical indices and several decision gadgets in the simulation forest.
Thus, Lemmata 19 and 21 may identify many correct processes. Our selection rule will
choose one of these, as the failure detector ? requires, as follows. It first determines
the smallest critical index i. If i is monovalent critical, it selects p?. If, on the other
hand, i is bivalent critical, it chooses the "smallest" decision gadget in T? according to
some encoding of gadgets, and selects the corresponding deciding process. It is easy to
encode finite graphs as natural numbers. Since a decision gadget is just a finite graph,
the selection rule can use any such encoding. The whole method of selecting a correct
process is shown in Figure 8 (recall that C is any directed acyclic graph that satisfies
Properties 1--H4 of Section 6.2 with respect to the given failure pattern F).
fBuild and tag simulation forest T induced by GJ
fori?Q1,...,n:
Ti			simulation tree induced by C and 1?
for every vertex 5 of T?
if 5 has a descendent 5' such that a correct process has decided k in 51(1i)
then add tag k to 5
fSelect a process from tagged simulation forest TJ
& smallest critical index
if i is monovalent critical then return p?
else return deciding process of the smallest decision gadget in T?
Figure 8: Selecting a correct process
(1)
(2)
(3)
Theorem 22: The algorithm in Figure 8 selects a correct process.
PROOF: By Lemma 17, there is a critical index i, 0 < i < n. If i is monovalent critical,
Line 2 returns pi which, by Lemma 19, is correct. If i is bivalent critical, by Lemma 18,
T1 contains at least one decision gadget. Let ? be the decision gadget in Ti with the
smallest encoding. By Lemma 21, the deciding process of ? is correct in F. Thus, Line
3 returns the identity of a process that is correct. 0
6.6 The reduction algorithm Tv??
24
The selection of a correct process described in Figure 8 is not yet the distributed alg??
rithm Tv?n that we are seeking: it involves an infinite simulation forest and is "central-
ized". To turn it into a distributed algorithm, we will modify it as follows. Each process
will cooperate with other processes to construct ever increasing finite approximations of
the simulation forest. Such approximations will eventually contain the decision gadget
and the other tagging information necessary to extract the identity of the same correct
process chosen by the selection method in Figure 8.
Note that the selection method in Figure 8 involves three stages: The construction
of U, a DAG representing samples of failure detector values and some of their temporal
relationship; the construction and tagging of the simulation forest induced by U; and,
finally, the selection of a correct process using this forest.
Algorithm Tv?? consists of two components. In the first component, each process
repeatedly queries its failure detector module and sends the failure detector values it
sees to the other processes. This component enables correct processes to construct ever
increasing finite approximations of the same U. Since all inter-process communication
occurs in this component, we call it the communication component of Tv?n.
In the second component, each process repeatedly (a) constructs and tags the simu-
lation forest induced by its current approximation of U, and (b) selects the identity of
a process using its current simulation forest. Since this component does not require any
communication, we call it the computation component of Tv??.
6.6.1 The communication component
In this component processes cooperate to construct ever increasing approximations of
the same U. Let Up denote p's current approximation of U. Roughly speaking, each
process p periodically performs the following two tasks: (i) If p receives Uq for some q, it
incorporates this information by replacing Up with the union of Up and Uq. (ii) Process
p queries its own failure detector module. Let d be the value that it sees and [p',d'] be
any vertex currently in Up. Clearly, p saw d after p' saw d'. Thus p adds [p,c1j to Up,
with edges from all other vertices of Up to [p,d]. Process p then sends its updated Up to
all other processes. The communication component of Tv?? for p is shown in Figure 9.
Let Up(t) denote the value of Up at time t. If p takes a step at time t, Up(t) denotes
the value of Up at the end of that step. The next two lemmata establish certain useful
properties of the graphs constructed by the communication component. In reading the
proofs of these results it will be useful to keep in mind that in our model the three phases
of a step receive, failure detection query, and send occur atomically at time t.
25
fBuild the directed acyclic graph UpJ
Up			empty graph
repeat forever
RECEIVE PHASE:
p receives m
FAILURE DETECTOR QUERY PHASE:
query failure detector D
SEND PHASE:
if m is of the form (q, 0q,p) then
Up			Up U Uq
add [p,dpj to Up and edges from all other vertices of 0p to [p,dpj
outputp . computation component ?Fi?gure JOJ
p sends (p,Up,q) to all q ? II
Figure 9: Process p's communication component
(1)
(2)
(3)
(4)
Lemma 23: Let v be a vertex contained in some local graph during the execution of
the communication component. Let Up(t) be the first graph that contains v. (That is,
v is in Up(t), but not in Uq(t'), for any process q and time t' <1.) Then:
1. v = [p,1J, and p saw d at time t.
2. If u .` v is an edge contained in some local graph during the execution of the
communication component (i.e., u v is in Uq(t'), for some process q and some
time t'), then u v is contained in Up(t).
3. Up(t) is a subgraph of any graph that contains v.
PROOF: 1. Process p adds v to Up(t) in Line (1) or (2). In the latter case, the result
follows immediately. In the former case, p must have received a message at time t with a
graph that contains v. The process that sent that message must have had v in its graph
before time t, contradicting the choice of Up(t) as the first graph to contain v.
2. Consider the earliest time 1' when the edge u v was added to some graph, say of
process q. By definition of t, t' > 1. If t' > t, at time t' process PI receives a message that
contains a graph with the edge u H v The sender of that message had a graph that
contained the edge u v at some time before t', contrary to the choice of t'. Therefore
it must be that t' = t. Then, by Part (1), q = p and so u v is in Up(t), as wanted.
3. Suppose, for contradiction, that some graph contains v but is not a supergraph of
Up(t). Choose the first such graph, say, Uq(t1). By definition of 1, t' > t. Clearly, q $ p
because p never removes any vertices or edges from its own graph. Therefore, at time
process q receives a message with a graph that contains v but is not a supergraph of
26
Gp(t). The sender of that message must have had a graph that contains v but is not a
supergraph of Gp(t) before time t', contrary to the choice of Gq(t'). 0
Recall that we are considering a fixed run of Tv??, with failure pattern F, and failure
detector history Hv E D(F). We now prove that the graphs constructed by the commu-
nication component of Tv?? satisfy certain properties. Note the similarity between the
first four and the four properties of the graphs defined in Section 6.2.
Lemma 24: For any correct process p and time t:
1. The vertices of Cp(t) are of the form [p',d? where p' E II and d' ? 1?v If [p',d'j is
a vertex of Gp(t), then there is a time t' such that p' ? F(t') and d' = Hv(p', t').
2. If [qi,d1] ) [q2,d2] is an edge of Up(t) and d1 = Hv(qi,ti) and d2 = Hv(q2,t2)
then t1 <t2.
3. Gp(t) is transitively closed.
4. There is a time t' > t and a failure detector value d such that for all vertices [p',d'J
of Gp(t), [p',d'] [p,d] is an edge of Gp(t').
5. Gp(t) is a subgraph of Cp(t'), for all t' > t.
6. For all correct q, there is a time t' > t such that Gp(t) is a subgraph of Gq(t').
PROOF:
Property 1 : Consider the first graph that contains the vertex [p',d']. By Lemma 23(1),
this graph is Upi (1') for some time t', and p' saw d' at time t'. Thus p' ?
(otherwise P' would not have taken a step at time t1 and would not have seen d1),
and d' = Hv(P1,t1), as wanted.
Property 2 : By Lemma 23(2), [q1,d1J H [q2,d2] is an edge of Gq2(t2). Let t1 be the
time when q? added vertex [q1, d1] to Gq? Clearly, t1 < 12. There are two cases:
1. t' < 12. By Lemma 23(1), [qi,dij was not in any graph before time Ii. Thus,
t1 < 1' and from the hypothesis of this case, l? ? 12.
2. t1 = t. Then q? received a graph containing [q1, d1] at 12. Let III be the time
when this graph was sent. Of course, t'1 < 12. By Lemma 23(1), [qi, d1J was
not in any graph before ti, and therefore Ii ? 1". Thus, Ii <12.
27
Property 3 : Let [q1, d1J			...			[qk, dkj be a path in Gp(t). We must show that there
is an edge [qi,dij H [q?,d?j in Cp(t).
Let tj be the time when q? inserted [q?, djj in 0q?? for 1 < i < k. By induction oniwe
show that [qi,d1J ... [q?,d?j is apath in Gqj(ti). The basis, i = 1, is trivial. For
the induction step, suppose that [qi, d1j .... [qi-i, di?1j is a path in Gqj?1 (ti?1).
Since [q??1, di?1] H ki, djj is an edge in Cp(t), by Lemma 23(2), it is also an edge
in Cqj(tj). Since ki?i,di?ii is a vertex in Gq?(tj), by Lemma 23(3), Gqj?1(ti?i) is
a subgraph of Gq?. (ti). In particular, Gq?(ti) contains the path [q1,d1J ...
[qi?i,di?1J. Thus, [q1,d1] ... [qi,dij is a path in Cqj(tj), as wanted.
Therefore, the vertices [q1,d1J,... , [qk, dkj are all in &?? (tk). At time tk, qk adds
an edge from every other vertex to [qk, dk]. Thus, the edge [qi, d1J [qk, dk] is in
Gq?(t?). By Lemma 23(3), Gq?(t?) is a subgraph of Gp(t) (since the latter contains
[qk, dkj). Therefore, [qi, d1J [qk, dk] is in Gp(1), as wanted.
Property 5 : Once a vertex or edge is added to Gp it is not removed.
Property 4 : Since p is correct, it takes a step at some time t' after t. In the failure
detector query phase of this step, p queries its failure detector module and obtains
a value, say d. In Line 2 of this step, p adds the vertex [p,dJ to Cp and an edge
from all other vertices of Cp(t') to [p,d]. From Property 5, Cp(t) is a subgraph of
Gp(t'), hence the result follows.
Property 6 : Since p is correct, it eventually sends Gp(t) to all processes, including
q (this occurs in p's first execution of Line 4 after time t). Since q is correct, it
eventually receives Gp(t), and then replaces 0q with Gg U Gp(t), say at time t'. So,
Cp(t) is a subgraph of Cq(1,).			0
Property 5 of the above lemma allows us to define Gp? = UtET Up (I). By Property 6:
Lemma 25: For any correct processes p and q, 9? =
PROOF: Let 0 be any vertex or edge of Cp?, i.e., there is a time t at which 0 is in Gp(t).
From Lemma 24 (6), there is a time t' such that Gp(t) is a subgraph of Gq(t'). Thus o
is in G?. Thus Gp? is a subgraph of Cq? By a symmetric argument, U? is a subgraph
of Q?, hence Gp? = UOOq			0
Lemma 25 allows us to define the limit graph C to be &p? for any correct process p. The
first four properties of Lemma 24 immediately imply:
Lemma 26: The limit graph G satisfies the four properties of the DAGs defined in
Section 6.2.
As before, T? denotes the tagged simulation tree induced by U and initial configuration
P, and T denotes the tagged simulation forest fY0, ..... . ,
28
fBuild and tag simulation forest T i
fori?Q1,...,n:			p nducedbyG?J
Tp? : simulation tree induced by Cp and I?
for every vertex S of T?1
if s has a descendent S' such that p has decided k in 5?(1i)
then add tag k to S
fSelect a process from tagged simulation forest TpJ
if there is no critical index then return p
else
z .` smallest critical index
if i is monovalent critical then return p?
else if Tp? has no decision gadgets then return p
else return deciding process of the smallest decision gadget in Tp?
Figure 10: Process p's computation component
(1)
(2)
(3)
6.6.2 The computation component
Since the limit graph G has the four properties of Section 6.2, we can apply the "central-
ized" selection method of Figure 8 to identify a correct process. This method involved:
o+ Constructing and tagging the infinite simulation forest T induced by G.
o+ Applying a rule to T to select a particular correct process p*.
In the computation component of Tv??, each process p approximates the above method
by repeatedly:
o+ Constructing and tagging the finite simulation forest Tp induced by Gp, its present
finite approximation of G.
o+ Applying the same rule to Tp to select a particular process.
Since the limit of Tp over time is T, and the information necessary to select p* is in a
finite subgraph of T, we can show that eventually p will keep selecting the correct process
p*, forever.
Actually, p cannot quite use the tagging method of Figure 8: that method requires
knowing which processes are correct! Instead, p assigns tag k to a vertex S in Tp? if and
only if s has a descendent S' such that p itself has decided k in S'(P). If p is correct,
this is eventually equivalent to the tagging method of Figure 8. If p crashes, we do not
care how it tags its forest. Also, p cannot use exactly the same selection method as that
of Figure 8: its current simulation forest Tp may not yet have a critical index or contain
29
any decision gadget (although it eventually will!). In that case, p temporizes by just
selecting itself. The computation component of Tv?o is shown in Figure 10 (compare it
with the selection method of Figure 8).
We first show that Tp, the simulation forest that p constructs, is indeed an increas-
ingly accurate approximation of T (Lemma 27). We then show that the tags that p gives
to any vertex 5 in Tp are eventually the same ones that the tagging rule of Figure 8 gives
to 5 in T (Lemma 28). Let Tp(t) denote T at
forest induced by Gp(t).			?			time t, i.e., Tp(t) is the finite simulation
Lemma 27: For any correct p and any time t:
1. Tp(t) is a subgraph9 of T
2. Tp(t) is a subgraph of Tp(t'), for all 1' > 1.
3. UTp(t)=T
tET
PROOF:
Property 1 : Let 5 be any vertex of tree Tp?(t) (for some i, 0 < i < n). From the
definition of Tp?(t), 5 is compatible with some path g of Cp(t) and applicable to
I?. Since Gp(t) is a subgraph of U, g is also a path of G. Thus, 5 is compatible
with G; since it is also applicable to I?, it is a vertex of T?.
Similarly, let 5 . 5' be an edge e of Tp?(t)., Since 5 and 5' are also vertices of T?,
and 5' = 5. e, 5 ) 5' is also an edge of Tt
Property 2 : Follows from Lemma 24 (5).
Property 3 : We first show that T is a subgraph of UtET Tp(t). Let 5 be any vertex of
any tree T? of T. From the definition of Tt, 5 is compatible with some finite path
g of G and is applicable to I?. Since G = UtET Cp(t) and g is a finite path of C,
there is a time t such that g is also a path of Cp(t). Since 5 is compatible with g
of Gp(t) and is applicable to I?, 5 is a vertex of Tp?(t)
Let 5 H 5' be any edge e of T?. By the argument above, there is a time t after
which both 5 and 5' are vertices of Tp? Since 5' = 5. e, after time t the edge e is
also in Tp? Thus, every vertex and every edge of T is also in UtET Tp(t), i.e., T is
a subgraph of UtET Tp(t). By Property 1, UtET Tp(t) = T. 0
Lemma 28: Let p be any correct process, and 5 be any vertex of Tp There is a time
after which the tags of 5 in Tp are the same as the tags of 5 in T.
PROOF: Suppose that at some time t, p assigns tag k to vertex 5 of tree Tp? Thus 5
9The subgraph and graph equality relations ignore the tags
30
has a descendent S' in Tp?(t) sud? that p has decided k in S'(J?). By Lemma 27(1), S'is
also a descendent of S in T?, and since p is correct, S has tag k in T? as well.
Conversely, suppose a vertex S of a tree T? of T has tag k. We show that, eventually,
p also assigns tag k to S in Tp? Since S has tag k in T?, s has a descendent S' in T?
such that some correct process has decided k in 51(1i) (cf. tagging rule in Figure 8).
By Lemma 10(1), there is a descendent 5" of 5' in T?, such that all correct processes,
including p, have decided in 511(1i) By Lemma 5, 511(1i) is a configuration of a partial
run of Consensnsv. By the Agreement property of Consensus, p must have decided k in
511(1i) Consider the path that starts from the root of T? and goes to vertex 5 and then
to 5". By Lemma 27(3), there is a time t after which this path is also in T\). Therefore,
when p executes the tagging rule of Figure 10 after time 1, p assigns tag k to 5 in Tp?
(because p has decided k in 511(1i), and 5" is a descendent of 5 in Tp?) 0
Recall that p* is the correct process obtained by applying the selection rule of Figure 8
to the infinite simulation forest T. We now show that there is a time after which any
correct p always selects p* when it applies the corresponding selection rule of Figure 10 to
its own finite approximation of the simulation forest Tp Roughly speaking, the reason
is as follows. By Lemma 28, there is a time t after which the tags of all the roots in
p's forest T? are the same as in the infinite forest T. Since these tags determine the
sets of monovalent and bivalent critical indices, after time t these sets according to p are
the same as in T. Let i be the minimum critical index in these sets, and consider the
situation after time t. If i is monovalent critical, the selection rule of Figure 10 selects p?,
which is what p* is in this case. If z is bivalent critical, then p selects the deciding process
of its current minimum decision gadget of T\) (if it has one). This case is examined below.
Let ?? be the minimum decision gadget of Ti (so, p* is the deciding process of ?*)
For a while, ? may not be the minimum decision gadget of T,,?. This may be because
?? (and its tags) is not yet in Tp? However, by Lemmata 27(3) and 28, ?? (including its
tags) will eventually be in Tp? Alternatively, it may be because Tp? contains a subgraph
? whose encoding is smaller than ?*??? and for a while ? looks like a decision gadget
according to its present tags. However, by Lemma 28, p will eventually determine all
the tags of ?, and discover that ? is not really a decision gadget. Since there are only
finitely many graphs whose encoding is smaller than ?*??, p will eventually discard all
the "fake" decision gadgets (like ?) that are smaller than ??, and then select ?? as its
minimum decision gadget. After that time, p always selects the deciding process of ??
which is precisely p*, in this case.
Theorem 29: For all correct processes p, there is a time after which output,, =
forever.
PROOF: Let i* denote the critical index selected by Line 1 of Figure 8 applied to T. By
Lemma 28, there is a time tinit after which every root of Tp has the same tags as the
corresponding root of T. Thus after time tinit, p always sets i = j* in Line 1 of Figure 10.
We now show that there is a time after which the computation component of p (Figure
10) always returns p*. There are two cases:
1.
31
i* is monovalent critical. In this case, p* is process I)j* (by Line 2 of the selection
rule Figure 8). Similarly, after time t???t: (a) p always sets i to i* (Line 1 of
Figure 10); (b) p always returns ?j* (Line 2 of Figure 10).
2. i* is bivalent critical. Let ?? denote the smallest decision gadget of T??. In this case,
p* is the deciding process of ??. Since ?? is a finite subgraph of T?*, by Lemma 27(3),
there is a time after which ?? is also a subgraph of Tp? By Lemma 28, there is a
time t?* after which all the (finitely many) vertices of ?? receive the same tags in
y? and Tp?* Thus after time t?, ?? is a also decision gadget of Tp?
Since each graph is encoded as a unique natural number, there are finitely many
graphs with a smaller encoding than ??. Let g denote the set of graphs with a
smaller encoding than ??, and ? be any graph in ?. We show that there is a time
after which ? is not a decision gadget of Tp?* There are two cases:
(a) ? is not a subgraph of T?*. In this case, by Lemma 27(1), ? is never a subgraph
of
(b) ? is a subgraph of T?*. Since ?? is the smallest decision gadget of T?* and ?
is smaller than ??, ? is not a decision gadget of T?*. By Lemma 28, there is
a time t? after which all the (finitely many) vertices of ? have the same tags
in Tt and Tp? Thus after time t?, ? is not a decision gadget of Tpt
Since ? is finite, there is a time tq after which no graph in ? is a decision gadget
of
Consider the process that is returned by the computation component of p (Fig-
ure 10) at any time t > max(1???t,t?*,t?). Since t > 1init, P always sets i to i* in
Line 1. Since t > 1?., ?? is a decision gadget of Tp?(t) Finally, since I > t?, ?? is
the smallest decision gadget of Tp?(t) Thus, since ?? is bivalent, at any time after
max(t???t, t?*, t?), Line 3 of Figure 10 returns the deciding process of ??. Therefore,
after time max(t???t, t?*, t?), the computation component of p always returns p*.
From the above, there is a time after which p sets OUlPUIp & p*, forever, in Line 3 of
Figure 9.			0
We now have all the pieces needed to prove our main result, Theorem 2 in Section 5:
Theorem 2: For all environments S, if a failure detector D can be used to solve Con-
sensus in ?, then D ?e
PROOF: Consider the execution of algorithm Tv?n in any environment S. By Theo-
rem 29, there is a time after which all correct processes set outPUlp = P*, forever. By
Theorem 22, p* is a correct process. Thus, Tv?? is a reduction algorithm that transforms
D into ?. In other words, ? is reducible to D. 0
7 Discussion
7.1 Granularity of atomic actions
32
Our model incorporates very strong assumptions about the atomicity of steps. First,
the three phases of each step are assumed to occur indivisibly, and at a single time. In
particular, the failure of a process cannot happen in the "middle of a step". This allows
us to associate a single time t with a step and think of the step as occurring at that
time. Second, in the send phase of a step a message is sent to all processes. Given that
the entire step is indivisible, this means that either all or none of the correct processes
eventually receive the message sent in a step. Finally, no two steps can occur at the
same time.10 These assumptions are convenient because they make the formal model
simpler to describe. Also, they are consistent with those made in the model of [FLP85]
that provided the impetus for this work.
On the other hand, in [CT9ij a model with weaker properties is used. There, the
three phases of a step need not occur indivisibly, and may occur at different times. Even
within the send phase, the messages sent to the different processes may be sent at different
times. Thus, a failure may occur in the middle of the send phase, resulting in some correct
processes eventually receiving the messages sent to them in that step while others never
do. Also, actions of different processes may take place simultaneously, subject to the
restriction that a message can only be received strictly after it was sent. Since [CT91j is
mainly concerned with showing how to use various types of failure detectors to achieve
Consensus, the use of a weaker model strengthens the results. (In fact, the negative
results of [CT9ij hold even in the model of this paper, with the stronger atomicity
assumptions.)
The question naturally arises whether our result also applies to this weaker model.
In other words, if a failure detector D can be used to solve Consensus in the weak model,
is it true that we can transform D to VV in that modet? It turns out that the answer is
affirmative. To see this, first note that if D solves Consensus in the weak model then it
surely solves Consensus in the strong model. By our result, D can be transformed to W
in the strong model. It remains to show that D can be transformed to )`V in the weak
model. This is not obvious, since it is conceivable that the extra properties of the strong
model are crucial in the transformation of D to W. Fortunately, the transformation
presented in this paper actually works even in the weak model!
To see this, it is sufficient to ensure that the communication component of the trans-
formation (cf. Figure 9 in Section 6.6.1) constructs graphs that satisfy the properties
listed in Lemma 24, even if we run it in the weak model. It is not difficult to verify
that this is indeed so. The proof is virtually the same, except for the fact that we must
distinguish the time tin which a process p queries its failure detector and the time t' in
which p adds the value it saw into Gp. In our proof we assume that t = 1'; in the weak
10This is reflected in our formal model by the fact that the list of times in a run (which indicate when
the events in the run's schedule occur) is increasing.
33
model we would have I <I'. Similar comments apply to all actions within a step that are
no longer assumed to occur at the same instant of time. These changes make the proofs
slightly more cumbersome, since we must introduce notation for all the different times
in which relevant actions within a step take place, but the reasoning remains essentially
the same.11
Thus, our result is not merely a fortuitous consequence of some whimsical choice of
model. We view the robustness of the result across different models of asynchrony as
further testimony to the significance of the failure detector W.
7.2 Failure detection and partial synchrony
The fundamental reason why Consensus cannot be solved in completely asynchronous
systems is the fact that, in such systems, it is impossible to reliably distinguish a pro-
cess that has crashed from one that is merely very slow. In other words, Consensus
is unsolvable because accurate failure detection is impossible. On the other hand, it
is well-known that Consensus is solvable (deterministically) in completely synchronous
systems that is, systems where all processes take steps at the same rate and each
message arrives at its destination a fixed and known amount of time after it is sent. In
such a system we can use timeouts to implement a "perfect" failure detector i.e., one
in which no process is ever wrongly suspected, and every faulty process is eventually
suspected. Thus the ability to solve Consensus in a given system is intimately related to
the failure detection capabilities of that system. This realization led to the extension of
the asynchronous model of computation with failure detectors in [CT9i]. In that paper
Consensus is shown to be solvable even with very weak failure detectors that could make
an infinite number of "mistakes
A different tack on circumventing the unsolvability of Consensus is pursued in [DDS87J
and [DLS88J. The approach of those papers is based on the observation that between
the completely synchronous and completely asynchronous models of distributed systems
there lie a variety of intermediate "partially synchronous" models. For instance, in one
model of partial synchrony, processes take steps at the same rate, but message delays
are unbounded (albeit finite). Alternatively, it may be known that message delays are
bounded, but the actual bound may be unknown. In yet another variation, the evenittat
maximum message delay is known, but during some initial period of finite but unknown
duration some messages may experience longer delays. These and many other models of
partial synchrony are studied in [DDS87j and [DLS88], and the question of solvability of
Consensus in each of them is answered either positively or negatively.
11Another problem that must be confronted is that in the proofs of Lemmata 23 and 24 we often refer
to the "first graph" in which a vertex or edge is present. In the strong model there is no difficulty with
this, since processes cannot execute steps simultaneously. In the weak model, we have to justify that it
makes sense to speak of the "first" graph to contain a vertex or edge, in spite of the fact that certain
actions can be executed at the same time. The fact that a message can be received only after it was
sent is needed here.
34
In particular, [DDS87J defines a space of 32 models by considering five key parameters,
each of which admits a "favorable" and an "unfavorable" setting. For instance, one of
the parameters is whether the maximum message delay is known (favorable setting) or
not (unfavorable setting). Each of the 32 models corresponds to a particular setting
of the 5 parameters. [DD887] identifies four "minimal" models in which Consensus is
solvable. These are minimal in the sense that the weakening of any parameter from
favorable to unfavorable would yield a model of partial synchrony where Consensus is
unsolvable. Thus, within the space of the models considered, [DDS87J and [DLS88]
delineate precisely the boundary between solvability and unsolvability of Consensus, and
provide an answer to the question "What is the least amount of synchrony sufficient to
solve Consensus?".
Failure detectors can be viewed as a more abstract and modular way of incorporating
partial synchrony assumptions into the model of computation. Instead of focusing on
the operational features of partial synchrony (such as the five parameters considered
in [DDS87]), we can consider the a?iomatic properties that failure detectors must have
in order to solve Consensus. The problem of implementing a given failure detector in
a specific model of partial synchrony becomes a separate issue; this separation affords
greater modularity.
To see the connection between partial synchrony and failure detectors, it is useful to
examine how one might go about implementing a failure detector. By the impossibility
result of [FLP85], a failure detector that can be used to solve Consensus cannot be
implemented in a completely asynchronous system. Now consider partially synchronous
systems in which correct processes have accurate timers (i .e., they can measure elapsed
time). If in such a system message delays are bounded and the maximum delay is known,
we can use timeouts to implement the "perfect" failure detector described above. In a
weaker system where message delays are bounded but the maximum delay is not known,
we can implement a failure detector satisfying a weaker property: eventually no correct
process is suspected. This can be done by using timeouts of increasing length; once the
timeout period has been increased sufficiently to exceed the unknown maximum delay,
no correct process will be suspected. A failure detector with the same property can also
be implemented in a distributed system where the eventual maximum message delay is
known, but messages may be delayed for longer during some initial period of finite but
unknown duration. With these remarks we illustrate two points: First, that stronger
failure detectors correspond to stronger models of partial synchrony; and second, that
the same failure detector can be implemented in different models of partial synchrony.
Studying failure detectors rather than various models of partial synchrony has several
advantages. By determining whether Consensus is solvable using some specific failure
detector we thereby determine whether Consensus is solvable in all systems in which
that failure detector can be implemented. An algorithm that relies on the axiomatic
properties of a given failure detector is more general, more modular, and simpler to
understand than one that relies directly on some specific operational features of partial
synchrony (that can be used to implement the given failure detector).
35
From this more abstract point of view, the question "What is the least amount of
synchrony sufficient to solve Consensus?" translates to "What is the weakest failure
detector sufficient to solve Consensus?". In contrast to [DDS87], which identified a set
of minimal models of partial synchrony in which Consensus is solvable, we are able to
exhibit a single minimum failure detector that can be used to solve Consensus. The
technical device that made this possible is the notion of reduction between failure de-
tectors. We suspect that a corresponding notion of reduction between models of partial
synchrony, although possible, would be more complex. This is because there are models
which are not comparable in general (in the sense that there are tasks that are possible
in one but not in the other and vice versa), although they are comparable as far as
failure detection is concerned which is all that matters for solving Consensus! In this
connection, it is useful to recall our earlier observation, that the same failure detector
can be implemented in different (indeed, incomparable) models of partial synchrony.
7.3 Weak Consensus
[FLP85J actually showed that even the Weak Consensus problem cannot be solved (de-
terministically) in an asynchronous system. Weak Consensus is like Consensus except
that the validity property is replaced by the following, weaker, property:
Non-triviality: There is a run of the protocol in which correct processes decide 0, and
a run in which correct processes decide 1.
Unlike validity, this property does not explicitly prescribe conditions under which the
correct processes must decide 0 or 1 it merely states that it is possible for them to
reach each of these decisions. It is natural to ask whether our result holds for this weaker
problem as well. That is, we would like to know if the following holds:
Theorem: For all environments S, if a failure detector D can be used to solve Weak
Consensus in C, then D ?E
Under the above definition of non-triviality, this is not quite right. But as we shall argue,
the problem really lies with the definition! Under a slightly stronger definition, which
is more appropriate for our model that incorporates failure detectors, the theorem is
actually true.
Intuitively, the problem with the above definition of non-triviality in our model of
failure detectors is that it is possible for the decision of correct processes to depend
entirely on the the values returned by the failure detector. Consider, for example, a
failure detector D so that for each failure pattern F, D(F) = tHo, H1J, where for all
processes p and times t, and for all i ? fO, lJ, H?(p, t) = i. In other words, in any given
run, this failure detector returns the same binary value to all processes at all times,
independent of the run's failure pattern. It is trivial to use this failure detector to solve
Weak Consensus: A process merely queries its failure detector and decides the value
36
returned! It is easy to see that D ?? ?, for any environment ?: D provides absolutely
no information about which processes are correct or faulty.12
At this point, the reader may justifiably object that D is "cheating" --H it is really
not a failure detector, but a mechanism that non-deterministically chooses the decision
value. One possible way of fixing this problem would be to make our definition of failure
detector less general than it presently is. We could then try to prove the theorem for
this restricted definition of failure detectors. This approach, however, is fraught with the
danger of restricting the definition too much and ruling out legitimate failure detectors
in addition to bogus ones, like D. Intuitively, the failure detector is supposed to provide
some information about faulty processes. As this information may be encoded in a
complex way, we should not arbitrarily rule out such encodings because, in doing so, we
may be inadvertently ruling out useful failure detectors.
Instead of modifying our definition of failure detector, we strengthen the non-triviality
property to require that the failure detector values seen by the processes do not, by
themselves, determine the decision value. To formalize this, let R be a run of a Consensus
algorithm, and (p1,m1,di),(p2,m2,d2),..., be the schedule of R. We denote by fd(R)
the sequence [p1,d1][p2,d2]..., i.e., the sequence of failure detector values seen by the
processes in R. Consider the relation = on runs, where R =--H R' if and only if fd(R) =
fd(R'). It is immediate that = is an equivalence relation. We now redefine the non-
triviality property in our model (where processes have access to failure detectors) as
follows:
Non-triviality: In every equivalence class of the relation = there is a run of the protocol
in which correct processes decide 0, and a run in which correct processes decide 1.
This captures the idea that the decision value cannot be ascertained merely on the basis
of the failure detector values seen by the processes. It must also depend on other aspects
of the run (such as the initial values, the particular messages sent, or other features).
If we define Weak Consensus using this version of non-triviality, then the Theorem
stated above is, in fact, true. We briefly sketch the modifications of our proof needed
to obtain this strengthening of Theorem 2. The only use of the validity property is in
the proof of Lemma 16 which states that the root of T0 is 0-valent and the root of T?
is 1-valent. This, in turn, is used in the proof of Lemma 17, which states that a critical
index exists.
To prove the stronger theorem, we concentrate on the forest induced by all initial
configurations not just 10,... , I?. Thus, the forest now will have 2? trees, rather than
only n+ 1. Consider the n initial values of processes in an initial configuration as an n-bit
vector, and fix any n-bit Gray code.13 Let 10,... , I2?-i be the initial configurations listed
in the order specified by the Gray code, and T? be the tree Tc?, for alliEi?0,...,2?--H1J.
12In fact, V cannot be used to solve Consensus.
13An n-bit Gray code is a sequence of all possible n-bit vectors where successive vectors, as well as
the first and last vectors, differ only in the value of one position. It is well-known that such codes exist
for all n> 1.
37
We use the same definition for a critical index as we had before: Index i E ?O,..., 2? --H 1J
is critical if the root of T? is bivalent or the root of T? is 1-valent while the root of Tt
is 0-valent. The only difference is that we now take subtraction to be modulo 2?, so that
when i = 0, i --H 1 = --H1 = --H 1. We can now prove an analogue to Lemma 17
Lemma: There is a critical index i, 0 < i < 2? --H 1.
PROOF: First, we claim that that the forest contains both nodes tagged 0 and nodes
tagged 1. To see this, let 5 be a node in some tree of the forest. By Lemma 11, 5 has a
tag; without loss of generality, assume that 5 has tag 0. Consider an infinite path that
extends 5. By Lemma 6 and the fact that 5 is tagged 0, there is a run R of the Weak
Consensus algorithm in which a correct process decides 0. By non-triviality, there is a
run =--H R, so that correct processes decide 1 in R'. Let 5' be the infinite schedule and
I? be the initial configuration of run R'. Using the definition of the = relation and the
construction of the induced forest, it is easy to show that every finite prefix of 5' is a
node of T?. Since correct processes decide 1 in R', all these nodes are tagged 1.
Since there are both nodes tagged 0 and nodes tagged 1, by Lemma 13, there are
both roots tagged 0 and roots tagged 1. If the root of some T? is tagged both 0 and 1, it
is bivalent and we are done. Otherwise, the roots of all trees are monovalent, and there
are both 0- and 1-valent roots. Thus, there exist 0 < i, j <2? --H 1 so that the root of Tt
is 0-valent and the root of T? is 1-valent. By considering the sequence T?, T?+1,... ,
(where addition is modulo 2?) it is easy to see that the root of some Tk, k ? i, that
appears in that sequence is 1-valent, while the root of Tk-i is 0-valent. By definition, k
is a critical index.			0
The rest of the proof remains unchanged.
7.4 Failure detectors with infinite range of output values
The failure detectors in [RB91, CT91] only output lists of processes suspected to have
crashed. Since the set of processes is finite, the range of possible output values of these
failure detectors is also finite. In this paper our model allows for failure detectors with
arbitrary ranges of output values, including the possibility of infinite ranges! We illustrate
the significance of this generality by describing a natural class of failure detectors whose
range of output values is infinite (though each value output is finite).
Example: One apparent weakness with our formulation of failure detection is that a
brief change in the value output by a failure detector module may go unnoticed. For
example, process p's module of the given failure detector, D, may output d1 at time
t1, d2 at a later time t2 and d1 again at time t3 after t2. If due to the asynchrony of
the system p does not take a step between time t1 and t3, p may never notice that its
failure detector module briefly output d2. A natural way of overcoming this problem
is to replace D with failure detector D' that has the following property: D' maintains
the same list of suspects as D but when queried, D' returns the entire history of its list
of suspects up to the present time. In this manner, correct processes are guaranteed
38
to notice every change in D"s list of suspects. As the system continues executing, the
values output by D' grow in size. Thus DI has an infinite range of output values.
However, since D is a function of F, the failure pattern encountered, DI is also a
function of F, and can be described by our model. Thus, the result in this paper applies
to DI, a natural failure detector with infinite range of output values.
Appendix
In this appendix, & is a DAG that satisfies Properties 1--H4 listed in Section 6.2. To give
the proofs of the lemmata in Section 6.2, we first show two auxiliary results.
Lemma 30: Let V be any finite subset of vertices in G. G has an infinite path g such
that:
o+ There is an edge from every vertex of V to the first vertex of g.
o+ If [p,--HJ is a vertex of g then p is correct; for each correct p, there are infinitely
many vertices [1),--H] in g.
PROOF: By repeated application of Property 4 of the DAG G. 0
Lemma 31: 5 is a schedule associated with a path of T5 that starts from the root if
and only if 5 is a schedule compatible with 0 and applicable to 1.
PROOF: The lemma obviously holds if 5 is a finite schedule (this is immediate from the
definitions). Now let 5 = e1, ..?...,...... be an infinite schedule, where e? = [qi, m?, djj.
We define 50 = Si, Si = e1, 52=51.e2, and in general Si=5????e? for alli=1,2,...
Assume that 5 is compatible with 0 and applicable to 1. We must show that 5 is a
schedule associated with a path of ?G1 that starts from the root. To see this, note that
for all i > 0, S? is a finite schedule that is also compatible with 0 and applicable to 1
Thus, all the schedules 5o,5i,5?,... ,5i--H1,5i,.. are vertices of Tk. Since Sj = 5i--Hi
the edge from 5i--Hi to S? is labeled e?, for all i > 1. Thus, 5 = e1, ..... .,.?.... is the
schedule associated with the infinite path 5o			5i H 52 H ...			5i-i H 5?			. of
T?'; this path starts from the root 5o = 5i
Assume that 5 is a schedule associated with an infinite path of TG' that starts from
the root. We must show that 5 is compatible with 0 and is applicable to 1. First
note that for all i, S? is a vertex in Tk, thus S? is compatible with 0 and is applicable
to 1. Since S? = [qi,mi,dij,[q2,m2,d2],. . .,[qi,m?,d?j is compatible with 0, 0 must
contain the path 7rj = [qi,di], [q2,d2j,. . ., [q?,d?] (for all i). Note that, for all i, :ri+i =
7r? [q?+i, di+i] is an extension of the path ?i in 0. Therefore, & contains the infinite path
[q1,di],?2,d2],...,[q?,d?j,... So 5 is compatible with 0. Furthermore, since all Si's are
applicable to 1, by definition of applicability, the infinite schedule 5 is also applicable to
1.			Thus, 5 is compatible with 0 and applicable to 1			0
39
Lemma 5: Let 5 be a schedule associated with a finite path of TG' that starts from
the root. There is a sequence of times T such that (F, Hv, I, 5, T) is a partial run of
Consensusv.
PROOF: By Lemma3l, 5 is applicable to I and compatible with C. Thus 5 is compatible
with some finite path g = [q1,d1j,[q2,d2j,... ,[qi,di],..., [qk, dk] of G. From Property 1
of G (applied to every vertex of the path 9), there is a sequence T=t1,t2,... ....... , tk
of times such that for all i, 1 < i < k, dj = Hv(qj,tj) and q? ? F(ti). From Property
2 of G (applied to every edge of the path g), for all i, 1 < i < k, t? < tj+1. Thus T
is a sequence of increasing times, and, by definition, (F, Hv, I, 5, T) is a partial run of
Consensusv.			0
Lemma 6: Let 5 be a schedule associated with an infinite path of T01 that starts
from the root. if in 5 every correct process takes an infinite number of steps and every
message sent to a correct process is eventually received, there is a sequence of times T
such that (F, Hv, I, 5, T) is a run of Consensusv.
PROOF: Similar to Lemma 5.			0
Lemma 7: For any two initial configurations I and I', if 5 is a vertex of ?1G and is
applicable to I' then 5 is also a vertex of Tk'
PROOF: Follows directly from the definitions.			0
Lemma 8: Let 5 be any vertex of ?G1 and p be any correct process. Let m be a
message in the message buffer of 5(1) addressed to p or the null message. For some
5 has a child 5 (p,m,d) in Tk.
PROOF: From the definition of T01, 5 is compatible with some finite path 9 of 0 and
applicable to I. Let v denote the last vertex of g. By Property 4, there is a d such
that v \p,'f\ is an edge of 0. Therefore, g. [p,d] is a path of 0, and 5. (p, m, d) is
compatible with 0.
it remains to show that 5. (p, m, d) is applicable to I. Since 5 is applicable to I, it
suffices to show that (p, m, d) is applicable to 5(I). But this is true since, by hypothesis,
m is in the message buffer of 5(1) and addressed to p, or the null message. 0
Lemma 9: Let 5 be any vertex of TG' and p be any process. Let m be a message in
the message buffer of 5(1) addressed to p or the null message. Let 5' be a descendent
of 5 such that, for some d, 5'. (p, m, d) is in T'G. For each vertex 5" on the path from
5 to 5' (inclusive), 5". (p, Tn, d) is also in ?G1
PROOF: Since they are vertices of T01, 5, 5" and 5'. (p, m, d) are compatible with some
finite paths 9, 9.9" and 9.9". g' \p,(f\ of 0, respectively. From Property 3 (transitive
closure) of 0, 9.9". [p,(f\ is also a path of 0. So 5" (p, m, d) is compatible with this
path of 0. We now show that 5". (p, m, d) is also applicable to I, and therefore it is a
vertex of Ta'.
40
Since 5" is a vertex of T'G, 5" is applicable to I. If m = A, then (p, m, d) is obviously
applicable to 5"(I). Now suppose m $ A. Since 5' (p, m, d) is a vertex of ?G1, (p, m,
is applicable to 5'(I), and thus m is in the message buffer of S'(J). Since each message
is sent at most once and m is in the message buffers of 5(1) and S'(I), there is no edge
of the type (p, m, --H) on the path from 5 to 5'. So m is also in the message buffer of
5"(I), and (p, m, d) is applicable to 5"(I). 0
j&O			and applicable to I
50 & 5			?5? is compatible with g
repeat forever
j &jtl
Let [q3, d5j be the j-th vertex of path g?
Let m5 be the oldest message addressed to q? in the message buffer of 5j--Hi (I)
(if no such message exists, m? = A)
& (q?,mj,dj)
& 5??? e? ?5? is compatible with g [qi, d1] ... , [q,, dj] and applicable to IJ
Figure 11: Generating schedule 5. E?, compatible with path g 9oo, in TG'
Lemma 10: Let 5, 5o, and Si be any vertices of T?'. There is a finite schedule E
containing only steps of correct processes such that:
1. 5 E is a vertex of TG' and all correct processes have decided in 5
2. For i = 0,1, ifE is applicable to S?(I) then Si E is av&texofT01.
PROOF: Since 5 is a vertex of TG', 5 is compatible with some finite path g of 0 and
is applicable to I. Similarly, 5o and S? are compatible with some finite path go and gi,
respectively, of 0. From Lemma 30 (applied to the last vertices of g, go and gi), 0 has
an infinite path g? = [qi, d1], [q?, d2],.. ., [q?, ....... with the following two properties:
1. There is an edge from the last vertex of g, go and gi to the first vertex of g?. (Thus,
g g?, go g?, and gi g? are infinite paths in 0.)
2. If [p,--H] is a vertex of g? then p is correct; for each correct p, there are infinitely
many vertices [p,--H] in g?.
We now show how to construct the required schedule E. Consider the infinite sequence
of schedules 50, 5i,52,?? , 5,, . constructed by the algorithm in Figure 11. An easy
induction shows that for all j > 0, S? is applicable to I and is compatible with g [qi, d1].
o+ .. [q?,d,], a prefix of the path g g? inC. So, for all j >0, S? is a vertex ofT01.
Consider the infinite path of T01 that starts from the root of Tk then goes to 5? = 5,
41
and then to 5i,52,???,5?,?? The infinite schedule associated with that path is 500 =
5 e1 e2?...?e3?... Note that schedule E00 = e1 e2 .... ..... is compatible with path
Yoo of G. By Property (2) of path g00, every correct process p takes an infinite number
of steps in E00 (and thus also in 500 = 5 E00). Since in each one of these steps p
receives the oldest message that is addressed to it, every message sent to p (in 500) is
eventually received. By Lemma 6, there is a T such that R = (F, Hv, 1,500, T? is a run
of Consensusv.
From the termination requirement of Consensus, 500 has a finite prefix 5d such that
all correct processes have decided in 5d(1). There are two cases:
o+ 5d is a prefix of 5. Since decisions are irrevocable, all correct processes remain
decided in 5(1). Thus Sj, the empty schedule, is the required E.
o+ 5 is a prefix of 5d Thus, 5d = 5 E where E is a finite prefix of E00. Since E00
is compatible with g00, E is compatible with a prefix of g00. Now consider 5o (the
following argument also applies to Si). Since 5o is compatible with go, 5o E is
compatible with a prefix of go g00, a path in G. So, 5o E is compatible with G.
If 5o E is also applicable to I, then, by the definition of Tk, it is a vertex of T'G.
The same argument holds for Si. It remains to show that E contains only steps
of correct processes. This is immediate from Property (2) of g00 and from the fact
that E is compatible with a prefix of g00.			0
Acknowledgement5
We would like to thank Prasad Jayanti for his valuable comments on various versions
of this paper. We would also like to thank Cynthia Dwork, Dexter Kozen and the
distributed systems group at Cornell for many helpful discussions.
References
[ABD+87]
[BMZ88]
Hagit Attiya, Amotz Bar-Noy, Danny Dolev, Daphne Koller, David Peleg,
and Rildiger Reischuk. Achievable cases in an asynchronous environment. In
Proceedings of the Twenty-Ei?hth Symposium on Foundations of Computer
Science, pages 337--H346. IEEE Computer Society Press, October 1987.
Ofer Biran, Shlomo Moran, and Shmuel Zaks. A combinatorial characteriza-
tion of the distributed tasks that are solvable in the presence of one faulty
processor. In Proceedings of the Seventh A CM Symposium on Principles of
Distributed Computing, pages 263--H275. ACM Press, August 1988.
[BW87J Michael Bridgland and Ronald Watro. Fault-tolerant decision making in t?
tally asynchronous distributed systems. In Proceedings of the Sixth ACM
42
Symposium on Principles of Distributed Computing, pages 52--H63. ACM Press,
August 1987.
[CD89] Benny Chor and Cynthia Dwork. Randomization in byzantine agreement.
Advances in Computer Research, 5:443--H497, 1989.
[CT91]
[DDS87]
[DLP+86]
Tushar D. Chandra and Sam Toueg. Unreliable failure detectors for asyn-
chronous systems. Technical Report 91-1225, Department of Computer Sci-
ence, Cornell University, August 1991.			Available by anonymous ftp from
ftp.cs.cornell.edu			in
pub/chandra/failure detectors. algoritlims dvi. z. A preliminary ver-
sion appeared in the Proceedings of the Tenth ACM Symposium on Principles
of Distributed Computing, pages 325--H340. ACM Press, August 1991.
Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. On the minimal syn-
chronism needed for distributed consensus. Journal of the ACM, 34(1):77--H97,
January 1987.
Danny Dolev, Nancy A. Lynch, Shlomit 5. Pinter, Eugene W. Stark, and
William E. Weihl. Reaching approximate agreement in the presence of faults.
Journal of the ACM, 33(3):499--H516, July 1986.
[DL588] Cynthia Dwork, Nancy A. Lynch, and Larry Stockmeyer. Consensus in the
presence of partial synchrony. Journal of the ACM, 35(2):288--H323, April1988.
[FLP85j Michael J. Fischer, Nancy A. Lynch, and Michael 5. Paterson. Impossibility of
distributed consensus with one faulty process. Journal of the ACM, 32(2):374--H
382, April 1985.
Aleta Ricciardi and Kenneth P. Birman. Using process groups to implement
failure detection in asynchronous environments. In Proceedings of the Tenth
ACM Symposium on Principles of Distributed Computing, pages 341--H351.
ACM Press, August 1991.
[RB91]
