BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1377
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Unreliable Failure Detectors for Asynchronous Distributed Systems
AUTHOR:: Chandra, Tushar Deepak 
DATE:: August 1993
PAGES:: 120
COPYRIGHT:: Tushar Deepak Chandra 1993 - All Rights Reserved
ABSTRACT::
It is well-known that several fundamental problems of fault-tolerant 
distributed computing, such as Consensus and Atomic Broadcast, 
cannot be solved in asynchronous systems with crash failures. These 
impossibility results stem from the lack of reliable failure detection in such 
systems. To circumvent such impossibility results, we introduce the concept of 
unreliable failure detectors that can make mistakes, and study the 
problem of using them to solve Consensus (and Atomic Broadcast).

It is easy to solve Consensus using a ``perfect'' failure detector (one that 
does not make mistakes). But is perfect failure detection necessary to solve
Consensus? We show that Consensus is solvable with unreliable failure 
detectors, even if they make an infinite number of mistakes. This leads to the 
following question: What is the ``weakest'' failure detector for solving 
Consensus? We introduce a notion of algorithmic reducibility that allows us to 
compare seemingly incomparable failure detectors. Using this concept, we show 
that one of the failure detectors that we introduce here is indeed the 
weakest failure detector for solving Consensus in asynchronous systems with a 
majority of correct processes.

We also show that Consensus and Atomic Broadcast are equivalent in 
asynchronous systems. Thus all our results regarding the solvability of 
Consensus using failure detectors, apply to Atomic Broadcast as well.

The work in this thesis was funded by an IBM graduate fellowship and grants 
from NSF, DARPA/NASA, the IBM Endicott Programming Laboratory, Siemens Corp. 
and the Natural Sciences and Engineering Research Council of Canada.
END:: CORNELLCS//TR93-1377
BODY::
Unreliable Failure Detectors for
Asynchronous Distributed Systems
Tushar Deepak Chandra
Ph.D Thesis
93-1377
August 1993
Department of Computer Science
Cornel University
Ithaca, NY 14853-7501
UXR?ELIABLE FAILUR?E DETECTOR?S FofQ
ASYXCHR?ONOUS DISTR?IBUTED SYSTEMS
A Dissertation
Presented to the F'aculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Tusliar Deepak Chandra
May 1993
o Tushar Deepak Chandra 1993
ALL RIGliTS RESERVED
UNRELIABLE FAILURE DETECTORS FOR ASYNCHRONOUS
DISTRIBUTED SYSTEMS
Tushar Deepak Chandra, Ph.D.
Cornell University 1993
It is well-known that several fundamental problems of fault-tolerant distributed
computing, such as Consensus and Atomic Broadcast, cannot be solved in asyn-
chronous systems with crash failures. These impossibility results stem from the
lack of reliable failure detection in such systems. To circumvent such impossibil-
ity results, we introduce the concept of unreliable failure detectors that can make
mistakes, and study the problem of using them to solve Consensus (and Atomic
Broadcast).
It is easy to solve Consensus using a "perfect" failure detector (one that does
not make mistakes). But is perfect failure detection necessary to solve Consensus?
We show that Consensus is solvable with unreliable failure detectors, even if they
make an infinite number of mistakes. This leads to the following question: What
is the "weakest" failure detector for solving Consensus? We introduce a notion of
algorithmic reducibility that allows us to compare seemingly incomparable failure
detectors. Using this concept, we show that one of the failure detectors that
we introduce here is indeed the weakest failure detector for solving Consensus in
asynchronous systems with a majority of correct processes.
We also show that Consensus and Atomic Broadcast are equivalent in asyn-
chronous systems. Thus all our results regarding the solvability of Consensus using
failure detectors, apply to Atomic Broadcast as well.
The work in this thesis was funded by an IBM graduate fellowship and grants
from NSF, DARPA/NASA, the IBM Endicott Programming Laboratory, Siemens
Corp and the Natural Siences and Engineering Research Council of Canada.
Biographical Sketch
Thshar Deepak Chandra was born in New Delhi, India on November 13,1966. He
spent his childhood in various cities in India: Bombay, Calcutta and finally Kanpur.
After completing high school at the Doon school, he went on to do a Bachelor of
Technology in Computer Science at the Indian Institute of Technology at Kanpur.
He joined the graduate program in Computer Science at Cornell University in
August 1988.
iii
This thesis is dedicated to my parents who taught me how to think.
iv
Acknowledgements
A large number of people contributed either directly or indirectly to this thesis. I
was extremely fortunate to have Sam Toneg as my advisor. Without his guidance,
I would still be trying to shave off deltas from broadcast algorithms; without his
cooking knowledge, I would never have learnt to cook pasta sauce "a la putanesca"
Sam has put in as much work?if not more--Htowards this thesis, as I have.
I am indebted to Vassos Hadzilacos, my other co-author, for his invaluable
contributions to this work. Without his help, Chapter 4 would not have happened.
Further, his critical readings of the work presented in Chapter 3 have greatly
influenced the way it is presented.
The idea of using unreliable failure detectors came from the many invaluable
discussions I had with the Isis folks, in particular Aleta Ricciardi and Ken Bir-
man (also see [RB9l]). Somnath Biswas, Dexter Kozen, George Odifreddi and
Rod Downey taught me recursion theory and thus gave me the tools to prove the
result in Chapter 4. I was introduced to the rotating coordinator paradigm by a
homework solution of Navin Budhiraja (see Figure 3.6). Prasad Jayanti greatly
simplified the algorithm in Figure 3.3. Earlier, there were two algorithms in place
of this algorithm and twice the text!
I would like to thank my thesis committee, i.e. Ken Birman, Dexter Kozen,
v
Anil Nerode, and Sam Toueg for their guidance and support. I would also like to
thank O?zalp Babaoglu, Cynthia Dwork, Keith Marzullo, Gil Neiger, Mike Reiter,
Fred Schneider and the distributed systems group at Cornell for their comments
and criticisms.
A special thanks to my various housemates, V. V. Bhaskar, Sandip Bose, Navin
Budhiraja, Suresh Chari, Radha Jagadeesan, "Junglee" Jayesh, Mayan Moudgill,
and Alka Parikh for the many stimulating discussions that have influenced me and
hence this thesis.
Finally, I'd like to thank my family: Romesh, Manjula, and Siddharth Chandra
for putting up with erratic hours, forgetfulness, etc. and for providing me with their
love and support.
vi
Table of Contents
1			Introduction
2			Models of distributed systems
3 Solving agreement problems with unreliable failure detectors
3.1			The model . .
3.1.1			Failures and failure patterns .
3.1.2			Failure detectors 			. . .
3.1.3			Completeness . . .
3.1.4			Accuracy
3.1.5			Some failure detector definitions			. .
3.1.6			Algoritlims and runs . . .
3.1.7			Reducibility . .
3.2			From			weak completeness to strong completeness
3.3			Reliable Broadcast
3.4			The Consensus problem . . . .
3.5			Solving Consensus using unreliable failure detectors			.
3.5.1			Using a Strong Failure Detector 8
3.5.2			Using an Eventually Strong Failure Detector			?S
3.5.3			A lower bound on fault-tolerance . .
3.6			On Atomic Broadcast
3.6.1 Reducing Atomic Broadcast to Consensus
4
The weakest failure detector for solving Consensus
4.1			The model. .
4.1.1			Failure detectors			.
4.1.2 Algorithms . . .
4.1.3 Configurations, runs and environments
4.2			The Consensus problem
4.3			Reducibility			. .			.
4.4			An outline of the result .
4.5			The reduction algorithm .
4.5.1			A DAG and a forest			. .			.
vii
11
14
14
15
15
16
16
18
18
20
21
25
26
27
27
32
39
41
43
49
51
51
52
54
56
58
58
60
61
4.5.2			Tagging the simulation forest
4.5.3			Of hooks and forks			. .
4.5.4			Extracting the correct process
4.5.5			The reduction algorithm TD?? 			. . .
Discussion
4.6.1			Granularity of atomic actions			. 			. .
4.6.2			Weak Consensus
4.6.3 Failure detectors with infinite range of output values
4.6.4			Open problems			. . .			.
67
69
74
78
90
90
92
95
96
98
98
103
103
105
106
106
109
112
4.6
5
Related work
5.1			Partial synchrony . .
5.2 The application of failure detection in shared memory systems
5.3			The Isis toolkit			. .			. . .			.
5.4			Other work
A hierarchy of failure detectors and bounds on fault-tolerance
A.1			Mistakes and repentance . . .			. .			. . .			.
A.2			A hierarchy of repetant failure detectors
A.3			Tight bounds on fault-tolerance . .			. .			. .			.
A
Bibliography
117
viii
List of Figures
3.1
3.2
3.3
3.4
3.5
3.6
3.7
4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
Some failure detector specifications based on accuracy and com-
pleteness.			. .			.			. .			. . .
Transforming D into D'
T????: From Weak Completeness to Strong Completeness . .
Reliable Broadcast by message diffusion			. .
Solving Consensus using 8. . . . .			.
Solving Consensus using ?S			. .			. . .
Using Consensus to solve Atomic Broadcast
Generating scheduleS E?, compatible with path			9			9oo, in ?k
A fork p is the deciding process .			.			.
A hook--Hp is the deciding process . .
Generating path 7r in T?			. .			.			.
The decision gadgets in T? if z is bivalent critical
Lemma 53.
Lemma 54 . . . . .
Selecting a correct process .			. .
Process p's communication component
Process p's computation component
5.1 A time-out based implementation of ?? in some models of partial
synchrony.			. .			.
The hierarchy of repentant failure detectors ordered by reducibility.
This figure also shows the maximum number of faulty processes for
which Consensus can be solved using each failure detector in this
hierarchy. . . .			. .
ix
19
20
22
25
29
34
44
66
69
69
71
72
75
76
78
79
85
100
A.1
110
Chapter 1
Introduction
The design and verification of fault-tolerant distributed applications is widely
viewed as a complex endeavour. In recent years, several paradigms have been
identified which simplify this task. Key among these are Consensus and Atonize
Broadcast. R?oughly speaking, Consensus allows processes to reach a common deci-
sion, which depends on their initial inputs, despite failures. Consensus algorithms
can be used to solve many problems that arise in practice, such as electing a
leader or agreeing on the value of a replicated sensor. Atomic Broadcast allows
processes to reliably broadcast messages, so that they agree on the set of mes-
sages they deliver and the order of message deliveries. Applications based on these
paradigms include SIFT [WLG+78], State Machines [Lam78,Sch9O], Isis [BJ87,
BCJ+90j, Psync [PBS89], Amoeba [Mul87J, Delta-4 [Pow91j, Transis [ADKM91],
llAS [Cri87], FAA [CDD9Oj, and Atomic Commitment.
Given their wide applicability, Consensus and Atomic Broadcast have been
extensively studied by both theoretical and experimental researchers for over a
decade. In this thesis, we focus on solutions to Consensus and Atomic Broadcast
in the asynchronous model of distributed computing. Informally, a distributed
system is asynchronous if there is no bound on message delay, clock drift, or the
time necessary to execute a step. Thus, to say that a system is asynchronous is to
2
make no timing assumptions whatsoever. This model is attractive and has recently
gained much currency for several reasons: It has simple semantics; applications
programmed on the basis of this model are easier to port than those incorporating
specific timing assumptions; and in practice, variable or unexpected workloads are
sources of asynchrony thus synchrony assumptions are at best probabilistic.
Although the asynchronous model of computation is attractive for the reasons
outlined above, it is well known that Consensus and Atomic Broadcast cannot be
solved deterministically in an asynchronous system that is subject to even a single
crash failure [FLP85,DDS87].1 Essentially, the impossibility results for Consensus
and Atomic Broadcast stem from the inherent difficulty of determining whether a
process has actually crashed or is only "very slow"
To circumvent these impossibility results, previous research focused on the use
of randomization techniques [CD89], the definition of some weaker problems and
their solutions [DLP+86,ABD+87,BW87,BMZ88J, or the study of several models of
partial sunchrony [DDS87,DLS88]. Nevertheless, the impossibility of deterministic
solutions to many agreement problems (such as Consensus and Atomic Broadcast)
remains a major obstacle to the use of the asynchronous model of computation for
fault-tolerant distributed computing.
In this thesis, we propose an alternative approach to circumvent such impos-
sibility results, and to broaden the applicability of the asynchronous model of
computation. Since impossibility results for asynchronous systems stem from the
inherent difficulty of determining whether a process has actually crashed or is only
"very slow", we propose to augment the asynchronous model of computation with
a model of an external failure detection mechanism that can make mistakes. In
particular, we model the concept of unreliable failure detectors for systems with
crash failures.
We consider distributed failure detectors: each process has access to a local
1Roughly speaking, a crash failure occurs when a process that has been executing correctly,
stops prematurely. Once a process crashes, it does not recover.
3
faziure detector module. Each local module monitors a subset of the processes in
the system, and maintains a list of those that it currently suspects to have crashed.
We assume that each failure detector module can make mistakes by erroneously
adding processes to its list of suspects: i.e, it can suspect that a process p has
crashed even though p is still running. If this module later believes that suspecting
p was a mistake, it can remove p from its list. Thus, each module may repeatedly
add and remove processes from its list of suspects. Furthermore, at any given time
the failure detector modules at two different processes may have different lists of
suspects.
It is important to note that the mistakes made by an unreliable failure detector
should not prevent any correct process from behaving according to specification
even if that process is (erroneously) suspected to have crashed by all the other
processes. For example, consider an algorithm that uses a failure detector to solve
Atomic Broadcast in an asynclironous system. Suppose all the failure detector
modules wrongly (and perm?nently) suspect that correct process p has crashed.
The Atomic Broadcast algorithm must still ensure that p delivers the same set of
messages, in the same order, as all the other correct processes. Furthermore, if p
broadcasts a message m, all correct processes must deliver m.2
We define failure detectors in terms of abstract properties as opposed to giving
specific implementat?ons; the hardware or software implementation of failure detec-
tors is not the concern of this thesis. This approach allows us to design applications
and prove their correctness relying solely on these properties, without referring to
low-level network parameters (such as the exact duration of time-outs that are
used to implement failure detectors). This makes the presentation of applications
and their proof of correctness more modular. Our approach is well-suited to model
many existing systems that decouple the design of fault-tolerant applications from
2A different approach was taken by the Isis system [RB9l]: a correct process that is wrongly
suspected to have crashed, is forced to leave the system. In other words, the Isis failure detector
forces the system to conform to its view. To applications such a failure detector makes no
mistakes. For a more detailed discussion on this, see Section 5.3.
4
the underlying failure detection mechanisms, such as the Isis Toolkit [BCJ+90J for
asynchronous fault-tolerant distributed computing.
We characterize a failure detector by specifying the completeness property and
accuracy property that it must satisfy. Informally, completeness requires that the
failure detector eventually suspects every process that actually crashes,3 while ac-
curacy restricts the mistakes that a failure detector can make. We define two
completeness and four accuracy properties, which gives rise to eight failure detec-
tors, and consider the problem of solving Consensus using each one of them.4
To do so, we first introduce the concept of "reducibility" among failure detec-
tors. Informally, a failure detector D' is reducible to failure detector D if there is a
distributed algorithm TD?Dl that can use D to emulate D1. Given this reduction
algonthm, anything that can be done using failure detector Dt, can be done using
D instead. Two failure detectors are equivalent if they are reducible to each other.
Using this concept, we partition our eight failure detectors into four equivalence
classes, and consider how to solve Consensus for each class.
We show that only four of our eight failure detectors can be used to solve
Consensus in systems in which any number of processes may crash. However, if
we assume that a majority of the processes do not crash, then any of our eight
failure detectors can be used to solve Consensus. In order to better understand
where the majority requirement becomes necessary, we study an infinite hierarchy
of failure detectors that contains the eight failure detectors mentioned above, and
show exactly where in this hierarchy the majority requirement becomes necessary.
Cf special interest is ??`V, the weakest failure detector considered in this thesis.
Informally, ?)`V satisfies the following two properties:
3In this introduction, we say that the failure detector suspects that a process p has crashed if
any local failure detector module suspects that p has crashed.
4We later show that Consensus and Atomic Broadcast are equivalent in asynchronous sys-
tems: any Consensus algorithm can be transformed into an Atomic Broadcast algorithm and
vice versa. Thus, we can focus on Consensus since all our results will automatically apply to
Atomic Broadcast as well.
5
o+ Completeness: There is a time after which every process that crashes is always
suspected by some correct process.
o+ Accuracy: There is a time after which some correct process is never suspected
by any correct process.
The failure detector ?W can make an infinite number of mistakes: Each local fail-
ure detector module of ?W can repeatedly add and then remove correct processes
from its list of suspects (this reflects the inherent difficulty of determining whether
a process or a link is just slow or whether it has crashed). Moreover, some correct
processes may be erroneously suspected to have crashed by all the other processes
throughout the entire execution.
The two properties of ?)`V state that eventually something must hold forever'
this may appear too strong a requirement to implement in practice. However, when
solving a problem that "terminates", such as Consensus, it is not really required
that the properties hold forever, but merely that they hold for a sufficiently long
time, i.e., long enough for the algorithm that uses the failure detector to achieve
its goal. For instance, in practice our algorithm that solves Consensus using ??`V
only needs the two properties of ?)`V to hold for a relatively short period of time.
However, in an asynchronous system it is not possible to quantify "sufficiently
long", since even a single process step or a single message transmission is allowed
to take an arbitrarily long amount of time. Thus, it is convenient to state the
properties of ?W in the stronger form given above.
Another advantage of using ?W (as opposed to stronger failure detectors) is
the following. Consider an application that relies on ?W for its correctness. If
this application is run in a system in which the failure detector "malfunctions" and
fails to meet the specification of ?)?V, then we may lose the liveness properties of
the application, but its safety properties will never be violated.
The failure detector abstraction is a clean extension to the asynchronous model
of computation that allows us to solve many problems that are otherwise unsolv-
6
able. Naturally, the question arises of how to support such an abstraction in an
actual system. Since we specify failure detectors in terms of abstract properties,
we are not committed to a particular implementation. For instance, one could
envision specialised hardware to support this abstraction. However, most imple-
mentations of failure detectors are based on time-out mechanisms. For the purpose
of illustration, we now outline one such implementation of ??V.
Informally, if a process times-out on some process q, it adds q to its list of
suspects, and it broadcasts a message to all processes (including q) with this in-
formation. Any process that receives this broadcast adds q to its list of suspects.
If q has not crashed, it broadcasts a refutation. If a process receives q's refutation,
it removes q from its list of suspects.
In the pu?el? asynchronous system, this scheme does not implement ?VV:? an
unbounded sequence of premature time-outs (with corresponding refutations) may
cause every correct process to be repeatedly added and then removed from every
correct process' list of suspects, thereby violating the accuracy property of ?W.
Nevertheless, in many practical systems, one can choose the time-out periods so
that eventually there are no premature time-outs on at least one correct process
p. This gives the accuracy property of ?)`V: there is a time after which p is
permanently removed from all the lists of suspects. Recall that, in practice, it is not
necessary for this to hold permanently; it is sufficient that it holds "long enough"
for the application using the failure detector to complete its task. Accordingly,
it is not necessary for the premature tim&outs on p to cease permanently: it is
sufficient that they cease for "long enough"
Having made the point that ?W can be implemented in practical systems using
time-outs, we reiterate that all reasoning about failure detectors (and algorithms
that use them) should be done in terms of their abstract properties and not in terms
5Indeed, no scheme could implement ?`4' in the purely asynchronous system: as we show
in Section 3.5.2, such an implementation could be used to solve Consensus in such a system,
contradicting the impossibility result of [FLP85j.
7
of any particular implementation. This is an important feature of this approach,
and the reader should refrain from thinking of failure detectors in terms of specific
time-out mechanisms.
The failure detection properties of ?)`Y are sufticieni to solve Consensus in
asynchronous systems. But are they necessary? For example, consider failure de-
tector A that satisfies the completeness property of ?W and the following weak-
ening of ?W's accuracy property:
o+ Accuracy: There is a time after which some correct process is never suspected
by at least 99% of the correct processes.
A is clearly weaker than ?W. Is it possible to solve Consensus using A? Indeed
what is the weakest failure detector sufficient to solve Consensus in asynchronous
systems? In trying to answer this fundamental question we run into a problem.
Consider failure detector B that satisfies the following two properties:
o+ Completeness: There is a time after which every process that crashes is always
suspected by all correct processes.
o+ Accuracy: There is a time after which some correct process is never suspected
by a majority of the processes.
It seems that ? and ?)4? are incomparable: B's completeness property is stronger
than ?W's, and L3's accuracy property is weaker than ?W's. Is it possible to
solve Consensus in an asynchronous system using ?? The answer turns out to be
yes (provided that this asynchronous system has a majority of correct processes,
as ?W also requires). Since ?W and !3 appear to be incomparable, one may be
tempted to conclude that ?>V cannot be the "weakest" failure detector with which
Consensus is solvable. Even worse, it raises the possibility that no such "weakest"
failure detector exists.
However, a closer examination reveals that 13 and ?W are indeed comparable:
there is a distributed algorithm TBH?w that can transform 13 into a failure detector
8
with the properties of ?W. Ts??w works for any asynchronous system that has
a majority of correct processes. In other words, ?VV is reducible to 13 in such a
system.
We prove that ?)/V is reducible to any failure detector D that can be used
to solve Consensus (this result holds for any asynchronous system). We show
this reduction by giving a distributed algorithm TD??w that transforms any such
D into ???`. Thus, in a precise sense, the failure detector QW is necessary and
sufficient for solving Consensus in asynchronous systems (with a majority of correct
processes). This result is further evidence to the importance of ?)/V for fault-
tolerance in asynchronous distributed computing.
In our discussion so far, we focused on the Consensus problem. In Section
3.6, we show that Consensus is equivalent to Atomic Broadcast in asynchronous
systems with crash failures. This is shown by reducing each problem to the other.6
In other words, a solution for one automatically yields a solution for the other. Both
reductions apply to any asynchronous system (in particular, they do not require
the assumption of a failure detector). Thus, Atomic Broadcast can be solved using
the unreliable failure detectors described in this thesis. Furthermore, ?W is the
weakest failure detector that can be used to solve Atomic Broadcast.
A different tack on circumventing the unsolvability of Consensus is pursued in
[DDS87J and [DLS88]. The approach of those papers is based on the observation
that between the completely synchronous and completely asynchronous models of
distributed systems there lie a variety of intermediate partiallysynchronousmodels.
In particular, those two papers consider 34 different models of partial synchrony
and for each model determine whether or not Consensus can be solved. In this
thesis, we argue that partial synchrony assumptions can be encapsulated in the
unreliability of the failure detector. For example, we show how to implement one
of our failure detectors (which is stronger than ??V), in the models of partial
6They are actually equivalent even in asynchronous systems with arbitrary, i.e., "Byzantine"
failures. However, that reduction is more complex and is omitted from this thesis.
9
synchrony considered in [DLS88j. This immediately implies that Consensus and
Atomic Broadcast can be solved in these models, Thus, our approach can be used
to unify several seemingly unrelated models of partial synchrony.7
As we argued earlier, using the asynclironous model of computation is highly
desirable in many applications: it results in code that is simple, portable and ro-
bust, However, the fact that fundamental problems such as Consensus and Atomic
Broadcast have no (deterministic) solutions in this model is a major obstacle to
its use in fault-tolerant distributed computing. Our model of unreliable failure
detectors provides a natural and simple extension of the asynchronous model of
computation, in which Consensus and Atomic Broadcast can be solved determin-
istically. Thus, this extended model retains the advantages of asynchrony without
inheriting its disadvantages. We believe that this approach is an important contri-
bution towards bridging the gap between known theoretical impossibility results
and the need for fault-tolerant solutions in real systems.
The remainder of this thesis is organised as follows. In Chapter 2, we informally
describe several different models of fault-tolerant distributed computing. This puts
our work into perspective.
In Chapter 3, we briefly describe our model and introduce eight failure detectors
in terms of their abstract properties. We describe solutions to the Consensus
problem with each one of these eight failure detectors. Finally, we show that
Consensus and Atomic Broadcast are equivalent. Thus, all our solutions to the
Consensus problem can be transformed into solutions to Atomic Broadcast.
In Chapter 4, we present a more detailed model of asynchronous distributed
computing with unreliable failure detection. We prove that ?W is reducible to
any failure detector for solving Consensus in this model. This shows that ?W is
the weakest failure detector for solving Consensus in asynchronous systems.
Chapter 5 discusses related work, and in particular, we describe an implemen-
7For a more detailed discussion on this, see Chapter 5.
10
tation of an unreliable failure detector that is more powerful than ?W, in several
models of partial synchrony. In the Appendix, we define a failure detector hierarchy
based on the strength of their accuracy and derive lower bounds on fault-tolerance.
Chapter 2
Models of distributed systems
Problems in fault-tolerant distributed computing have been studied in a variety of
computational models, In order to put our work into perspective, we give a broad
overview of some of these models. Such models fall into two broad categories,
message-passin9 and shared-rnerno?y. In the former, processes communicate by
sending and receiving messages over the links of a network; in the latter, they
communicate by accessing shared objects, such as registers, queues, etc. In this
thesis we focus on message-passing models. The following parameters determine a
particular message-passing model:
Types of process failures: A process is fau?ty in an execution if its behaviour
deviates from that prescribed by the algorithrn it is running; otherwise, it is
correct. Several types of failures have been studied in the literature. These
include (1) crash failures--Hin which a faulty process stops prematurely and
does nothing from that point on [LF82], (2) omission failures in which a
faulty process can intermittently omit to send or receive messages [Had84,
PT86], and (3) arbitraryfailures in which faulty processes can behave arbi-
trarily [LSP82]. In this thesis, we only consider crash failures.
11
12
Types of communication failures: Several models of link failures have been
considered in the literature, including crash, omission and arbitrary failures.
In this thesis, we assume that the communication subsystem is reliable, i.e.,
no link failures occur.
Network topology: Several types of network topologies have been studied in the
literature. These fall into three broad categories: point-to-point networks,
brvadcast networks, and miwed networks that permit broadcasts to a limited
number of processes. This thesis focuses on point-to-point networks.
Several types of point-to-point networks have been studied. These include
(1) completely connected networks in which there is a link between every
pair of processes, (2) rings?in which processes are arranged in a ring, etc.
Though the physical network is almost never completely connected, network
layer protocols build virtual links between every pair of processes. Thus, in
this thesis, we assume that the network is completely connected.
Deterministic versus randomized processes: A process' behavior may be ei-
ther deter?inistic or randomized. In general, a process can be modeled as a
(possibly infinite state) automaton. Roughly speaking, the state transition
relation of a deterministic process uniquely determines the state that results
from the execution of each step on the current state. With a randomized
process, the execution of a given step on the current state may result in one
of several possible states, and each such transition has an associated proba-
bility. Informally, a process can "toss coins" to determine which transition
to take.
Since the impossibility of Consensus applies to deterministic processes, they
are the primary focus of this thesis. However, our results of Section 3.6,
regarding the equivalence of Consensus and Atomic Broadcast, apply to both
deterministic and randomized processes.
13
Synchrony: Synclirony is an attribute of both processes and communication. We
say that a system is synchronous if it satisfies the following properties:
o+ There is a known upper bound ? on message delay; this consists of the
time it takes for sending, transporting, and receiving a message over a
link.
o+ Every process p has a local clock Cp with known bounded rate of drift
p > 0 with respect to real-time. That is, for all p and all t > t',
Cp(t)--HCp(t')
(ltp)?1 ?			y?tt)
where Cp(t) is the reading of Cp at real-time t.
o+ There are known upper and lower bounds on the time required by a
process to execute a step.
A system is asynchronous if there is no bound on message delay, clock drift,
or the time necessary to execute a step. Thus, to say that a system is
asynchronous is to make no timing assumptions whatsoever.
The synchronous and asynchronous models are the two extremes of a spec-
trum of possible models. Many intermediate models have also been studied.
For example, processes may have bounded speeds and perfectly synchronized
clocks, but message delays may be unbounded [DDS87j. Or, message delays
may be bounded but unknown [DLS88j
This thesis focuses on the asynchronous model of computation. However,
our work has a direct impact on several models of parttal synchrony (this is
considered in more detail in Section 5.1).
Chapter 3
Solving agreement problems with
unreliable failure detectors
In this chapter, we describe algorithms for Consensus and Atomic Broadcast us-
ing unreliable failure detectors. To do so, we first present an informal model of
distributed computing with unreliable failure detectors. Although informal, the
model that we give here is sufficient for the presentation of the algorithms and
their proofs. In Chapter 4, we extend our model and make it more precise in order
to prove a subtle lower bound.
3.1 The model
We consider asynchronous distributed systems in which there is no bound on mes-
sage delay, clock drift, or the time necessary to execute a step. Our model of asyn-
chronous computation with failure detection is patterned after the one in [FLP85j.
The system consists of a set of u processes, 11 ?P1,P2,.. ,pn?. Every pair of
processes is connected by a reliable communication channel.
To simplify the presentation of our model, we assume the existence of a discrete
global clock. This is merely a fictional device: the processes do not have access to
it. We take the range T of the clock's ticks to be the set of natural numbers.
14
15
3.1.1 Failures and failure patterns
Processes can fail by crashing, i.e., by prematurely halting. A failure pattern F is
a function from ? to 211, where F(t) denotes the set of processes that have crashed
through time t. Once a process crashes, it does not "recover", i.e., Vt : F(t) c
F(t + 1). We define crashed(F) UtET F(t) and correct(F) II --H crashed(F). If
p El crashed(F) we say p crashes in F and if p El correct(F) we say p is correct in
F. We consider only failure patterns F such that at least one process is correct
i.e., correct(F) # ?.
3.1.2 Failure detectors
Each failure detector module outputs the set of processes that it currently suspects
to have crashed.1 A failure detector history H is a function from II x I to 211.
H(p, t) is the value of the failure detector module of process p at time t. If q El
H(p, t), we say that p suspects q at time tin H. We omit references to H when it
is obvious from the context. Note that the failure detector modules of two different
processes need not agree on the list of processes that are suspected to have crashed
i.e., if p # q then H(p,t) # H(q,t) is possible.
Informally, a failure detector D provides (possibly incorrect) information about
the failure pattern F that occurs in an execution. Formally, failure detector D is
a function that maps each failure pattern F to a set of failure detector histories
D(F). This is the set of all failure detector histories that could occur in executions
with failure pattern F and failure detector D.2
In this thesis, we do not define failure detectors in terms of specific implementa-
tions. Such implementations would have to refer to low-level network parameters,
such as the network topology, the message delays, and the accuracy of the local
1In Chapter 4 we study a more general class of failure detectors: their modules can output
values from an arbitrary range.
2111 general, there are many executions with the same failure pattern F (e.g, these executions
may differ by the pattern of their message exchange). For each such execution, D may give a
different failure detector history.
16
clocks. To avoid this problem, we specify a failure detector ill terms of two abstract
properties that it must satisfy: completeness and accuracy. This allows us to design
applications and prove their correctness relying solely on these properties.
3.1.3 Completeness
We consider two completeness properties:
o+ Strong completeness: Eventually every process that crasbes is permanently
suspected by every correct process. Formally, D satisfies strong completeness
if:
VP, VH ? D(F), at ? T, Vp ? crashed(F), Vq C correct(F), Vt' > t : p ? H(q, t')
o+ Weak completeness: Eventually every process that crashes is permanently
suspected by some correct process. Formally, D satisfies weak completeness
if:
VF, VH ? D(F), at c T, Vp ? crashed(F), aq ? correct(F), Vt' > t : p c H(q, t')
However, completeness by itself is not a useful property. To see this, consider
a failure detector which causes every process to permanently suspect every other
process in the system. Such a failure detector trivially satisfies strong completeness
but is clearly useless since it provides no information about failures. To be useful, a
failure detector must also satisfy some accuracy property that restricts the mistakes
that it can make. We now consider such properties.
3.1.4 Accuracy
Consider the following two accuracy properties:
o+ Strong accuracy: No process is suspected before it crashes. Formally, D
satisfies strong accuracy if:
17
VP,VHe D(F),Vt ? ?Vp,q ? II --H F(t) : p ? H(q,t)
Since it is difficult (if not impossible) to achieve strong accuracy in many practical
systems, we also define:
o+ Weak accuracy: Some correct process is never suspected. Formally, D satisfies
weak accuracy if:
VF, VH ? D(F), ?p ? correct(F), Vt ? T, Vq ? II --H F(t) : p ? H(q, t)
Even weak accuracy guarantees that at least one correct process is never suspected.
Since this type of accuracy may be difficult to achieve, we consider failure detec-
tors that may suspect every process at one time or another. Informally, we only
require that strong accuracy or weak accuracy are eventually satisfied. The re-
sulting properties are called eventual strong accuracy and eventual weak accuracy,
respectively.
For example, eventual strong accuracy requires that there is a time after which
strong accuracy holds. Formally, D satisfies eventual strong accuracy if:
VF,VH ? D(F),Ht c T,Vt' > t,Vp,q E It --H F(t') : p ? H(Qt!)
An observation is now in order. Since all faulty processes will crash after some
finite time, we have:
VF, ?t ? T,Vt' > t : It --H F(t')			correct(F)
Thus, an equivalent and simpler formulation of eventual strong accuracy is:
o+ Eventual strong accuracy: There is a time after which correct processes are
not suspected by any correct process. Formally, D satisfies eventual strong
accuracy if:
VF, VH c D(F), ?t ? T Vt' > t Vp, q ? correct(F) : p ? H(q, t')
18
Similarly, we specify eventual weak accuracy as follows:
o+ Eventual weak accuracy: There is a time after which some correct process is
never suspected by any correct process. Formally, D satisfies eventual weak
accuracy if:
VF, VH ? D(F), ?t ? T, ?p c correct(F), Vt' > t, Vq c correct(F) : p ? H(q, t')
We will refer to eventual strong accuracy and eventual weak accuracy as even-
tual accuracy properties, and strong accuracy and weak accuracy as perpetual ac-
curacy properties.
3.1.5 Some failure detector definitions
A failure detector can be specified by stating the completeness property and the
accuracy property that it must satisfy. Combining the two completeness properties
with the four accuracy properties that we defined in the previous section gives rise
to the eight different failure detectors defined in Figure 3.1. For example, we say
that a failure detector is Eventually Strong if it satisfies strong completeness and
eventual weak accuracy. We denote such a failure detector by ?S
3.1.6 Algoritlims and runs
In this thesis, we focus on algorithms that use unreliable failure detectors. To
describe such algorithms, we only need informal definitions of algorithms and runs.
More formal definitions are given in Chapter 4.
An algorithm A is a collection of n deterministic automata, one for each process
in the system. Computation proceeds in steps of A. In each step, a process
(1) may receive a message that was sent to it, (2) queries its failure detector
module, (3) undergoes a state transition, and (4) may send a message. Since
we model asynchronous systems, messages may experience arbitrary (but finite)
delays. Furthermore, there is no bound on relative process speeds.
19
Accuracy
Completeness			Strong			Weak			Eventual Strong Eventual Weak
Eventually			Eventually
Strong			Perfect FD Strong FD			Perfect FD			Strong FD
8
Eventually
Weak			WeakFD			WeakFD
Q
Figure 3.1: Some failure detector specifications based on accuracy and complete-
ness.
Informally, a run of algorithm A using a failure detector D is a tuple R =
(P, HD, 1, 5, T? where P is a failure pattern, HD C D(F) is a history of failure
detector D for failure pattern F, 1 is an initial configuration of A, 5 is an infinite
sequence of steps of A, and T is a list of increasing time values indicating when
each step in 5 occurred. A run must satisfy certain well-formedness and fairness
properties. In particular, (1) a process cannot take a step after it crashes, (2) when
a process takes a step and queries its failure detector module, it gets the current
value output by its local failure detector module, and (3) every process that is
correct in P takes an infinite number of steps in 5 and eventually receives every
message sent to it.
We use the following notation. Let v be a variable in algorithm A. We denote
by Vp processp's copy of v. The history of v in run I? is denoted by vR, i.e., vR(p,t)
is the value of Vp at time tin run R. We denote by Dp process p's local failure
detector module. Thus, the value of Dp at time tin run R = (F, Hv, 1, 5, T? is
HD(p,t).
20
TD?D)
Algorithm B uses
D' emulated
Figure 3.2: Transforming D into Dt
3.1.7 Reducibility
We now define what it means for an algorithm TD?DI to transform a failure detector
D into another failure detector D'. Algoritlim TD?DI uses D to maintain a variable
0UtPUtp at every process p. This variable, reflected in the local state of p, emulates
the output of D' at p. Algorithm TD??Dl t?nsforrns D into D' if and only if for
every run R = (F, HD, I, S, T? of TD?D! using D, outpu? ?
Given the reduction algorithm TD?D1, anything that can be done using failure
detector D', can be done using D instead. To see this, suppose a given algorithm
B requires failure detector D', but only D is available. We can still execute B as
follows. Concurrently with B, processes run TDHDI to transform D into D'. We
modify Algorithm B at process p as follows: whenever p is required to query its
failure detector module, p reads the current value of OUlPUtp (which is concurrently
maintained by TD?D') instead. This is illustrated in Figure 3.2.
Intuitively, since TD?D! is able to use D to emulate D', D provides at least as
much information about process failures as D' does. Thus, if there is an algorithm
21
TD?Di that transforms D into D', we write D ? D' and say that D' is reducible to
D; we also say that D' is weaker than D. If D ? Dt and D' ? D, we write D N= D'
and say that D and D' are equivalent
Note that, in general, TD?D! need not emulate all the failure detector histories
of D'; what we do require is that all the failure detector histories it emulates be
histories of D'
Consider the "identity" transformation TD?D in which each process p peri-
odically writes the current value output by its local failure detector module into
output The following is immediate from TD?D and the definition of reducibility.
Observation 1: ? ? Q 8 ? W ?? ? ?Q, ?8 ?
3.2 From weak completeness to strong
completeness
In Figure 3.3, we give a reduction algorithm TD?+DI that transforms any given
failure detector D that satisfies weak completeness, into a failure detector D' that
satisfies strong completeness. Furthermore, for each failure detector D defined in
Figure 3.1, TD?Di gives a failure detector D' that has the same accuracy prop-
erty as D. Roughly speaking, Tv?v? strengthens the completeness property while
preserving accuracy.
This result allows us to focus on the failure detectors that are defined in the
first row of Figure 3.1, i.e., those with strong completeness. This is because, TD?DI
(together with Observation 1) shows that every failure detector in the second row
of Figure 3.1 is actually equivalent to the corresponding failure detector above it
in that figure.
Informally, TD??DI works as follows. Every process p periodically sends
(p, suspects?) where suspect% denotes the set of processes that p suspects ac-
cording to its local failure detector module--Hto all the processes. When a process
22
Every process p executes the following:
output? &
cobegin
Task 1: repeat forever
?p quernes its local failure detector module Dp?
suspects? Dp
send (p, suspects?) to all
Task 2: when receive (q, suspectsq) for some q
0UtPUtp			(0UtPUtp U suspectsq) --H
coend
Figure 3.3: TD?D': From Weak Completeness to Strong Completeness
q receives a message of the form (p, suspect?), it adds suspects? to output and
removes p from outputq.
Let R (F, HD, I, S, T? be an arbitrary run of TD?DI using failure detector
D. In the following, the run II and its failure pattern P are fixed. Thus, when
we say that a process crashes we mean that it crashes in F. Similarly, when we
say that a process is correct, we mean that it is correct in F. We will show that
outpu? satisfies the following properties:
Pi : (Transforming weak completeness into strong completeness) Let p be any
process that crashes. If eventually some correct process permanently sus-
pects p in HD, then eventually all correct processes permanently suspect p
in outpu?. More formally:
Vp E crashed(F)
? ?, ?q C correct(F), Vt' > t : p E H?(q, t')
?t C T,Vq ? correct(F),Vt' > t : p E outpu?(q,t')
P2 : (Preserving perpetual aecuracg) Let p be any process. If no process suspects
23
p in HD before time t, then no process suspects p in outpu? before time t.
More formally:
Vp ? ll,Vt ?
Vt' ? t, Vq E II --H F(t') : p ? H?(q, t')
Vt' ? t, Vq c II --H F(t') : p ? outpu?(q, t!)
P3 : (Preserving eventual accuracy) Let p be any correct process, If there is a
time after which no correct process suspects p in HD, then there is a time
after which no correct process suspects p in outpu?. More formally:
Vp ? correct(F)
E ?,Vq ? correct(F),Vt' > t : p ?
? T,Vq c correct(F),Vt' > t : p ? outpu?(q,t')
Lemma 2: TD?DI satisfies Pi.
PROOF: Let p be any process that crashes. Suppose that there is a time t after
which some correct process q permanently suspects p in HD. We must show that
there is a time after which every correct process suspects p in output?
Since p crashes, there is a time t' after which no process receives a message from
p. Consider the execution of Task 1 by process q after time tp max(t, t'). Process
q sends a message of the type (q, suspectsq) with p ? suspects to all processes.
Eventually, every correct process receives (q, suspectsq) and adds p to output (see
Task 2). Since no correct process receives any messages from p after time t' and
tp > tt, no correct process removes p from output after time tp. Thus, there is a
time after which every correct process permanently suspects p in outpu?.
Lemma 3: TD??DI satisfies P2.
PROOF: Let p be any process. Suppose that there is a time t before which no
24
process suspects p in HD. No process sends a message of the type (--H, suspects)
with p ? suspects before time t. Thus, no process q adds p to outputq before time
Lemma 4: TD?DI satisfies P3.
PROOF: Let p be any correct process. Suppose that there is a time t after which
no correct process suspects p in HD. Thus, all processes that suspect p after time
t eventually crash. Thus, there is a time t' after which no correct process receives
a message of the type (--H, suspects) with p ? suspects.
Let q be any correct process. We must show that there is a time after which q
does not suspect p in output?.
Consider the execution of Task 1 by process p after time t'. Process p sends
a message m = (p, suspects,) to q. When q receives m, it removes p from outputq
(see Task 2). Since q does not receive any messages of the type (--H, suspects) with
p ? suspects after time tt, q does not add p to outputq after time t!. Thus, there is
a time after which q does not suspect p in outpu?.
Theorem 5: TD?Th transforms Q into ?, 14' into 8, ?Q into ??, and ?W into
PROOF: By Lemma 2, TD?Dl transforms Q, 14', ?Q, and ?W, into failure detec-
tors that satisfy strong completeness. By Lemma 3, TD??Dt preserves the strong
accuracy of Q and the weak accuracy of 14'. By Lemma 4, TD?DI preserves the
eventual strong accuracy of ? Q and the eventual weak accuracy of ?14'.			The
theorem immediately follows.			Ei
By Theorem 5 and Observation 1, we have:
Corollary 6: ? N= Q, 5 N 14', ?? ?Q, and ?S N= ?14'
25
Every process p executes the following:
To execute R-broadcast(m):
send m to all (including p)
R-deliver(m) occurs as follows:
when receive m for the first time
if sende?(m) # p then send m to all
R-dei?ver(m)
Figure 3.4: Reliable Broadcast by message diffusion
3.3 Reliable Broadcast
We now define Reliable Broadcast, a communication primitive that we often use
in our algorithms. Informally, Reliable Broadcast guarantees that (1) all correct
processes deliver the same set of messages, (2) all messages broadcast by correct
processes are delivered, and (3) no spurious messages are ever delivered. Formally,
Reliable Broadcast is defined in terms of two primitives, R-bwadcast(m) and R-
deiiver(m) where m is a message drawn from a set of possible messages. When
a process executes R-broadcast(m), we say that it R-broadcasts m, and when a
process executes R-deiiver(m), we say that it R-deiivers m. Reliable Broadcast
satisfies the following three properties:3
Validity: If a correct process R-broadcasts a message ni, then all correct processes
eventually R-deliver m.
Agreement: If a correct process R-delivers a message ni, then all correct processes
eventually R-deliver ?n.
Uniform integrity: For any message m, each process R-delivers m at most once,
3For simplicity, we assume that each message is unique. In practice, this can be achieved by
tagging the identity of the sender and a sequence number on each message.
26
and only if m was R-broadcast by some process.
In Figure 3.4, we give a simple Reliable Broadcast algoritlim for asynchronous sys-
tems. Informally, when a process receives a message for the first time, it relays the
message to all processes and then R-delivers it. It is easy to show that this algo-
rithm satisfies validity, agreement and uniform integrity in asynchronous systems
with up to n --H 1 crash failures. The proof is obvious and therefore omitted.
3.4 The Consensus problem
In the Consensus problem, all correct processes propose a value and must reach
a unanimous and irrevocable decision on some value that is related to the pro-
posed values [Fis83]. We define the Consensus problem in terms of two primitives,
propose(v) and decide(v), where v is a value drawn from a set of possible proposed
values. When a process executes pr'opos?(v), we say that it proposes v; similarly,
when a process executes deeide(v), we say that it decz?des v. The Consensus prob-
lem is specified as follows:
Termination: Every correct process eventually decides some value.
Uniform validity: If a process decides v, then v was proposed by some process.4
Uniform integrity: Every process decides at most once.
Agreement: No two correct processes decide differently.
It has been proved that there is no deterministic algorithm for Consensus in asyn-
chronous systems that are subject to even a single crash failure [FLP85,DDS87J.
We now show how to use unreliable failure detectors to solve Consensus in asyn-
chronous systems.
4The validity condition captures the relation between the decision value and the proposed
values, Changing this condition results in other types of Consensus [Fis83].
27
3.5 Solving Consensus using unreliable failure
detectors
We now show how to solve Consensus using each one of the eight failure detectors
defined in Figure 3.1. By Theorem 5, we only need to show how to solve Consensus
using the four failure detectors that satisfy strong completeness, namely, ?, 8, ??,
and ?8.
Solving Consensus with the Perfect Failure Detector ? is simple, and is left
as an exercise for the reader. In Section 3.5.1, we give a Consensus algorithm
that uses 8. In Section 3.5.2, we give a Consensus algorithm that uses ?8. Since
? ?8, this algorithm also solves Consensus with ??.
The Consensus algorithm that uses 8 can tolerate any number of failures. In
contrast, the one that uses ?8 requires a majority of correct processes. We show
that this requirement is necessary even if one uses ??, a failure detector that is
stronger than ?8. Thus, our algorithm for solving Consensus using ?8 (or ??)
is optimal with respect to the number of failures that it tolerates.
3.5.1 Using a Strong Failure Detector 8
Given any Strong Failure Detector 8, the algorithm in Figure 3.5 solves Consensus
in asynchronous systems. This algorithm runs through 3 phases. In Phase 1,
processes execute n --H 1 asynchronous rounds (? denotes the current round number
of process p) during which they broadcast and relay their proposed values. Each
process p waits until it receives a round r message from every process that is not
in 8?, before proceeding to round ? + 1. Note that it is possible that while p is
waiting for a message from q in round ?, q is added to 8?. By the above rule, p
stops waiting for q's message and proceeds to round r + 1.
By the end of Phase 2, correct processes agree on a vector based on the proposed
values of all processes. The ith element of this vector either contains the proposed
value of process p? or I. We will show that this vector contains the proposed value
28
of at least one process. In Phase 3, correct processes decide the first non-trivial
component of this vector.
Let fdenote the maximum number of processes that may crash.5 Phase 1 of the
algorithm consists of n--H 1 rounds, rather than the usual f+ 1 rounds of traditional
Consensus algorithms (for synchronous systems). Intuitively, this is because even a
correctprocess p may be suspected to have crashed by other processes. In this case,
p's messages may be ignored, and p appears to commit "send-omission" failures.
Thus, up to n --H 1 processes may appear to commit such failures (rather than f).
Note that because 8 satisfies weak accuracy (namely, some correct process is never
suspected), the maximum number of processes that may fail or appear to fail is
n --H 1 rather than n.
N%[q] denotes p's current estimate of q's proposed value. ??[q] Vq at the end
of round r if and only if p receives Vq, the value proposed by q, for the first time
in round r.
Let R (F, Hs, I, S, T? be any run of the algorithm in Figure 3.5 using 8 in
which all correct processes propose a value. We have to show that termination,
uniform validity, agreement and uniform integrity hold.
Lemma 8: For all p and q, and in all phases, Vp[q] is either ZJq or ffi.
PROOF: Obvious from the algorithm in Figure 3.5.			E]
Lemma 9: Every correct process eventually reaches Phase 3.
PROOF: [sketch] The only way a correct process p can be prevented from reaching
Phase 3 is by blocking forever at one of the two wait statements (in Phase 1 and
2, respectively). This can happen only if p is waiting forever for a message from a
process q and q never joins 8?. There are two cases to consider:
1. q crashes. Since 8 satisfies strong completeness, there is a time after which
5In the literature, t is often used instead of f, the notation adopted here. In this thesis, we
reserve t to denote real-time.
29
Every process p executes the following:
procedure p?opose(v?)
Vp[p]HVp
& Vp
fp `s estzmate of the proposed va1ues?
Phase 1: fasynchronous rounds r?, 1 ? r? ? n --H
for ?? & 1 to n --H 1
send (r?,A?,p) to all
wait until [Vq: received (r?, Aq, q) or q C Sp?
? Query the failure detector?
ms9sp[rp]			?(rp,?q,q) received (rp,Aq,q)?
for k			1 to n
if Vp[k] = I and ?(rp,Aq,q) E msgs?[r?] with Aq[kl # I then
Vp[k] Aq[k]
&
Phase 2:
send Vp to all
wait until [Vq: received ? or q ?			f Query the failure detectorj
lastmsgs?			?Vq I received Vq?
for k			1 to n
if BVq C lastmsgs? with Vq[k] = I then Vp[kl
Phase 3: decide( first non-I component of VP)
Figure 3.5: Solving Consensus using 8.
30
q ?
2. q does not crash. In this case, we can show (by an easy but tedious induction
on the round number) that q eventuatly sends the message p is waiting for.
In both cases p is not blocked forever and reaches Phase 3.
Since S satisfies weak accuracy there is a correct process c that is never suspected
by any process, i.e., Vt ? T,Vp ? II --H F(t) : c ? Hs(p,t). Let lli denote the set
of processes that complete all n --H 1 rounds of Phase 1, and H2 denote the set of
processes that complete Phase 2. We say Vp < vq if and only if for all k ? II, VP[k]
is either Vq[k] or i.
Lemi?a 10: In every round ? 1 ? r ? n --H 1 all processes p ? fi1 receive (r, Ac, c)
from process c, i.e., (r, Ac, c) is in rnsgsp[?
PROOF: Since p c 111, p completes all n --H 1 rounds of Phase 1. At each round r,
since c ? S?, ? waits for and receives the message (r, Ac, c) from c.
Lemma 11: For allp? ll?, Vc S VP at the end of Phase 1.
PROOF: Suppose for some process q, V?[q] $ ffi at the end of Phase 1. From
Lemma 8, V?[q] = Vq. Consider any p c ?i We must show that ?[q] = Vq at the
end of Phase 1. This is obvious if p = c, thus we consider the case where p $ c.
Let r be the first round in which c received Vq (if c = q, we define ? to be 0).
From the algorithm, it is clear that A?[q] = Vq at the end of round r. There are
two cases to consider:
1. ? _ n --H 2. In round r + 1 ? n --H 1, c relays Vq by sending the message
+ 1, Ac, c) with A?[q1 = Vq to all. From Lemma 10, p receives (? + 1, Ac, c)
in round r + 1. From the algorithm, it is clear that p sets ?[q] to Vq by the
end of round r + 1.
31
2. r = n --H 1. In this case, c received Vq for the first time in round n --H 1.
Since each process relays Vq (in its vector A) at most once, it is easy to see
that Vq was relayed by all n --H 1 processes in 11 --H ?c?, including p, before
being received by c. Since p sets Vp[qj = Vq before relaying Vq, it follows that
Vp[q] = Vq at the end of Phase 1.
Lemma 12: For all p E 112, Vc = Vp at the end of Phase 2.
PROOF: Consider any p ? 112 and q E II. We have to show that Vp[q] = V?[q] at
the end of Phase 2. There are two cases to consider:
1. V?[q] = Vq at the end of Phase 1. From Lemma 11, for all processes p' ? Hi
(including p and c), Vpi[qj = Vq at the end of Phase 1. Thus, for all the
vectors V sent in Phase 2, V[q] = Vq. Hence, both Vp[q] and V,[q] remain
equal to Vq throughout Phase 2.
2. V?[q? = i at the end of Phase 1, Since C ? 8?, p waits for and receives Vc in
Phase 2. Since V,[q] = L, p sets Vp[q] H I at the end of Phase 2.
Lemma 13: For all p c 112, Vp[c] = Vc at the end of Phase 2.
PROOF: It is clear from the algorithm that Vc[c] = Vc at the end of Phase 1. From
Lemma 11, for all q ? Hi, ?)i[c] = Ve at the end of Phase 1. Thus, no process sends
V with V[c] = I in Phase 2. From the algorithm, it is clear that for all p ? 112,
VpH = Vc at the end of Phase 2.
Theorern 14: Given any Strong Failure Detector 8, the algorithm in Figure 3.5
solves Consensus in asynchronous systems with f < n.
PROOF: From the algorithm in Figure 3.5, it is clear that no process decides more
than once, and this satisfies the uniform integrity requirement of Consensus. From
Lemma 9, every correct process eventually reaches Phase 3. From Lemma 13, the
vector vi) of every correct has at least one non-I component in Phase 3 (namely,
32
vp[cl = Vc). From the algorithm, every process p that reaches Phase 3, decides on
the first non-I component of ?),. Thus, every correct process decides some non-I
value in Phase 3--Hand this satisfies termination of Consensus. From Lemma 12, all
processes that reach Phase 3 have the same vector V. Thus, all correct processes
decide the same value, and agreement of Consensus is satisfied. From Lemma 8,
this non-I decision value is the proposed value of some process. Thus, uniform
validity of Consensus is also satisfied.
By Theorems 5 and 14, we have:
Corollary 15: Given any Weak Failure Detector W, Consensus is solvable in
asynchronous systems with f <n.
3.5.2 Using an Eventually Strong Failure Detector ?8
Our previous solution to Consensus used 8, a failure detector that satisfies weak
accuracy: at least one correct process was never suspected. We now solve Consen-
sus using ?S, a failure detector that only satisfies eventual weak accuracy. With
?S, all processes may be erroneously added to the lists of suspects at one time or
another. However, there is a correct process and a time after which that process is
not suspected to have crashed. Note that at any given time t, processes cannot use
?S to determine whether any specific process is correct, or whether some correct
process will never be suspected after time t.
Given any Eventually Strong Failure Detector ?S, the algorithm in Figure 3.6
solves Consensus in asynchronous systems with a majority of correct processes. We
show that solving Consensus using ?S requires this majority.6 Thus, our algorithm
is optimal with respect to the number of failures that it tolerates.
The algorithm in Figure 3.6 uses the rntating coordinator paradigm [R?ei82,
6In fact, we show that a majority of correct processes is required even if one uses ??, a
stronger failure detector.
33
CM84, DLS88, BGP89, CT90j. Computation proceeds in asynchronous "rounds"
We assume that all processes have a priori knowledge that during round r, the
coordinator is process c = (r mod n) + 1. All messages are either to or from the
current" coordinator. Every time a process becomes a coordinator, it tries to
determine a consistent decision value. If the current coordinator is correct and is
not suspected by any surviving process, then it will succeed, and it will R-broadcast
this decision value.
The algorithm in Figure 3.6 goes through three asynchronous epochs, each of
which may span several asynchronous rounds. In the first epoch, several decision
values are possible. In the second epoch, a value gets locked: no other decision
value is possible. In the third epoch, processes decide the locked value.
Each round of this Consensus algorithin is divided into four asynchronous
phases. In Phase 1, every process sends its current estimate of the decision value
timestamped with the round number in which it adopted this estimate, to the
current coordinator, c. In Phase 2, c gathers n --H f such estimates, selects one
with the largest timestamp, and sends it to all the processes as their new estimate,
estimate?. In Phase 3, for each process p there are two possibilities:
1. p receives estimate? from c and sends an ack to c to indicate that it adopted
estimate? as its own estimate; or
2. upon consulting its failure detector module ?S?, p suspects that c crashed,
and sends a nack to c.
In Phase 4, c waits for n --H f replies (acks or nacks). If all n --H f replies are acks,
then c knows that n --H f processes changed their estimates to est?mate?, and thus
estimate? is locked. Consequently, c R-broadcasts a request to decide estimate?.
At any time, if a process R?-delivers such a request, it decides accordingly.
The proof that the algorithm in Figure 3.6 solves Consensus is as follows. Let
R be any run of the algorithm in Figure 3.6 using ?S in which all correct processes
34
Every process p executes the following:
procedure propose(v?)
estimate? H Vp
state?			undecided
H 0
tsp			0
fdenotes p,s estimate of the decision value)
(r? denotes the current round number)
fthe round in which			estimate? was last updated, initially 0)
fRotate through coordinators until decision is reached)
while state? = undecided
r? + 1
(r? mod n) + 1			fc? is the current coordinator)
Phase 1: fAll processes p send estimate? to the current coordinator)
send (p,r?,estimate?,ts?) to
Phase 2: f The current coordinator gathers n --H f estimates and proposes a new estimate)
ifp=c? then
wait until [for n --H f processes q : received (q, r?, estimateq, tsq) from q]
msgs?fr?] & f(q,rp,estimateq, tsq) p received (q,rp,estimateq, tsq) from q)
t largest tsq such that (q, r?, estimateq, tsq) c msgs?fr?J
estimate? select one estimateq such that (q,rp,estimateq,t) ? msgs?fr?]
send (p, r?, estimate?) to all
Phase 3: fAll processes wait for the new estimate proposed by the current coordinato4
wait until [received (c?, r?, estimate0?) from cp or c? ? ?Sp]f Query the failure detector)
if [received (c?, r?, estimate?,,) from c?] then			fp received estimate?? from c?)
estimate?			estimate??
tsp
send (p,r?,ack) to
else send (p, r?, nack) to
fp suspects that cp crashed)
Phase 4: The current coordinator waits for n --H f replies. If these replies indicate that
n--Hf processes adopted its estimate, the coordinator sends a request to decide.
if p = c? then
wait until [for n --H f processes q : received (q,r?,ack) or (q,r?,nack)]
if [for n --H f processes q : received (q, r?, ack)) then
R-broadcast(p, r?,estimate?, decide)
f When p receives a decide message, it decides)
when R?dehver(q, r?,estimate q, decide)
if state? = undecided then
decide (estimateq)
state? decided
Ei?ure 3.6: Solvina Consensus usina ?S
35
propose a value. We have to show that termination, uniform validity, agreement
and uniform integrity hold.
Lemma 17: No two processes decide differently.7
PROOF: If no process ever decides, the leinma is trivially true. If any process
decides, it must be the case that a coordinator R-broadcast a message of the type
--H, --H, decide). This coordinator must have received n --H f messages of the
type (--H, --H, ack) in Phase 4. Let ? be the smallest round number in which n --H f
messages of the type (--H, m ack) are sent to a coordinator in Phase 3. Let c denote
the coordinator of round r, i.e., c = (r mod n) + 1. Let estirnate? denote c's
estimate at the end of Phase 2 of round r. We claim that for all rounds r' > r if a
coordinator c' sends e8tirnate?i in Phase 2 of round ?`, then estimatec; = estimate?.
The proof is by induction on the round number. The claim trivially holds
for rt = r. Now assume that the claim holds for all ?`, r < < k. Let c?
be the coordinator of round k, i.e., c? = (k mod n) + 1. We will show that the
clahu holds for r' = k, i.e., if c? sends ?Stimatec? in Phase 2 of round k, then
?8timatec? = estimate?.
From the algorithm it is clear that if c? sends ?8timatec? in Phase 2 of round
k then it must have received estimates from at least n--Hf processes. Since f ?
there is some process p such that p sent a (p,r,ack) message to c in Phase 3 of
round ? and such that (p,k,estimate?,t8p) is in ?S9Sc? [k] in Phase 2 of round k.
Since p sent (p,mack) to c in Phase 3 of round ?, tsp = ? at the end of Phase 3 of
round r. Since tsp is non-decreasing, tsp > r in Phase 1 of round k. Thus in Phase
2 of round k, (p,k,estimate?,tsp) is in ?S9Sc? [kj with tsp > r. It is easy to see
that there is no message (q,k,estimateq,tsq) in mSgSc?[kj for which tsq > k. Let
t be the largest t8q such that (q,k,estimateq,tsq) is in rnS9Sck[k]. Thus ? _ t ? k.
In Phase 2 of round k, ck executes ?St?matQ? estimateq where
7This property, called uniform a9reement, is stronger than the agreement requirement of
Consensus which applies only to correct processes.
36
(q,k,estimateq,t) is in ?S9$c? [kj. From Figure 3.6, it is clear that q adopted
estimateq as its estimate in Phase 3 of round t. Thus, the coordinator of round t
sent estimateq to q in Phase 2 of round t. Since r < ? ? k, by the induction hy-
pothesis, estimateq = estirnate?. Thus, ek sets eStirnatec? estimate? in Phase
2 of round k. This concludes the proof of the claim,
We now show that if a process decides a value, then it decides estimate?.
Suppose that some process p R-delivers (q, r'q,estimateq,decide), and thus decides
estimateq. Process q must have R-broadcast (q, Tq, estimateq,decide) in Phase 4
of round ?q' From Figure 3.6, q must have received n --H f messages of the type
(--H,rq,ack) in Phase 4 of round rq. By the definition ofr, r < r'q. From the above
claim, estimateq = estimate?.			[z
Lemma 18: Every correct process eventually decides some value.
PROOF: There are two possible cases:
1. Some correct process decides. It must have R-delivered some message of the
type (--H, --H,--H,decide). By the agreement property of Reliable Broadcast, all
correct processes eventually R-dehver this message and decide.
2. No correct process decides. We claim that no correct process remains blocked
forever at one of the wait statements. The proof is by contradiction. Let r'
be the smallest round number in which some correct process blocks forever
at one of the wait statements. Thus, all correct processes reach the end of
Phase 1 of round r: they all send a message of the type (--H, r, estima?e,--H)
to the current coordinator c= (? mod n) + 1. Therefore at least n --H f such
messages are sent to c. There are two cases to consider:
(a) Eventually, c receives those messages and replies by sending
r,estimate?). Thus, c does not block forever at the wait statement
in Phase 2.
c crashes.
37
In the first case, every correct process receives (c, r, estimate?). In the second
case, since ?8 satisfies stron9 completeness, for every correct process p there
is a time after which c is permanently suspected by p, i.e., c ? ?S?. Thus in
either case, no correct process blocks at the second wait statement (Phase 3).
So every correct process sends a message of the type (--H, ?, ack) or (--H, ?, nack)
to c in Phase 3. Since there are n --H f correct processes, c cannot block at the
wait statement of Phase 4. This shows that all correct processes complete
round ? a contradiction that completes the proof of our claim.
Since ?8 satisfies eventual weak accuracy, there is a correct process q and a
time t such that no correct process suspects q after t. Thus, all processes that
suspect q after time t eventually crash and there is a time t' after which no
process sends a message of the type (--H, r, nack) where q is the coordinator
of round r (i.e., q = (r mod n) t 1). From this and the above claim, there
must be a round r such that:
(a) All correct processes reach round r after time t' (when no process sus-
pects q).
q is the coordinator of round r (i.e., q = (r mod n) + 1).
In Phase 1 of round r, all correct processes send their estimates to q. In
Phase 2, q receives n --H f such estimates, and sends (q, r, estimateq) to all
processes. In Phase 3, since q is not suspected by any correct process after
time t, every correct process waits for q's estimate, eventually receives it,
and replies with an ack to q. Furthermore, no process sends a nack to
q (that can only happen when a process suspects q). Thus in Phase 4, q
receives n --H f messages of the type (--H, ?, ack) (and no messages of the type
r, nack)), and q R-broadcasts (q, r, estimateq, decide). By the validity
property of Reliable Broadcast, eventually all correct processes R-deliver q's
message and decide a contradiction. Thus case 2 is impossible, and this
38
concludes the proof of the lemma.
[z
Theorem 19: Given any Eventually Strong Failure Detector ?8, the algorithm
in Figure 3.6 solves Consensus in asynchronous systems with f <
PROOF:
Termination: by Lemma 18.
Agreement: by Lemma 17
Uniform integrity: It is clear from the algorithm that no process decides more than
once,
Uniform validity: from the algoritlim, it is clear that all the estimates that a
coordinator receives in Phase 2 are proposed values. Therefore, the decision
value that a coordinator selects from these estimates must be the value
proposed by some process. Thus, uniform validity is satisfied. [z
By Theorems 5 and 19, we have:
Corollary 20: Given any Eventually Weak Failure Detector ?W, Consensus is
solvable in asynchronous systems with f ?
Thus, the weakest failure detector considered in this thesis, ?W, is sufficient to
solve Consensus in asynchronous systems. This leads to the following question:
What is the weakest failure detector for solving Consensus? Using the concept of
reducibility, in Chapter 4 we show that ?W is indeed the weakest failure detector
for solving Consensus in asynchronous systems with a majority of correct processes.
More precisely, we show:
Theorem 21: If a failure detector D can be used to solve Consensus in an asyn-
chronous system, then D ? ?)/V in that system.
39
By Corollary 20 and Theorem 21, we have:
Corollary 22: ?)4? is the weakest failure detector for solving Consensus in an
asynchronous system with f <
3.5.3 A lower bound on fault-tolerance
In Section 3.5.1, we showed that failure detectors with perpetual accuracy (i.e.,
?, Q, 8, or W) can be used to solve Consensus in asynchronous systems with
an? number of failures. In contrast, with failure detectors with eventual accuracy
(i.e., ??, ?Q, ?8, or ?VV), our Consensus algorithms required a majonty of the
processes to be correct. We now show that this requirement is necessary: Any
algorithm that uses ?? (the strongest of our four failure detectors with eventual
accuracy) to solve Consensus requires a majority of correct processes. Thus, the
algorithm in Figure 3.6 is optimal with respect to fault-tolerance.
Theorem 23: There is an Eventually Perfect Failure Detector ?? such that there
is no algorithm A which solves Consensus using ?? in asynchronous systems with
f>L?21
PROOF: We now describe the behaviour of an Eventually Perfect Failure Detector
?? such that with every algorithm A, there is a run RA of A using ?? that does
not satisfy the specification of Consensus. Partition the processes into two sets 11o
and Hi such that 11o contains [?n2? processes, and Hi contains the remaining [?n21
processes. Consider any Consensus algorithin A, and the following two runs of A
using ?P:
o+ Run Ro = (F0, H0, I, s0, T0?: All processes in H0 propose 0, and all processes
in Hi propose 1. All processes in H0 are correct in F0, while those in Hi crash
in F0 at the beginning of the run, i.e., Vt ? T : F0(t) = Hi (this is possible
since f > [?n21) Every process in 11o permanently suspects every process in
40
Hi, i.e., Vt c T, Vp c 11o : H0(p,t) = Hi. In this run, it is clear that ??
satisfies the specification of an Eventually Perfect Failure Detector.
o+ Run 1?1 = (P1, H1, I, S'i, Ti?: As in 1?0, all processes in H0 propose 0, and all
processes in Hi propose 1. All processes in Hi are correct in F1, while those
in H0 crash in Fi at the beginning of the run, i.e., Vt ? ?: Fi(t) = H0. Every
process in Hi permanently suspects every process in H0, i.e., Vt ? T, Vp ?
Hi : Hi(p, t) = H0. Clearly, ?? satisfies the specification of an Eventually
Perfect Failure Detector in this run.
Assume, without loss of generality, that both 1?0 and 1?i satisfy the specifica-
tions of Consensus. Let qo c H0, qi ? Hi, to be the time at which qo decides in 1?0,
and ti be the time at which qi decides in Ri. There are three possible cases in
each case we construct a run 1?A = (PA, HA, 1A, 5A TA? of algorithm A using ??
such that ?? satisfies the specification of an Eventually Perfect Failure Detector,
but 1?A violates the specification of Consensus.
1. In 1?0, qo decides 1. Let 1?A = (P0, H0, 1A, ?0, T0? be a run identical to 1?0
except that all processes in H1 propose 0. Since in F0 the processes in H1
crash right from the beginning of the run, 1?o and 1?A are indistinguishable to
qo. Thus, qo decides 1 in 1?A (as it did in 1?o), thereby violating the uniform
validity condition of Consensus.
2. In 1?i, qi decides 0. This case is symmetric to Case 1.
3. In 1?o, qo decides 0, and in Ri, qi decides 1. Construct 1?A = (PA, HA, I, 5A, TA?
as follows. No processes crash in PA, i.e., Vt C T : FA(t) = Q). As before, all
processes in H0 propose 0 and all processes in H1 propose 1. All messages
from processes in H0 to those in Hi and vice-versa, are delayed until time
max(t0, ti). Until time max(t0, ti), every process in H0 suspects every pro-
cess in Hi, and every process in Hi suspects every process in H0. After time
max(t0, ti), no process suspects any other process, i.e.:
41
Vt < max(t0,t1)
Vp ? Ho: HA(p,t)			Hi
V]) ? Hi : HA(p,t)			11o
Vt> ma4t0,?),Vp E II: HA(p,t)
Clearly, ?? satisfies the specification of an Eventually Perfect Failure De-
tector.
Until time max(t0, ti), 1?A is indistinguishable from 1?0 for processes in 11o
and 1?A is indistinguishable from 1?1 for processes in Hi. Thus in run 1?A,
qo decides 0 at time to, while qi decides 1 at time ti. So qo and qi decide
differently in 1?A, and this violates the agreement condition of Consensus. ?
In the Appendix, we refine the result of Theorem 23, by considering an infinite
hierarchy of failure detectors ordered by the number of mistakes they can make,
and showing exactly where in this hierarchy the majority requirement becomes
necessary for solving Consensus (this hierarchy contains all eight failure detectors
that we defined in Figure 3.1). Note that Theorem 23 is also acorollary of Theorem
4.3 in [DLS88] together with Theorem 66.
3.6 On Atomic Broadcast
We now consider Atomic Broadcast, another fundamental probleni in fault tol-
erant distributed computing, and show that our results on Consensus also apply
to Atomic Broadcast. Informally, Atomic Broadcast requires that all correct pro-
cesses deliver the same messages in the same order. Formally, Atomic Broadcast
is a Reliable Broadcast that satisfies:
o+ Total order: If two correct processes p and q deliver two messages ni and m',
then p delivers m before m' if and only if q delivers m before ri'
42
Total order and agreement ensure that all correct processes deliver the same se-
quence of messages. Atomic Broadcast is a powerful communication paradigm
for fault-tolerant distributed computing [CM84, CASD85, BJ87, PGM89, BGT9O,
GSTC90,Sch9O]. We now show that Consensus and Atomic Broadcast are equiva-
lent in asynchronous systems with crash failures. This is shown by reducing each
to the other.8 In other words, a solution for one automatically yields a solution for
the other. Both reductions apply to any asynchronous system (in particular, they
do not require the assumption of a failure detector). This equivalence has impor-
tant consequences regarding the solvability of Atomic Broadcast in asunchronous
systems:
1. Atomic Broadcast cannot be solved with a deterministic algorithm in asyn-
chronous systems, even if we assume that at most one process may fail, and
it can only fail by crashing. This is because Consensus has no deterministic
solution in such systems [F'LP85].
2. Atomic Broadcast can be solved using randomization or unreliable failure
detectors in asynchronous systems. This is because Consensus is solvable
with these techniques in such systems (for a survey of randomized Consensus
algorithms, see [CD89]).
Consensus can be easily reduced to Atomic Broadcast as follows. To propose a
value, a process atomically broadcasts it. To decide a value, a process picks the
value of the first message that it atomically delivers.9 By total order of Atomic
Broadcast, all correct processes deliver the same first message. Hence they choose
the same value and agreement of Consensus is satisfied. The other properties of
Consensus are also easy to verify. In the next section, we reduce Atomic Broadcast
to Consensus.
8They are actually equivalent even in asynchronous systems with arbitrary failures. However,
the reduction is more complex and is omitted here.
9Note that this reduction does not require the assumption of a failure detector.
43
3.6.1 Reducing Atomic Broadcast to Consensus
In Figure 3.7, we show how to transform any Consensus algorithm into an Atomic
Broadcast algorithm in asynchronous systems. The resulting Atomic Broadcast
algorithm tolerates as many faulty processes as the given Consensus algoritlim.
The reduction uses Rehable Broadcast, and repeated (possibly concurrent, but
completely independent) executions of Consensus. Processes disambiguate between
these executions by tagging all the messages pertaining to the kth execution of Con-
sensus with the number k. Tagging each message with such a number constitutes
a minor modification to any given Consensus algorithm. Informally, the kth execu-
tion of Consensus is used to decide on the kth batch of messages to be atomically
delivered. The propose and decide primitives corresponding to the kth execution
of Consensus are denoted by prvpose(k, --H) and decide(k, --H).
Our Atomic Broadcast algorithm uses the R-brnadcast(m) and R-deliver(m)
primitives of Reliable Broadcast. To avoid possible ambiguities between Atomic
Broadcast and Reliable Broadcast, we say that a process A-broadcasts or A-delivers
to refer to a broadcast or a delivery associated with Atomic Broadcast; and R-
broadcasts or R-delivers to refer to a broadcast or delivery associated with Reliable
Broadcast.
When a process intends to A-broadcast a message m, it R-broadcasts m (in
Task 1). When a process p R-delivers m, it adds m to the set R?delivered? (Task
2). Thus, R?delivered? contains all the messages submitted for Atomic Broadcast
(since the beginning) that p is currently aware of. When p A-delivers a message m,
it adds ni to the set A?delivered? (in Task 3). Thus, R-delivered? --H A?delivered?
is the set of messages that were submitted for Atomic Broadcast but not yet A-
delivered by p. This set is denoted by A?undelivered?. When A?vndelivered?
is not empty, p proposes A?undelivered? as the next batch of messages to be A-
delivered. batch?(k) denotes the kth batch of messages that p A-delivers: it is
msgSetp, the set of messages agreed upon by the kth execution of Consensus,
44
Every process p executes the following:
Initialization:
R?(iel?vered
A?delive?ed &
k&O
To execute A-bwadcast(m):
R-bwadcast(m)
f Task 1 1
A-deliver(m) occurs as follows:
when R-deliver(m)
R?delivered			J??delzve?ed u trnl
when 1??delivered --H A?delivered ?
k&k+1
A?unde1ivered			R-delive?ed --H A?delive?ed
propose(k, A?undelivered)
wait until decide(k, ms?Set)
batch(k)			msgSet --H A?delivered
atomically deliver all messages in batch(k) in some deterministic order
A?delive?ed			A?delivercd u batch(k)
Figure 3.7: Using Consensus to solve Atomic Broadcast
Task 2 1
f Task 3 1
minus A?delzvered?, those messages that p has already A-delivered.'0
Lemma 24: For any two correct processes p and q, and any message ni, if rn
R-delivered? then eventually m Ei J?Aeliveredq.
PROOF: If m Ei R-delivered? then p R-delivered m (in Task 2). Since p is correct,
by agreement of Reliable Broadcast q eventually R-delivers m, and inserts m into
R?delive?'edq.			Li
10It is possible for a process p to A-deliver a message m before it R-delivers m. This occurs if
rn was proposed by another process, and agreed upon by Consensus, before p R-delivers rn.
45
Lemma 25: For any two correct processes p and q, and all k> 1:
1. If p executes p?opose(k, --H), then q eventually executes propose(k, --H).
2. Ifp A-delivers messages in ba?ch?(k), then q eventually A-delivers messages
in batchq(k), and batcIi?(k) = batchq(k).
PR?OOF?: The proof is by simultaneous induction on (1) and (2). For k = 1, we
first show that if p executes propose(1, --H), then q eventually executes propose(1, --H).
When p executes proposc(1, --H), R-de1ive?ed? must contain some message m. By
Lemma 24, m is eventually in 1??del?veredq. Since A?dc1z'vercdq is initially empty,
eventually 1?Aeliveredq --H A?de1zveredq # ?. Thus, q eventually executes Task 3
and p?opose(1, --H).
We now show that if p A-delivers messages in batch?(1), then q eventually A-
delivers messages in batchq(1), and batch?(1) = batchq(1). From the algorithm,
if p A-delivers messages in batch?(1), it previously executed propose(1, --H). From
part (1) of the lemma, all correct processes eventually execute proposc(1, --H). By
termination and uniform integrity of Consensus, every correct process eventually
executes decide(1, --H) and it does so exactly once. By agreement of Consensus,
all correct processes eventually execute deczde(1, msgSe?) with the same ms?Set.
Since A?de1ivered? and A?de1?ve?edq are initially empty, batch?(1) = batchq(1) =
msgSetp = msgSetq.
Now assume that the lemma holds for all k 1 < k ? 1. We first show that
if p executes p?opose(l, --H), then q eventually executes p?opose(l, --H). When p ex-
ecutes pwpose(1, --H), 1??de1ivcred? must contain some message ni that is not in
A?de1ivered?. Thus, m is not in u1FW1 batch?(k). By the induction hypothesis,
baich?(k) = batchq(k) for all 1 < k < 1 --H 1. So ni is not in Uk14i batchq(k).
Since m is in R?de1ive?ed?, by Lemma 24, ni is eventually in J??de1ive?edq. Thus,
there is a time after q A-delivers balchq(1 --H 1) such that there is a message in
R?de1iveredq --H A?de1ive?edq. So q eventually executes Task 3 and propose(1, --H).
46
We now show that if p A-delivers messages ill batch?(1), then q A-delivers
messages in batchq(1), and batch?(1) batchq(1). Since p A-delivers messages in
batch?(1), it must have executed propose(1, --H). By part (1) of this lemma, all
correct processes eventually execute propose(1, --H). By termination and uniform in-
tegrity of Consensus, every correct process eventually executes decide(1, --H) and it
does so exactly once. By agreement of Consensus, all correct processes eventually
execute decide(1, msgset) with the same msgSe?. Note that batch?(1) = msgSet --H
U1kffik batch?(k), and batchq(1) = rnsgSet --H U1kffi\ &atchq(k). By the induction hy-
pothesis, batch?(k) = batchq(k) for all 1 < k < 1 --H 1. Thus, batch?(1) = batchq(1).
Lemma 26: The algorithm in Figure 3.7 satisfies agreement and total order,
PROOF: Immediate from Lemma 25, and the fact that correct processes A-deliver
messages in each batch in the same deterministic order.
Lemma 27: (Validity) If a correct process A-broadcasts m, then all correct pro-
cesses eventually A-deliver m.
PROOF: The proof is by contradiction. Suppose soine correct process p A-broadcasts
ni, and some correct process never A-delivers m. By Lemma 26, no correct process
A-delivers ni.
By Task 1 of Figure 3.7, p R-broadcasts rn. By validity of Reliable Broadcast,
every correct process q eventually R-delivers m, and inserts m in R?de1?veredq
(Task 2). Since correct processes never A-deliver m, they never insert m in
A?de1ivered. Thus, for every correct process q, there is a time after which ni
is perrnanenfty in R?de1ive?edq --H A?de1iveredq. From Figure 3.7 and Lemma 25,
there is a k1, such that for all 1 > k1, all correct processes execute propose(1, --H),
and they do so with sets that always include rn.
Since all faulty processes eventually crash, there is a k2 such that no faulty
process executes p?opose(1, --H) with 1 > k2. Let k = max(k1, k2). Since all correct
47
processes execute pwpose(k, --H), by termination and agreement of Consensus, all
correct processes execute dec?de(k, msgSet) with the same msgSet. By uniform
validity of Consensus, some process q executed pwpose(k, msgSet). From our
definition of k, q is correct and msgSet contains m. Thus all correct processes
A-deliver m a contradiction that concludes the proof.
Lemma 28: (Uniform integrity) For any message ni, each process A-delivers m
at most once, and only if m was A-broadcast by some process.
PROOF: Suppose a process p A-delivers m. After p A-delivers rn, it inserts m in
A?de1ivered?. From the algorithm, it is clear that p cannot A-deliver m again.
From the algoritlim, p executed decide(k, msgSet) for some k and some msgSet
that contains m. By uniform validity of Consensus, some process q must have
executed propose(k, msgset). So q previously R-delivered all the messages in
rnsgSet, including rn. By uniform integrity of Rehable Broadcast, some process ?
R-broadcast m. So, r A-broadcast m.
Theorem 29: Consider any system (synchronous or asynchronous) subject to
crash failures and where Reliable Broadcast can be implemented. The algorithm
in Figure 3.7 transforms any algorithm for Consensus into an Atomic Broadcast
algorithm.
PROOF: Immediate from Lemmata 2(3, 27, and 28.
Since Reliable Broadcast can be implemented in asynchronous systems with
crash failures (Section 3.3), the above theorem shows that Atomic Broadcast is
reducible to Consensus in those systems. As we argued earlier, the converse is also
true. Thus:
Corollary 30: Consensus and Atomic Broadcast are equivalent in asynchronous
systems with crash failures.
48
The equivalence of Consensus and Atomic Broadcast in asynchronous systems im-
mediately implies that our results regarding Consensus (in particular Corollaries
15 and 22, and Theorem 23) also hold for Atomic Broadcast:
Corollary 31: Given any Weak Failure Detector YV, Atomic Broadcast is solvable
in asynchronous systems with f <n.
Corollary 32: ?W is the weakest failure detector for solving Atomic Broadcast
in an asynchronous system with f ?
Corollary 33: There is an Eventually Perfect Failure Detector ?? such that
there is no algorithm A which solves Atomic Broadcast using ?? in asynchronous
systems with f >
Furthermore, Theorem 29 shows that by "plugging in" any ?ndornized Consen-
sus algorithm (such as the ones in [CD89]) into the algorithm of Figure 3.7, we
automatically get a randomized algorithm for Atomic Broadcast in asynchronous
systems.
Corollary 34: Atomic Broadcast can be solved by randomized algorithms in
asynchronous systems with f ? ?2 crash failures.
Chapter 4
The weakest failure detector for
solving Consensus
In the previous chapter, we showed that Consensus can be solved in asynchronous
systems using unreliable failure detectors. In particular, we showed how to solve
Consensus using any Eventually Weak Failure Detector. Recall that D is an Even-
tually Weak Failure Detector if, for every failure pattern F, every failure detector
history H ? D(F) satisfies the following two properties:
o+ Weak completeness: There is a time after which every process that crashes
in F is permanently suspected in H by some process that is correct in F.
o+ Eventual weak accufacy: There is a time after which some process that is
correct in F is never suspected in H by any process that is correct in F?
In the previous chapter, ?W denoted any Eventually Weak Failure Detector. In
this chapter, we reserve ?W to denote a particular Eventually Weak Failure Detec-
tor: For any failure pattern F, ?A?(F) consists of all failure detector histories that
satisfy the two properties above. By definition, if D is any Eventually Weak Fail-
ure Detector, then for any failure pattern P, D(F) c ?W(F). Hence, D ?
(by the identity transformation). This shows that ?W is the weakest among all
49
50
Eventually Weak Failure Detectors.
From Theorem 19, the failure detection properties of ?14' are sufficient to
solve Consensus in asynchronous systems (in which a majority of the processes are
correct). In this chapter we show that they are they necessa?y: We prove that
?YV is reducible to any failure detector D that can be used to solve Consensus
(this result holds for any asynchronous system). We show this reduction by giving
a distributed algorithm TD?cw that transforms any such D into ?W. Therefore,
?W is indeed the weakest failure detector that can be used to solve Consensus in
asynchronous systems with n > 2f. Furthermore if n ? 2f, any failure detector
that can be used to solve Consensus must be strictly stronger than ?W.
The task of transforming any given failure detector D (that can be used to
solve Consensus) into ?W runs into a serious technical difficulty for the following
reasons:
o+ To strengthen our result, we do not restrict the output of D to lists of sus-
pects. Instead, this output can be any value that encodes some information
about failures. For example, a failure detector D should be allowed to output
any boolean formula, such as "(not p) and (q or ?)" (i.e., p is up and either
q or r has crashed)?or any encoding of such a formula. Indeed, the output
of D could be an arbitrarily complex (and unknown) encoding of failure in-
formation. Our transformation from D into ?>V must be able to decode this
information.
o+ Even if the failure information provided by D is not encoded, it is not clear
how to extract from it the failure detection properties of ?14'. Consequently,
if D is given in isolation, the task of transforming it into ?W may not be
possible.
Fortunately, since D can be used to solve Consensus, there is a corresponding al-
gorithm, Consensusv, that is somehow able to "decode" the information about
51
failures provided by D, and knows how to use it to solve Consensus. Our reduc-
tion algoritlim, Tv??w uses Consensusp to extract this information from D and
transforms it into the properties of ?W.
Since this chapter focuses on the proof of a subtle lower bound, we need to
extend our model of distributed computation of Section 3.1 and make it more pre-
cise. This extended model, presented in the next section, introduces the following
new concepts:
o+ The range of values output by a failure detector: As mentioned earlier, the
main result of this chapter applies to failure detectors with arbitrary sets of
output values. We formalise this by associating each failure detector with a
range of output values.
o+ The enrnronrnent: In the previous chapter, we proved that any algorithm that
uses ?W to solve Consensus requires n > 2f. With other failure detectors the
requirements may be different. For example, there is a failure detector that
can be used to solve Consensus only if P1 and P2 do not both crash. In general
whether a given failure detector can be used to solve Consensus depends upon
assumptions about the underlying "environment". An environment (of an
asynchronous system) is set of possible failure patterns. We will show that
if D can be used to solve Consensus in some environment ?, then there is a
transformation algorithm TD??w that transforms D into ?)/V in ?.
4.1 The model
4.1.1 Failure detectors
Associated with each failure detector is a range 1Z of values output by that failure
detector. A failure detector history H with range 1Z is a function from II x T to
?. As before, H(p, t) is the value of the failure detector module of process p at
time t. A failure detector D is a function that maps each failure pattern F to a set
52
of failure detector histories with range ?? (where ?D denotes the range of failure
detector outputs of D). D(F) denotes the set of possible failure detector histories
permitted by D for the failure pattern F.
In Chapter 3, we considered a special class of failure detectors. Each failure
detector module output a set ofprocesses that are suspected to have crashed. In
other words, for each failure detector D, ?D = 2n
We now formally define the failure detector ?W mentioned in the introduction.
Each failure detector module of ?>V outputs a set of processes that are suspected
to have crashed: ??? = 2n For each failure pattern F, ?W(F) is the set
of all failure detector histories H?w with range l??w that satisfy the following
properties:
1. There is a time after which every process that crashes in F is always suspected
by some process that is correct in P:
T, Vp c crashed(F), ?q c co?rect(F),Vt' > t : p c H?w(q,t')
2. There is a time after which some process that is correct in P is never suspected
by any process that is correct in F:
Bt E I,?p ? correct(F), Vq ? correct(F), Vt' > t : p ? H?w(q,t')
4.1.2 Algoritlims
We model the asynchronous communication channels as a message buffer which
contains messages of the form (p, data,q) indicating that process p has sent data
addressed to process qand qhas not yet received that message. An algorithmAis a
collection of n deterministic automata, one for each of the processes. A(p) denotes
the automaton running on process p. Computation proceeds in steps oftke 9zven
algorithm A. In each step of A, process p performs atomically the following three
phases:
53
Receive phase: p receives a single message of the form (q,data,p) from the mes-
sage buffer, or a "null" message, denoted A, meaning that no message is
received by p during this step.
Failure detector query phase: p queries and receives a value from its failure
detector module. We say that p sees a value d when the value returned by
p's failure detector module is d.
Send phase: pchanges its state and sends a message to all the processes according
to the automaton A(p), based on its state at the beginning of the step, the
message received in the receive phase, and the value that p sees in the failure
detector query phase.1
The message actually received by the process p in the receive phase is chosen
non-determin?stically from amongst the messages in the message buffer destined
to p, and the null message A. The null message may be received even if there
are messages in the message buffer that are destined to p: the fact that m is in
the message buffer merely indicates that m was sent to p. Since ours will be a
model of asynchronous systems, where messages may experience arbitrary (but
finite) delays, the amount of time m may remain in the message buffer before
it is received is unbounded. Indeed, our model will allow a message sent later
than another to be received earlier than the other. Though message delays are
arbitrary, we also want them to be finite. We model this by introducing a liveness
assumption: every message sent will eventually be received, provided its recipient
makes "sufficiently many" attempts to receive messages. All this will be made
more precise later.
We also remark that the non-determinism arising from the choice of the message
to be received reflects the asynchrony of the message buffer it is not due to non-
?In the send phase, psends a message to all the processes atomically. As was shown in [FLP85j,
the ability to do so is not sufficient for solving Consensus. An alternative formulation of a step
could restrict a process to sending a message to a single process in the send phase. We can show
that both formulations are equivalent for our purposes (see Section 4.6.1).
54
deterministic choices made by the process. The automaton A(p) is deterministic
in the sense that the message that p sends in a step and p's new state are uniquely
determined from the present state of p, the message p received during the step and
the failure detector value seen by p during the step.
To keep things simple we assume that a process p sends a message m to q at
most once. This allows us to speak of the contents of the message buffer as a set,
rather than a multiset. We can easily enforce this by adding a counter to each
message sent by p to q so this assumption does not damage generality.
4.1.3 Configurations, runs and environments
A configuration is a pair (s, M), where 5 is a function mapping each process p to
Its local state, and M is a set of triples of the form (q, data, p) representing the
messages presently in the message buffer. An initial configuration of an algorithm
A is a configuration (s, M), where s(p) is an initial state of A(p) and M = ?. A
step of a given algorithm A transforms one configuration to another. A step of
A is uniquely determined by the identity of the process p that takes the step, the
message m received by p during that step, and the failure detector value d seen by
p during the step. Thus, we identify a step of A with a tuple (p, m, d, A). If the
message received in that step is the null message, then m = A, otherwise m is of the
type (--H, --H,p). We say that a step e = (p, m, d, A) is applicable to a configuration
C = (s,M) if and only if m e M u fA?. We write e(C) to denote the unique
configuration that results when e is applied to C.
A schedule S of algorithrn A is a finite or infinite sequence of steps of A. S1
denotes the empty schedule. We say that a schedule S of an algorithm A is
applicable to a configuration C if and only if (a) S = S1, or (b) &[1] is applicable
to C, S[2] is applicable to S[lj(C), etc.2 If S is a finite schedule applicable to C,
S(C) denotes the unique configuration that results from applying S to C. Note
2We denote by v[iJ the ith element of a sequence v.
55
51(C) C for all configurations C. We say that C' is a configuration of (5, C) if
there is a prefix 5' of 5 such that C'
A partial run of algorntkm A using a failure detector D is a tuple
II = (P, HD, I, 5, T? where F is a failure pattern, HD ? D(F) is a failure de-
tector history, I is an initial configuration of A, 5 is a finite schedule of A, and T
is a finite list of increasing time values (indicating when each step in 5 occurred)
such that 5 = ITI, 5 is applicable to I, and for all i ? 5?, if 5[ij is of the form
(I), m, d, A) then:
o+ p has not crashed by time T[i], i.e., p ? F(T[iJ)
o+ d is the value of the failure detector module of p at time T[iJ, i.e., d =
HD(p, T[i])
Informally, a partial run of A using D represents a finite point of some execution
of A using D.
A run of an algornthm A using a failure detector D is a tuple R = (F, Hv, I, 5, T?
where F is a failure pattern, HD C D(F) is a failure detector history, I is an ini-
tial configuration of A, 5 is an infinite schedule of A, and T is an infinite list of
increasing time values indicating when each step in 5 occurred. In addition to sat-
isfying the above properties of a partial run, a run must also satisfy the following
properties:
o+ Every correct process takes an infinite number of steps in 5. Formally:
Vp c correct(F), Vi, ?J > i : 5[j] is of the type(p, --H,
o+ Every message sent to a correct process is eventually received. Formally:
Vp ? correct(F),VC = (s,M) of (5, I): m = (q,data,p) ? M ?
5[i] is of the type (p, m, --H,A))
56
In the Appendix, we prove that any algorithm that uses ?W to solve Consensus
requires n > 2f. With other failure detectors the requirements may be different.
For example, there is a failure detector that can be used to solve Consensus only
if Pi and P2 do not both crash. In general whether a given failure detector can
be used to solve Consensus depends upon assumptions about the underlying "en-
vironment". Formally, an environment ? (of an asynchronous system) is set of
possible failure patterns.3
4.2 The Consensus problem
In this chapter, we study a weaker form of Consensus than that defined in Chapter
3. Since we prove a lower bound, this only strengthens our result. We first define
this weaker Consensus and then describe how it differs from the previous version.
With the Consensus problem in this chapter, each process p has an initial value,
o or 1, and must reach an irrevocable decision on one of these values. Thus, the
algorithm of process p, A(p), has two distinct initial states j?0 and cr?i signifying
that p's initial value is 0 or 1. A(p) also has two disjoint sets of decision states ??0
and ??i If p enters a state in ??k' we say that p has decided k.
We say that algorithm A uses failu?e detector D to solve Consensus in envz-
ronment ? if every run R (F, HD, 1, s, T) of A using D where F ? ? satisfies:
Termination: Every correct process eventually decides some value. Formally:
Vp ? correct(F), ?C = (s, M) of (s, I): s(p) ? ?P0 ?
Validity: If a correct process decides v, then v was proposed by some process.
Formally, let 1 = (so, Mo):
3In a synchronous system, assumptions about the underlying environment may also include
other characteristics such as the relative process speeds, the maximum message delay, the degree
of clock synchronization, etc. In such a system, a more elaborate definition of an environment
would be required.
57
Vp Ei correct(F), Vk ? ?O, 11: = (s, M) of (S, I):
s(p) Ei ??k) ?			? II : so(q) =
Uniform Integrity: Every process decides at most once. Formally:
Vp c II, Vk Ei ?O, 1?, V?, S[?J(I) = (s, M), V?' > ?, S[?'j(I) = (s', A")
s(p) c			? s'(p) ?
Agreement: No two correct processes decide differently. Formally:
Vp,p' Ei correct(F),VC = (s,M) of(S,I),Vk,k' El fO,lJ
El ??k A s(p') El ??k1) ? k =
The Consensus problem defined above differs from the previous chapter's version
in the following ways:
o+ In the previous chapter, Consensus is multi-valued, i.e., each process is al-
lowed to propose a value from an arbitrary range. In this chapter, Consensus
is binary, i.e., each process can only propose 0 or 1. It is not difficult to show
that both versions of Consensus are equivalent (i.e., reducible to each other)
in asynchronous systems.
o+ The validity property of the above Consensus imposes a requirement on the
decision value of correct processes only. In contrast, the uniform validity
property of the previous version of Consensus also imposed this requirement
on every faulty process that decides.
o+ In the previous chapter, processes could execute several independent in-
stances of Consensus. For each such instance, every process makes a distinct
proposal and reaches a corresponding decision.4 In this chapter, processes
4The ability to solve "repeated" Consensus was crucial to the reduction of Atomic Broadcast
to Consensus (see the algorithm in Figure 3.7).
58
execute a single instance of Consensus. Thus, we can model the unique pro-
posal of each process as its initial value.
4.3 Reducibility
We now define what it means for an algorithm TD?Th to transform a failure detector
D into another failure detector D'in an environment ?. Algorithm TD??D! uses D
to maintain a variable output? at every process p. Algorithm TD??D! transforrns D
into D' in S if and only if for every run R = (F, HD, I, 5', T? of TD??DI using D,
where F ? ?, outou? ? D'(F). If there is an algorithm TD??D! that transforms D
into D'in ?, we write D ?? D' and say that D' is reducible to D in ?; we also say
that D' is weake? than D in g
4.4 An outline of the result
In Section 3.5.2 we showed that ?W can be used to solve Consensus in any environ-
ment in which n > 2f. We now show that ?`4' is weaker than any failure detector
that can be used to solve Consensus. This result holds for any environment S. To-
gether with the result in Section 3.5.2, this implies that ?W is indeed the weakest
failure detector that can be used to solve Consensus in any environment in which
n> 2f.
To prove our result, we first define a new failure detector, denoted ?, that is
at least as strong as ?)`V. We then show that any failure detector D that can be
used to solve Consensus is at least as strong as ?. Thus, D is at least as strong as
The output of the failure detector module of ? at a process p is a single process,
q, that p currently considers to be correct; we say that p trusts q. In this case,
= II. For each failure pattern F, Q(F) is the set of all failure detector histories
H? with range 1?? that satisfy the following property:
59
o+ There is a time after which all the correct processes always trust the same
correct process:
?t c T,Bq c correct(F),Vp ? co??ect(F),Vt' > t : H?(p,t') --H--H q
As with ?)`V, the output of the failure detector module of Q at a process p may
change with time, i.e., p may trust different processes at different times. Further-
more, at any given time t, processes p and q may trust different processes.
Theorem 35: For all environments ? Q ?&
PROOF: [Sketch] The reduction algorithm T???w that transforms Q into ?W
is as follows. Each process p periodically sets 0UtPUtp H II --H fq?, where q is
the process that p currently trusts according to ?. It is easy to see that (in any
environment ?) this output satisfies the two properties of ?W. El
Theorem 36: For all environments ?, if a failure detector D can be used to solve
Consensus in ?, then D ??
PROOF: The reduction algorithm TD?+? is shown in Section 4.5. It is the core of
our result.			El
Corollary 37: For all environments ?, if a failure detector D can be used to solve
Consensus in ?, then D ??
PROOF: If D can be used to solve Consensus in ?, then, by Theorem 36, D --H?? ?.
From Theorem 35, ? ?s ??? By transitivity, D ?? ?W. El
In Section 3.5.2 we proved that, for all environments ? in which n > 2f, ?W can
be used to solve Consensus. Together with Corollary 37, this shows that:
Theorem 38: For all environments ? in which n> 2f, ?>V is the weakest failure
detector that can be used to solve Consensus in ?.
60
4.5 The reduction algorithm
Let & be an environment, D be a failure detector that can be used to solve Consen-
sus in ?, and Consensusv be the Consensus algorithm that uses D. We describe
an algorithm TD??? that transforms D into Q in ?. Intuitively, this algorithm
works as follows, Fix an arbitrary run of TD?? using D in ?, with failure pat-
tern P ? ?, and failure detector history HD ? D(F). We shall first construct an
infinite directed acyclic graph, denoted C, whose vertices are some of the failure
detector values that occur in HD, and whose edges are consistent with the time
at which these values occur, We then show that & induces a simulation forest T
that encodes an infinite set of possible runs of Consensu?. Finally, we show how
to extract from T the identity of a process p* that is correct in F.
The induced simulation forest is infinite and thus cannot be computed by any
process. Ilowever, the information needed to extract p* is present in a finite sub-
graph of the forest. It will be sufficient for each correct process I) to construct ever
increasing finite approximations of the simulation forest T that will eventually in-
clude this crucial finite subgraph. At all times, p uses its present approximation of
T to select the identity of some process: once p's approximation of T includes the
crucial finite subgraph, the selected process will be p* (forever). Thus, there is a
time after which all correct processes trust the same correct process, p*--Hwhich is
exactly what ? requires,
We say that a process is correct (crashes) if it is correct (crashes) in P. For
simplicity, we assume that a process p sees a value d at most once (this can be
enforced by tagging a counter to each value seen). For the rest of this chapter,
whenever we refer to a run of Consensusp, we mean a run of Consensus? using D.
Furthermore, we only consider schedules of Consensus?, and therefore we write
m, d) instead of (p, m, d, Consensus?) to denote a step.
61
4.5.1 A DAG and a forest
Given the failure pattern F and the corresponding failure detector history HD C
D(F) that were fixed above, let G be any infinite directed acyclic graph with the
following properties:
1. The vertices of C are of the form [p, dj where p ? II and d C ?D If [p,dJ is a
vertex of C, then there is a time t such that p ? F(t) and d = HD(p, t) (i.e.,
at time 1, p has not crashed and the value of p's failure detector module is
2. If [qi,di] [q2,d2J is an edge of C and d1 = H?(qi,ti) and d2 =
then t1 ? t2.
3. G is transitively closed.
4. Let p be any correct process and V be a finite subset of vertices of C. There is
a failure detector value d such that for all vertices [p', d'] in V, [p',d'] [p,d]
is an edge of G.
Note that such a DAG represents only a "sampling" of the failure detector values
that occur in HD. In particular, we do not require that it contain all the values
that occur in HD or that it relate (with an edge) all the values according to the
time at which they occur. However, Property 4 implies that the DAG contains
infinitely many "samplings" of the failure detector module of each correct process.
Lemma 39: Let V be any finite subset of vertices in G. C has an infinite path g
such that:
o+ There is an edge from every vertex of V to the first vertex of 9.
o+ If [p, --HJ is a vertex of g then p is correct; for each correct p, there are infinitely
many vertices [p,--HJ in 9.
62
PROOF: By repeated application of Property 4.			E]
Let g = [qi, d1], [q?, ....... be any (finite or infinite) path of C. A schedule 5 is
compatiblewithgifithasthesamelengthas9, andS = (qi,mi,di),(q2,m2,d2),...,
for some (possibly null) messages m1,m2,... We say that 5 is compatible with G
if it is compatible with some path of U.
Let I be any initial configuration of Consensus?. We define the simulation tree
Tk induced by C and I as follows. The vertices of ?k are the finite schedules 5
that are compatible with 0 and are applicable to I. The root of Tk is the empty
schedule 5i. There is an edge from vertex 5 to vertex 5' if and only if 5' = 5 e
for a step e;5 this edge is labeled e. Wfth each (finite or infinite) path in ?k. we
associate the unique schedule 5 = e1,e2,... ,???... consisting of the sequence of
labels of the edges on that path. Note that if a path starts from the root of Tk
and it is finite, the schedule 5 associated with it is also the last vertex of that path.
Lemma 40: 5 is a schedule associated with a path of ?k that starts from the
root if and only if 5 is a schedule compatible with 0 and applicable to I
PROOF: The lemma obviously holds if 5 is a finite schedule (this is immediate
from the definitions). Now let 5 = e1, .2.... ....... be an infinite schedule, where
= [q?,m?,d?J. We define 5o = ?i, Si = e1, 52 = Si e2, and in general
si = s?--H? ?i for all i = 1,2,...
Assume that 5 is compatible with 0 and applicable to I. We must show
that 5 is a schedule associated with a path of ?k that starts from the root. To
see this, note that for all i > 0, S? is a finite schedule that is also compatible
with 0 and applicable to I. Thus, all the schedules 5o, 51,52,... , 5i--Hi,S.,.. are
vertices of ?k Since S? = 5i--Hi ?i, the edge from 5i--Hi to S? is labeled ?i, for all
i > 1. Thus, 5 = e1, ?2,???,?i,?? is the schedule associated with the infinite path
5o			Si			52			...			5i--Hi			Si			... of Tk; this path starts from the root
5If ti, w are sequences and v is finite then v w denotes the concatenation of the two sequences.
63
so si.
Assume that 5 is a schedule associated with an infinite path of Tk that starts
from the root. We must show that 5 is compatible with G and is applicable to I
First note that for all i, 5? is a vertex in Tk, thus 5? is compatible with C and
is applicable to I. Since S? [qi,m1,d1],[q21m2,d2],... , [qi, mi, di] is compatible
with G, C must contain the path 7rj [qi, d1], [q?, ....... , [qi, di] (for all i). Note
that, for all i, iri+1 ?i [qi+i, di+i] is an extension of the path lrj in C. Therefore,
C contains the infinite path [qi, d1], [q?, ....... ,[qi,di],... So 5 is compatible with
C. Furthermore, since all 5j'5 are applicable to I, by definition of applicability,
the infinite schedule 5 is also applicable to I. Thus, 5 is compatible with C and
applicable to I.			E]
The following two lemmata show that the finite and infinite paths of Tk cor-
respond to partial runs and runs of Consensusp with initial configuration I.
Lemma 41: Let 5 be a schedule associated with a finite path of Tk that starts
from the root. There is a sequence of times T such that (F, HD, I, 5, T? is a partial
run of Consensus?.
PROOF: By Lemma 40, 5 is applicable to I and compatible with C. Thus 5 is
compatible with some finite path 9 [qi, d1], [q?, d2],. ..,[qi,di],... , [qk, dk] of G.
From Property 1 of C (applied to every vertex of the path 9), there is a sequence
T =tl,t2,...,ti,...,tk of times such that for all i, 1 ? i ? k di H?(qi,ti) and
qi ? F(ti). From Property 2 of G (applied to every edge of the path 9), for all i,
1 ? i < k, tj ? ti+i. Thus T is a sequence of increasing times, and, by definition,
(F, HD, I, 5, T) is a partial run of Consensu?.
Lemma 42: Let 5 be a schedule associated with an infinite path of ?k that starts
from the root. If in 5 every correct process takes an infinite number of steps and
every message sent to a correct process is eventually received, there is a sequence
of times T such that (F, HD, I, 5, T? is a run of Consensu?.
64
PROOF: Similar to Lemma 41
El
The following lemmata show some "richness" properties of the simulation trees
induced by C.
Lemma 43: For any two initial configurations I and I', if 5 is a vertex of Tk
and is applicable to I' then 5 is also a vertex of ?k
PROOF: Follows directly from the definitions.			El
Lemma 44: Let 5 be any vertex of Tk and p be any correct process. Let m be
a message in the message buffer of 5(1) addressed to p or the null message. For
some d, 5 has a child 5 (p, m, d) in Tk.
PROOF: From the definition of ?k 5 is compatible with some finite path 9 of 0
and applicable to I. Let v denote the last vertex of 9. By Property 4, there is
a d such that V H [p,d] is an edge of 0. Therefore, 9. [p,d] is a path of 0, and
5 (p, m, d) is compatible with 0.
It remains to show that 5 (p, rn, d) is applicable to I. Since 5 is applicable
to I, it suffices to show that (p, m, d) is applicable to 5(1). But this is true since,
by hypothesis, rn is in the message buffer of 5(1) and addressed to p, or the null
message.			El
Lemma 45: Let 5 be any vertex of Tk and p be any process. Let m be a message
in the message buffer of 5(1) addressed to p or the null message. Let 5' be a
descendent of 5 such that, for some d, 5'. (p, m, d) is in Tk. For each vertex 5"
on the path from 5 to 5' (inclusive), 5" (p, m, d) is also in ?k
PROOF: Since they are vertices of `k, 5, 5" and 5' (p, m, d) are compatible with
some finite paths 9, 9.9" and 9.9". 9'. [p,dj of 0, respectively. From Property
3 (transitive closure) of 0, 9. 9" [p,d] is also a path of 0. So 5" (p, m, d) is
compatible with this path of 0. We now show that 5" (p, m, d) is also applicable
65
to I, and therefore it is a vertex of Tk.
Since 5" is a vertex of Tk, 5" is applicable to I. If m A, then (p,m,d) is
obviously applicable to S"(I). Now suppose m # A. Since 5' (p, m, d) is a vertex
of Tk, (p, ra, d) is applicable to 5'(i), and thus m is in the message buffer of S'(I).
Since each message is sent at most once and m is in the message buffers of 5(1)
and S'(I), there is no edge of the type (p, m, --H) on the path from 5 to 5'. So m
is also in the message buffer of 5?(I), and (p, m, d) is applicable to 5"(I). z
Lemma 46: Let 5, So, and 5i be any vertices of Tk. There is a finite schedule
E containing only steps of correct processes such that:
1. 5 E is a vertex of ?k and all correct processes have decided in 5 E(I)
2. For i 0,1, if E is applicable to Si(I) then S? E is a vertex of ?k
PROOF: Since 5 is a vertex of ?k. 5 is compatible with some finite path 9 of
G and is applicable to I. Similarly, 5o and Si are compatible with some finite
path 90 and 9i, respectively, of C. From Lemma 39 (applied to the last vertices of
9,90 and 91), G has an infinite path 9oo [qi,d1],[q2,d2],... ,[w,dj],... with the
following two properties:
1. There is an edge from the last vertex of 9, go and 91 to the first vertex of g?.
(Thus, 9 9oo, go 9oo, and 91 9oo are infinite paths in C.)
2. If [p,--H] is a vertex of 9oo then p is correct; for each correct p, there are
infinitely many vertices [p,--H] in g?.
We now show how to construct the required schedule E. Consider the infinite
sequence of schedules ?0, ?i, 52,???,53,?? constructed by the algorithm in Figure
4.1. An easy induction shows that for all j > 0, 5? is applicable to I and is
compatible with 9. [qi, dij ... [w. dj], a prefix of the path 9. 9oo in G. So, for
all j > 0, S? is a vertex of Tk. Consider the infinite path of Tk that starts from
the root of Tk then goes to ?0 5, and then to 5?, .2...., S?,... The infinite
66
j?O
5 fS0 is compatibTh with 9 and apphcabTh to I?
repeat forever
j?j+1
Let [%`, dj] be the j-th vertex of path 9oo
Let rn) be the oldest message addressed to q? in the message buffer of Si--H1(I)
(if no such message exists, nij = A)
(%,mj,dj)
5j?1 e? (S? is compatible with 9 [qi, dij ... , [q?, dj] and applicable to I?
Figure 4.1: Generating schedule 5. Eoc, compatible with path 9. 9oc, in ?k
schedule associated with that path is 500 5. e1 - e2 -. ..`.?... Note that schedule
E00 e2 -...-??... is compatible with path g? of G. By Property (2) of path
g?, every correct process p takes an infinite number of steps in E00 (and thus also
in 500 = 5. E0o). Since in each one of these steps p receives the oldest message
that is addressed to it, every message sent to p (in 500) is eventually received. By
Lemma 42, there is a T such that R = (F, HD, I, 500, T? is a run of Consensu?.
From the termination requirement of Consensus, 5? has a finite prefix 5d such
that all correct processes have decided in 5d(1). There are two cases:
o+ 5d is a prefix of 5. Since decisions are irrevocable, all correct processes
remain decided in 5(1). Thus Si, the empty schedule, is the required E.
o+ 5 is a prefix of 5d Thus 5d --H 5- E where E is a finite prefix of E00. Since
E00 is compatible with g00, E is compatible with a prefix of 9oc- Now consider
5o (the following argument also applies to 5i)- Since 5o is compatible with
go, 5o - E is compatible with a prefix of go - 9oo, a path in G. So, 5o - E is
compatible with G. If 5o - 1' is also applicable to I, then, by the definition
of ?k, it is a vertex of ?k- The same argument holds for Si. It remains to
show that E contains only steps of correct processes. This is immediate from
Property (2) of 9oc and from the fact that E is compatible with a prefix of
g00.
67
Let 1? 0 ? i < n denote the initial configuration of Consensusv in which the
initial values of Pi ..Pi are 1, and the initial values of Pi+i p? are 0. The
simulation forest induced by G is the set ??k, ?k1, . , ?ry of simulation trees
induced by G and initial configurations 10, 11,... , In.
4.5.2 Tagging the simulation forest
We assign a set of tags to each vertex of every tree in the simulation forest induced
by G. Vertex 5 of tree Tk gets tag k if and only if it has a descendent 5' such
that some correct process has decided k in S'(I). liereafter, T? denotes the tagged
tree %, andY denotes the tagged simulation forest fY0, T',. . . ,
Lemma 47: Every vertex of T? has at least one tag.
PROOF: From Lemma 46, every vertex 5 of Y? has a descendent 5' = 5 E (for
some E) such that all correct processes have decided in 5t(1i).			E)
A vertex of T? is monovalent if it has only one tag, and bivalent if it has both
tags, 0 and 1. A vertex is 0-valent if it is monovalent and is tagged 0; 1-valent is
similarly defined.
Lemma 48: Every vertex of T? is either 0-valent, 1-valent, or bivalent.
PROOF: Immediate from Lemma 47.
Lemma 49: The ancestors of a bivalent vertex are bivalent. The descendents of
a k-valent vertex are k-valent.
PROOF: Immediate from the definitions.
Lemma 50: If vertex 5 of T? has tag k, then no correct process has decided 1 --H k
in 5(P).
PROOF: Since 5 has tag k, it has a descendent 5' such that a correct process p has
decided k in 51(Ji). From Lemma 41, there is a T such that 1? = (F, HD, I?, 5', T?
68
is a partial run of Consensu?. Since p has decided k in 51(1i), from the agreement
requirement of Consensus, no correct process has decided 1 k in 51(1i). Since 5'
is a descendent of 5, no correct process could have decided 1 --H k in 5(Ji).
Lemma 51: If vertex 5 of T? is bivalent, then no correct process has decided in
5(1i).
PROOF: Immediate from Lemma 50.
Recall that in 10 all processes have initial value 0, while in 1? they all have initial
value 1.
Lemma 52: The root of T0 is 0-valent; the root of T? is 1-valent.
PROOF: We first show that the root of T0 is 0-valent. Suppose, for contradiction,
that the root of T0 has tag 1. There must be a vertex 5 of T0 such that some
correct process has decided 1 in 5(10). From Lemma 41, there is a T such that
I? (F, HD, 10, 5, T? is a partial run of Consensusp. R violates the validity
requirement of Consensus a contradiction. Thus the root of T0 cannot have a
tag of 1. From Lemma 47, the root of T0 has at least one tag: thus it is 0-valent.
By a symmetric argument, the root of T? is 1-valent.
Index i is crntz'cal if the root of T? is bivalent, or if the root of yi--H1 is 0-valent
while the root of Ti is 1-valent. In the first case, we say that index i is bivaThnt
crntical; in the second case, we say that i is nionovalent crnticaf
Lemma 53: There is a critical index i, 0 ? i ?
PROOF: Apply Lemmata 48 and 52 to the roots of T0, ..... . , T?.
The critical index i is the key to extracting the identity of a correct process. In
fact, if i is monovalent critical, we shall prove that Pi must be correct (Lemma 55).
69
Root
0
5'
S.(p,m,d)
Figure 4.2: A fork--Hp is the deciding process
Root
0-			-			--
5'
to'
e
t(w
?p?vot?
St			5. (p,rn,d)
Figure 4.3: A hook?p is the deciding process
If i is bivalent critical, the correct process will be found by focusing on the tree y%,
as explained in the following section.
4.5.3 Of hooks and forks
We describe two types of finite subtrees of T? referred to as decision gadgets of Ti.
Each type of decision gadget is rooted at the root Si of T? and has exactly two
leaves: one 0-valent and one 1-valent. The least common ancestor of these leaves
is called the pivot The pivot is clearly bivalent.
The first type of decision gadget is called a fo?, and is shown in Figure 4.2.
70
The two leaves are children of the pivot, obtained by applying different steps of
the same process p. Process p is the dec?ding process of the fork, because its step
after the pivot determines the decision of correct processes.
The second type of decision gadget is called a hook, and is shown in Figure 4.3.
Let S be the pivot of the hook. There is a step e such that S e is one leaf, and
the other leaf is S (p, m, d) e for some p, m, d. Process p is the deciding process
of the hook, because the decision of correct processes is determined by whether p
takes the step (p, m, d) before e.
We shall prove that the deciding process p of a decision gadget must be correct
(Lemma 57). Intuitively, this is because if p crashes no process can figure out
whether p has taken the step that determines the decision value. The existence
df such a critical "hidden" step is also at the core of many impossibihty proofs
starting with [FLP85]. In our case, the "hiding" is more difficult because now
processes have recourse to the failure detector D. Despite this, the hiding of the
step of the deciding process of a decision gadget is still possible.
Lemma 54: If index i is bivalent critical then T? has at least one decision gadget
(and hence a deciding process).
PROOF: Starting from the bivalent root of T?, we generate a path ir in T?, all
the vertices of which are bivalent, as follows. We consider all correct processes
in round-robin fashion. Suppose we have generated path S so far, and it is the
turn of process p. Let m be the the oldest message destined to p that is in the
message buffer of s(Ii).6 (If no such message exists, we take m to be the null
message.) We try to extend the path S so that the last edge in the extension
corresponds to p receiving m and the target of that edge is a bivalent vertex. The
path construction ends if and when such an extension is no longer possible. This
construction is shown in Figure 4.4. Each iteration of the loop extends the path
5By a slight abuse of notation we identify a finite path from the root and its associated
schedule.
71
5+-			sj			?Si is the bivalent wot of T??
repeat forever
Let p be the next correct process in round-robin order
Let m be the oldest message addressed to p in the message buffer of 5(li)
(if no such message exists, m A)
if 3 has a descendent 5' such that, for some d, 5' (p, m, d) is a bivalent
vertex in T? then 5			5' (p, ni, d)			f5 is bivaThntj
else exit
Figure 4.4: Generating path r in T?
by at least one edge. Let r be the path generated by these iterations; r is finite
or infinite depending on whether the loop terminates.
Claim 1: ir is finite.
PROOF: Suppose, for contradiction, that ir is infinite. Let 5 be the schedule asso-
ciated with ?. By construction, in 5 every correct process takes an infinite number
of steps and every message sent to a correct process is eventually received. By
Lemma 42, there is aT such that R (F,HD,P,5,T? is a run of Consensu?.
By construction, all vertices in ir are bivalent. By Lemma 51, no correct process
decides in R, thus violating the termination requirement of Consensus?a contra-
diction.			?claim 1
Let 5 be the last vertex of Ir (clearly, 5 is bivalent). Let p be the next correct
process in round-robin order when the loop in Figure 4.4 terminates. Let m be the
oldest message addressed to p in the message buffer of 5(li) (if no such message
exists, ni is the null message). The loop exit condition and Lemma 48 imply that
All descendents 5' (p, m, --H) of 5 are monovalent.
From Lemma 44, for some d, 5 has a child 5 (p, m, d) in Ti. By (*), 5. (p, m, d)
is monovalent. Without loss of generality, assume it is 0-valent.
72
qSi
Fork (for case 1)
0-valent 0?(P?m?d)$ S (bivalent)
(p,
So (pivot of hook)
0
0-valent
1-valent
0
1-valent
llook (for case 2)
Figure 4.5: The decision gadgets in T? if? is bivalent critical
73
Claim 2: For some d' there is a descendent 5' of 5 such that 5' (p, rn, d') is a
1-valent vertex of T?, and the path from 5 to 5' contains no edge labeled (p, rn, --H).
PROOF: Since 5 is bivalent, it has a descendent 5* such that some correct process
has decided 1 in 5*(1i), From Lemmata 47 and 50, S? is 1-valent. There are two
cases:
1. The path from 5 to 5? does not have an edge labeled (p, rn, --H). Suppose
m $ A. Since m is in the message buffer of 5(P) and p does not receive m
in the path from 5 to 5*, m is still in the message buffer of 5*(1i), From
Lemma 44, for some d', 5*. (p, ni, d') is in T?. Since 5* is 1-valent, by Lemma
49, 5*. (p, m, d') is also 1-valent. In this case, the required 5' is 5*,
2. The path from 5 to 5* has an edge labeled (p, m, --H). Let (p, m, d') be the
first such edge on that path. Let S' be the source of this edge. By (*),
S' (p, m, d') is monovalent. Since 5' (p, m, d') has a 1-valent descendent 5*,
by Lemma 49, 5' (p, m, d') is 1-valent. ?c1aim 2
Consider the vertex 5' and edge (p, m, d') of Claim 2. By Lemma 45, for each
vertex 5" on the path from 5 to 5' (inclusive), 5" (p, ?, d') is also in T?. By
(*), all such vertices 5" (p, m, d') are monovalent. In particular, 5. (p, m, d') is
monovalent. There are two cases (see Figure 4.5):
1. 5. (p, m, d') is 1-valent. Since 5 (p, m, d) is 0-valent, T? has a fork with pivot
5,
2. 5. (p, m, d') is 0-valent. Recail that 5' (p, m, d') is 1-valent and for each
vertex 5" between 5 and 5', 5" (p, m, d') is monovalent. Thus, the path
from 5 to 5' must have two vertices 5o and 5i such that 5o is the parent of
5i, 5o (p,m,d') is 0-valent and 5i (p,m,d') is 1-valent. Hence, T? has a
hook with pivot ?o
74
4.5.4 Extracting the correct process
By Lemma 53, there is a critical index i. If i is monovalent critical, Lemma 55
below shows how to extract a correct process. If i is bivalent critical, a correct
process can be found by applying Lemmata 54 and 57
Lemma 55: If index i is monovalent critical then p? is correct.
PROOF: Suppose, for contradiction, that p? crashes. By Lemma 46(1) (applied
to the root S S1 of Ti), there is a finite schedule E that contains only steps
of correct processes (and hence no step of p?) such that all correct processes have
decided in E(I?). Since index i is monovalent critical, the root S1 of T? is 1-valent.
llence all correct processes must have decided 1 in E(I?).
I? and 1i--Hi only differ in the state of p?. Since S is applicable to Ii and does not
contain any steps of p?, an easy induction on the number of steps in 5' shows that:
(a) 5' is also applicable to ji--Hi, and (b) the state of all processes other than p? are
the same in 3(li) and 5'(Ji--Hi), Using Lemma 43, (a) implies that 5' is also a vertex
of T??1. By (b), all correct processes have decided 1 in 5'(Ji--Hi), Thus the root
of yi--H1 has tag 1. Since i is monovalent critical, the root of yi--H1 is 0-valent--Ha
contradiction.			El
Lemma 56: Let 5' be any bivalent vertex of T?, and 5'o, 5'i be any 0-valent and
1-valent descendents of 5'. If the paths from 5' to 5'o and from 5' to 5'i contain only
steps of the form (p, --H, --H), then p is correct.
PROOF: Suppose, for contradiction, that p crashes. From Lemma 46, there is a
schedule E containing only steps of correct processes (and hence no step of p) such
that:
1. 5' E is a vertex of T? and all correct processes have decided in 5' E(Ii).
2. For k 0,1, if 5'k E is applicable to Ii then 5'k E is a vertex of T?.
75
PS'
Si (1-valent)9?
(bivalent)
0 S0 (0-valent)
$ 5 E (Correct processes assumed to have
decided 0)
Si E
Figure 4,6: Lemma 53
Without loss of generality assume that all correct processes decided 0 in 5
Refer to Figure 4.6. Since all steps in the path from 5 to Si are steps of p, the state
of every process other than p is the same in 5(fi) and in Si(I?). Furthermore, any
message addressed to a process other than p that is in the message buffer in 5(1?)
is still in the message buffer in Si(I?). Since E is applicable to S(I?) and does not
contain any steps of p, an easy induction on the number of steps in E shows that:
(a) E is also applicable to 51(1?), and (b) the state of every process other than p is
the same in 5 p(1i) and Si p(1i). By (ii), (a) implies that Si E(P) is a vertex
in T?, By (b), all correct processes decide 0 in Si E(P). Thus Si, has tag 0. But
Si is 1-valent a contradiction.			E]
Lemma 57: The deciding process of a decision gadget is correct.
PROOF: Let ? be any decision gadget of T?. There are two cases to consider:
1. ? is a fork. By Lemma 56, the deciding process of ? is correct.
2. ? is a hook. Assume (without loss of generality) that 5 is the pivot of ?,
So = 5' (p', ?t, d') is the 0-valent leaf of ? and Si = 5' (p, m, d) (p', m', d')
76
As'
$ 5 (bivalent)
So = 5 ?,rn1,d1)Q
(0-valent)
1 =			p,m,
I			(1-valent)
So E
Si E
Figure 4.7: Lemma 54
is the 1-valent leaf of ? (see Figure 4.7). There are two cases:
(a) p = p'. By Lemma 56, P is correct.
(b) p # PI. Suppose, for contradiction, that p crashes. By Lemma 46, there
is a schedule E containing only steps of correct processes (and hence no
step of p) such that:
i. 5o E is a vertex of T? and all correct processes have decided in
So E(I?). Since 5o is 0-valent, all correct processes must have
decided 0 in 5o
ii. if E is applicable to 51(li) then Si E is a vertex of T?
Let 5' = 5 (p, m, d) be the parent of Si. The state of every process
other than P is the same in 5(li) and 5I(fi) Furthermore, any message
addressed to a process other than p that is in the message buffer in 5(Ji)
77
is still in the message buffer in 5!(Ji), Therefore, since S0 S'(p', m', d')
and Si = 5'. (p', m', d'), the state of every process other than p is the
same in 50(li) and 51(Ji), In addition, any message addressed to a
process other than p that is in the message buffer in 50(li) is also in the
message buffer in 51(li), Since E is applicable to 50(1i) and does not
contain any steps of p, an easy induction on the number of steps in E
shows that: (a) P is also applicable to 51(li), and (b) the state of every
process other than p is the same in So E(Ii) and Si E(Ii). By (ii),
(a) implies that Si E is a vertex of T?. By (b), all correct processes
decide 0 in Si .E(Ii). Thus Si, receives a tag off). But Si is 1-valent--Ha
contradiction,			El
There may be several critical indices and several decision gadgets in the simulation
forest. Thus, Lemmata 55 and 57 may identify many correct processes. Our
selection rule will choose one of these, as the failure detector ? requires, as follows.
It first determines the smallest critical index i. If i is monovalent critical, it selects
p?. If, on the other hand, i is bivalent critical, it chooses the "smallest" decision
gadget in T? according to some encoding of gadgets, and selects the corresponding
deciding process. It is easy to encode finite graphs as natural numbers. Since a
decision gadget is just a finite graph, the selection rule can use any such encoding.
The whole method of selecting a correct process is shown in Figure 4.8.
Theorem 58: The algorithm in Figure 4.8 selects a correct process.
PROOF: By Lemma 53, there is a critical index ?, 0 ? i ? n. If i is monovalent
critical, Line 2 returns p? which, by Lemma 55, is correct. If i is bivalent critical,
by Lemma 54, T? contains at least one decision gadget. Let ? be the decision
gadget in T? with the smallest encoding. By Lemma 57, the deciding process of ?
is correct in F. Thus, Line 3 returns the identity of a process that is correct. El
78
fBuzld and tag szmulation forest Y induced b? C?
for ? & ....... ,
Ti			simulation tree induced by G and I?
for every vertex 5 of T?
if 5 has a descendent 5' such that a correct process has decided k in 5t(Ji)
then add tag k to 5
f5elect a process from tagged simulation forest T?
smallest critical index
if i is monovalent critical then return p?
else return deciding process of the smallest decision gadget in T?
Figure 4,8: Selecting a correct process
(1)
(2)
(3)
4.5.5 The reduction algorithm TD???
The selection of a correct process described in Figure 4.8 is not yet the distributed
algorithm TD??? that we are seeking: it involves an infinite simulation forest and is
"centralized". To turn it into a distributed algoritlim, we will modify it as follows.
Each process will cooperate with other processes to construct ever increasing fi
nite approximations of the simulation forest. Such approximations will eventually
contain the decision gadget and the other tagging information necessary to extract
the identity of the same correct process chosen by the selection method in Figure
4.8.
Note that the selection method in Figure 4.8 involves three stages: The con-
struction of G, a graph representing samples of failure detector values and their
temporal relationship; the construction and tagging of the simulation forest in-
duced by G; and, finally, the selection of a correct process using this forest.
Algorithm TD?? consists of two components. In the first component, each
process repeatedly queries its failure detector module and sends the failure detector
values it sees to the other processes. This component enables correct processes to
construct ever increasing finite approximations of the same G. Since all inter-
process communication occurs in this component, we call it the communication
79
fBuild the directed ac?c1ic graph U?1
Up empty graph
repeat forever
RECEIVE PHASE:
p receives m
FAILURE DETECTOR QUERY PHASE:
dp query failure detector D
SEND PHASE:
if m is of the form (q, Uq,p) then
Up & Up U Uq
add [p, dpj to Up and edges from all other vertices of Up to
0UtpUtp computation component fFigure 4.10J
p sends (p,Up,q) to all q ? II
Figure 4.9: Process p's communication component
(1)
(2)
(3)
(4)
component of TD???.
In the second component, each process repeatedly (a) constructs and tags the
simulation forest induced by its current approximation of U, and (b) selects the
identity of a process using its current simulation forest. Since this component does
not require any communication, we call it the computation component of TD??.
The communication component
In this component processes cooperate to construct ever increasing approximations
of the same U. Let Up denote p's current approximation of U. Roughly speaking,
each process p periodically performs the following two tasks: (i) If p receives Uq
for some q, it incorporates this information by replacing Up with the union of Up
and Uq. (ii) Process p queries its own failure detector module. Let d be the value
that it sees and [pt, d!j be any vertex currently in Up. Clearly, p saw d after p'
sawd'. Thuspadds[p,d]toUp, with edges from all other vertices of Up to [p,d].
Process p then sends its updated Up to all other processes. The communication
component of TD??? for p is shown in Figure 4.9.
Let Up(t) denote the value of Up at time t. If p takes a step at time t, Up(t)
80
denotes the value of Gp at the end of that step. The next two lemmata establish
certain useful properties of the graphs constructed by the communication compo-
nent. In reading the proofs of these results it will be useful to keep in mind that in
our model the three phases of a step receive, failure detection query, and send
occur atomically at a single time t.
Lemma 59: Let v be a vertex contained in some local graph during the execution
of the communication component. Let Cp(t) be the first graph that contains v.
(That is, v is in Gp(t), but not in Cq(t'), for any process q and time t' ? t.) Then
1. v = [p,dj, and p saw d at time t.
2. If v H v is an edge contained in some local graph during the execution of
the communication component then v			v is contained in Cp(t).
3. Up(t) is a subgraph of any graph that contains v.
PROOF: 1. Process p adds v into Cp(t) in Line (1) or (2). In the latter case, the
result follows immediately. In the former case, p must have received a message at
time t with a graph that contains v. The process that sent that message must have
therefore had v in its graph before time t, contradicting the choice of Cp(t) as the
first graph to contain v.
2. Consider the earliest time t' when the edge v v was added to some graph,
say of process q. By definition oft, t' > t. If t' > t, at time t' process p' receives a
message that contains a graph with the edge v v. The sender of that message
had a graph that contained the edge v H v at some time before t', contrary to the
choice of t'. Therefore it must be that t' = t. Then, by Part (1), q = p and so
v v is in Gp(t), as wanted.
3. Suppose, for contradiction, that some graph contains v but is not a supergraph
of Gp(t). Choose the first such graph, say, &q(t'). By definition of t, t' > t.
Clearly, q # p because p never removes any vertices or edges from its own graph.
81
Therefore, at time t1 process q receives a message with a graph that contains v but
is not a supergraph of Gp(t). The sender of that message must have had a graph
that contains v but is not a supergraph of Cp(t) before time jI, contrary to the
choice of Cq(t').
Recall that we are considering a fixed run of Tv?+?, with failure pattern F, and
failure detector history HD ? D(F). We now prove that the graphs constructed
by the communication component of TD?+? satisfy certain properties. The reader
should note the similarity between the first four and the four properties of the
graphs defined in Section 4.5.1.
Lemma 60: For any correct process p and time t:
1. The vertices of Cp(t) are of the form [p',d'] where p' c II and d' ? ?D If
[p', d'j is a vertex of Gp(t), then there is a time t' such that p' ? F(t') and
= HD(P1,[1).
2. If [qi,dij [q2,d2] is an edge of Cp(?) and d1 = H?(qi,ti) and d2 =
H?(q?,t?) then j1 ?
3. Cp(t) is transitively closed.
4. There is a time tI > t and a failure detector value d such that for all vertices
[p', d'j of Gp(t), [p',d'j [p, d] is an edge of Cp(t1).
5. Cp(t) is a subgraph of Cp(t'), for all j' > t.
6. For all correct q, there is a time t' > t such that Gp(t) is a subgraph of Gq(t').
PROOF:
Property 1 : Consider the first graph that contains the vertex [p',d']. By Lemma
59(1), this graph is Gp1(t1) for some time jI, and p1 saw d' at time t'. This
means that P1 ? F(t1) (otherwise P1 would not have taken a step at time t1
and would not have seen d1), and d1 = HD(P1, t1), as wanted.
82
Property 2 : By Lemma 59(2), [qi, d1] [q?, d2] is an edge of Gq2(t2). Let t' be
the time when q? inserted vertex [qi, d1] into Gq2. Of course, t' ? t2. There
are two cases:
1. t' ? t2. By Lemma 59(1), [qi,dij was not in any graph before time ?i.
Thus, [i ? t' and from the hypothesis of this case, t1 ? t2.
2. t' = t. Then q? received a graph containing [qi, d1] at t2. Let t11 be the
time when this graph was sent. Of course, t" < t2. By Lemma 59(1),
[qi, d1] was not in any graph before ti, and therefore ti ? t.". Thus,
ti <t2.
Property 3 : Let [qi, d1] H ... H [q?, dk] be a path in &p(t). We must show that
there is an edge [qi, d1] [qk, dk] in Cp(t).
Let tj be the time when q? inserted [q?, di] in C??, for 1 ? i ? k. By induction
on i we show that [qi,dij .. [q?,d?] is a path in Cq2(ti). The basis,
z = 1, is trivial. For the induction step, suppose that [qi, d1] H ...
[q??i,di?i] is a path in Gqj?1(ti?i). Since [qi?i,d??i] [q?,di] is an edge
in Gp(t), by Lemma 59(2), it is also an edge in Uqj(ti). Since [qi--Hi, di--Hil is
a vertex in Gq?(ti), by Lemma 59(3), Cqj?1(ti?i) is a subgraph of Gqj(ti).
In particular, Gqj(ti) contains the path [qi,d1] H [q??i,d??1]. Thus,
[qi, di] H H [q?, di] is a path in Cqj(ti), as wanted.
Therefore, the vertices [qi, d1],.. . , [qk, dk] are all in Cq?(t?). At time tk, qk
adds an edge from every other vertex to [q?, dk]. Thus, the edge [qi, d1]
[q?,d?] is in Cq?(t?). By Lemma 59(3), Gq?(tk) is a subgraph of Cp(t) (since
the latter contains [qk, dk]). Therefore, [qi, d1] [qk, dkl is in Cp(t), as
wanted.
Property 5 : Once a vertex or edge is added to Cp it is not removed.
83
Property 4 : Since p is correct, it takes a step at some time t' after t. In the
failure detector query phase of this step, p queries its failure detector module
and obtains a value, say d. In Line 2 of this step, p adds the vertex [p,d] to
Up and an edge from all other vertices of Up(t') to [p, d]. From Property 5,
Up(t) is a subgraph of Gp(t'), hence the result follows.
Property 6 : Since p is correct, it eventually sends Cp(t) to all processes, includ-
ing q (this occurs in p's first execution of Line 4 after time t). Since q is
correct, it eventually receives Gp(t), and then replaces Gq with Uq U Gp(t),
say at time t'. So, Up(t) is a subgraph of Cq(t').
Property 5 of the above lemma allows us to define = UtEr Cp(t). From
Property 6, we get:
Lemma 61: For any correct processes p and q, Cp? =
PROOF: Let 0 be any vertex or edge of Cp?, i.e., there is a time t at which 0 is
in Gp(t). From Lemma 60 (6), there is a time t' such that Up(t) is a subgraph
of Gq(t'). Thus 0 is in Gq?. Thus Cp? is a subgraph of 0q? By a symmetric
argument, Uq? is a subgraph of Gp?, hence = Q?.
Lemma 61 allows us to define the 1?mit graph G to be Cp? for any correct process
p. The first four properties of Lemma 60 immediately imply:
Lemma 62: The limit graph U satisfies the four properties of the DAG defined
in Section 4.5.1.
As before, T? denotes the tagged simulation tree induced by U and initial config-
uration J?, and T denotes the tagged simulation forest (T0, ..... . , Tn1.
84
The computation component
Since the limit graph C has the four properties of the DAG, we can apply the
"centralized" selection method of Figure 4.8 to identify a correct process. This
method involved:
o+ Constructing and tagging the infinite simulation forest T induced by G.
o+ Applying a rule to T to select a particular correct process p*
In the computation component of TD??, each p approximates the above method
by repeatedly:
o+ Constructing and tagging the finite simulation forest Tp induced by Gp, its
present finite approximation of G.
o+ Applying the same rule to T to select a particular process.
Since the limit of Tp over time is T, and the information necessary to select p*
is in a finite subgraph of T, we can show that eventually p will keep selecting the
correct process p*, forever.
Actually, p cannot quite use the tagging method of Figure 4,8: that method
requires knowing which processes are correct! Instead, p assigns tag k to a vertex 5
in Tp? if and only if 5 has a descendent 5' such that p itself has decided k in 51(1i).
If p is correct, this is eventually equivalent to the tagging method of Figure 4.8. If
p crashes, we do not care how it tags its forest, Also, p cannot use exactly the same
selection method as that of Figure 4.8: its current simulation forest Tp may not yet
have a critical index or contain any decision gadget (although it eventually will!).
In that case, p temporizes by just selecting itself. The computation component
of TD?? is shown in Figure 4.10 (compare it with the selection method of Figure
4.8).
We first show that Tp, the simulation forest that p constructs, is indeed an
increasingly accurate approximation of T (Lemma 63). We then show that the
85
IBuild and tag simulation forest Yp induced by Upi
for i 0,1,... ,
Tp? & simulation tree induced by Up and I?
for every vertex 5 of
if 5 has a descendent ?t such that p has decided k in 5f(Ji)
then add tag k to 5
?5elect a processfrom tagged simulation forest Tpl
if there is no critical index then return p
else
smallest critical index			(1)
if i is monovalent critical then return p?			(2)
else if has no decision gadgets then return p
else return deciding process of the smallest decision gadget in Tp?			(3)
Figure 4.10: Process p's computation component
tags that p gives to any vertex 5 in Tp are eventually the same ones that the
tagging rule of Figure 4.8 gives to 5 in T (Lemma 64). Let Yp(t) denote Tp at
time t, i.e., Tp(t) is the finite simulation forest induced by Gp(t).
Lemma 63: For any correct p and any time t:
1. Tp(t) is a subgraph7 of T
2. Tp(t) is a subgraph of Tp(t'), for all t' > t.
3. tAr Yp(t) T
PROOF:
Property 1 : Let 5 be any vertex of tree Tp?(t) (for some?, 0 ? i ? n). From the
definition of Tp?(t), 5 is compatible with some path g of Cp(t) and applicable
to I?. Since Gp(t) is a subgraph of U, g is also a path of U. Thus, 5 is
compatible with U; since it is also applicable to I?, it is a vertex of T?
7The subgraph and graph equality relations ignore the tags.
86
Similarly, let 5 5' be an edge e of Tp?(t). Since 5 and 5' are also vertices
of T?, and 5' = 5 e, 5 5' is also an edge of T?
Property 2 : Follows from Lemma 60 (5).
Property 3 : We first show that T is a subgraph of UtET Yp(?). Let 5 be any
vertex of any tree T? of T. From the definition of T?, 5 is compatible with
some finite path g of G and is applicable to I?. Since C = UtEr Cp(t) and
g is a finite path of G, there is a time t such that 9 is also a path of Gp(t).
Since 5 is compatible with g of Gp(t) and is applicable to I?, 5 is a vertex
of Tp?(t).
Let 5 H 5' be any edge e of T?. By the argument above, there is a time t
after which both 5 and 5' are vertices of Yp?. Since 5' = 5 e, after time I
the edge e is also in Tp? Thus, every vertex and every edge of T is also in
UtET Tp(1), i.e., T is a subgraph of UtET Tp(1).
By Property 1, UtET Tp(I) =
[z
Lemma 64: Let p be any correct process, and 5 be any vertex of Tp. There is a
time after which the tags of 5 in Tp are the same as the tags of 5 in T
PROOF: Suppose that at some time I, p assigns tag k to vertex 5 of tree Tp? This
means that 5 has a descendent 5' in T\)(t) such that p has decided k in 5t(1i). By
Lemma 63(1), 5' is also a descendent of 5 in T?, and since p is correct, 5 has tag
k in T? as well.
Conversely, suppose a vertex 5 of a tree T? of T has tag k. We show that,
eventually, p also assigns tag k to 5 in Tp?. Since 5 has tag k in T?, 5 has a
descendent 5' in T? such that some correct process has decided k in 5t(1i) (cf.
tagging rule in Figure 4.8). By Lemma 46(1), there is a descendent 5" of 5'
in T?, such that all correct processes, including p, have decided in 511(1i) By
Lemma 41, 5'1(1i) is a configuration of a partial run of Consensu?. By the
87
Agreement property of Consensus, p must have decided k in 51t(1i) Consider the
path that starts from the root of T? and goes to vertex S and then to 5". By
Lemma 63(3), there is a time t after which this path is also in Yp? Therefore, when
p executes the tagging rule of Figure 4.10 after time t, p assigns tag k to 5 in Yp?
(because p has decided k in 511(fi), and 5" is a descendent of 5 in Tp?) E]
Recall that p* is the correct process obtained by applying the selection rule of
Figure 4.8 to the infinite simulation forest T. We now show that there is a time after
which any correct p always selects p* when it applies the corresponding selection
rule of Figure 4.10 to its own finite approximation of the simulation forest Tp.
Roughly speaking, the reason is as follows. By Lemma 64, there is a time t after
which the tags of all the roots in p's forest Tp are the same as in the infinite
forest T. Since these tags determine the sets of monovalent and bivalent critical
indices, after time ? these sets according to p are the same as in T. Let ? be the
minimum critical index in these sets, and consider the situation after time t. If
i is monovalent critical, the selection rule of Figure 4.10 selects p?, which is what
p* is in this case. If i is bivalent critical, then p selects the deciding process of
its current minimum decision gadget of Tp? (if it has one). This case is examined
below.
Let ?? be the minimum decision gadget of T? (so, p* is the deciding process of
?*) For a while, ?? may not be the minimum decision gadget of T\). This may be
because ?? (and its tags) is not yet in Tp? However, by Lemmata 63(3) and 64,
?? (including its tags) will eventually be in T? Alternatively, it may be because
Yp? contains a subgraph ? whose encoding is smaller than ?*??, and for a while
? looks like a decision gadget according to its present tags. However, by Lemma
64, p will eventually determine all the tags of ?, and discover that ? is not really
a decision gadget. Since there are only finitely many graphs whose encoding is
smaller than ?*??, p will eventually discard all the "fake" decision gadgets (like ?)
that are smaller than ?y*, and then select ?? as its minimum decision gadget. After
88
that time, p always selects the deciding process of ?? which is precisely p*, in
this case.
Theorem 65: For all correct processes p, there is a tiine after which OUtPUtp =
forever.
PROOF: Let i* denote the critical index selected by Line 1 of Figure 4.8 applied
to T. By Lemma 64, there is a time tinit after which every root of Yp has the
same tags as the corresponding root of T. Thus after time tinit, p always sets
?= j*in Line 1 of Figure 4.10. We now show that there is a time after which the
computation component of p (Figure 4.10) always return p*. There are two cases:
1. i* is monovalent critical. In this case, p* is process ?j* (by Line 2 of the
selection rule Figure 4.8). Similarly, after time t???t: (a) p always sets i to ?*
(Line 1 of Figure 4.10); (b) p always returns ?j* (Line 2 of Figure 4.10).
2. i* is bivalent critical. Let ?? denote the smallest decision gadget of T? . In
this case, p* is the deciding process of ??. Since ?? is a finite subgraph of
T?*, by Lemma 63(3), there is a time after which ?? is also a subgraph of Tp?
By Lemma 64, there is a time t?* after which all the (finitely many) vertices
of ?? receive the same tags in T?* and ?;. Thus after time t?*, ?? is a also
decision gadget of
Since each graph is encoded as a unique natural number, there are finitely
many graphs with a smaller encoding than ??. Let q denote the set of graphs
with a smaller encoding than ??, and ? be any graph in ?. We show that
there is a time after which ? is not a decision gadget of ?;. There are two
cases:
(a) ? is not a subgraph of T?*. In this case, by Lemma 63(1), ? is never a
subgraph of Tp?*
89
(b) ? is a subgraph of T?*. Since ?? is the smallest decision gadget of T?*
and ? is smaller than ??, ? is not a decision gadget 0f??* By Lemma 64,
there is a time t? after which all the (finitely many) vertices of ? have
the same tags in T?* and Thus after time t?, ? is not a decision
gadget of
Since ? is finite, there is a time t? after which no graph in g is a decision
gadget of
Consider the process that is returned by the computation component of p
(Figure 4.10) at any time t > max(t???t,t?*,tg). Since t > tinit, P always sets
i to i* in Line 1. Since t > t?*, ?? is a decision gadget of Tp?(t) Finally, since
t > t?, ?? is the smallest decision gadget of Tp?(t). Thus, since i* is bivalent,
at any time after max(t???t, t?*, tg), Line 3 of Figure 4.10 returns the decid-
ing process of ?y* Therefore, after time max(t???t, t?*, tg), the computation
component of p always returns p*
From the above, there is a time after which p sets OUtPUtp H p* forever, in Line 3
of Figure 4.9.
We now have all the pieces needed to prove our main result, Theorem 36 in Sec-
tion 4.4:
Theorem 36: For all environments ?, if a failure detector D can be used to solve
Consensus in ?, then D ??
PROOF: Consider the execution of algorithm TD?+? in any environment ?. By
Theorem 65, there is a time after which all correct processes set output?
forever. By Theorem 58, p* is a correct process. Thus, TD?? is a reduction
algorithm that transforms D into Q. In other words, Q is reducible to D.
90
4.6 Discussion
4.6.1 Granularity of atomic actions
Our model incorporates very strong assumptions about the atomicity of steps.
First, the three phases of each step are assumed to occur indivisibly, and at a
single time. In particular, the failure of a process cannot happen in the "middle
of a step". This allows us to associate a single time t with a step and think of
the step as occurring at that time. Second, in the send phase of a step a message
is sent to all processes. Given that the entire step is indivisible, this means that
either all or none of the correct processes eventually receive the message sent in a
step. Finally, no two steps can occur at the same time.5 These assumptions are
convenient because they make the formal model simpler to describe. Also, they
are consistent with those made in the model of [FLP85j that provided the impetus
for this work.
On the other hand, in Chapter 3 a model with weaker properties is used. There,
the three phases of a step need not occur indivisibly, and may occur at different
times. Even within the send phase, the messages sent to the different processes
may be sent at different times. Thus, a failure may occur in the middle of the send
phase, resulting in some correct processes eventually receiving the messages sent
to them in that step while others never do. Also, actions of different processes
may take place simultaneously, subject to the restriction that a message can only
be received strictly after it was sent. Since Chapter 3 is mainly concerned with
showing how to use various types of failure detectors to achieve Consensus, the
use of a weaker model strengthens the results. (In fact, the negative results of
the Appendix hold even in the modd of this Chapter, with the stronger atomicity
assumptions.)
The question naturally arises whether our result also applies to this weaker
8This is reflected in our formal model by the fact that the list of times in a run (which indicate
when the events in the run's schedule occur) is increasing.
91
model. In other words, if a failure detector D can be used to solve Consensus in
the weak model, is it true that we can transform D to ?W in that rnodei? It
turns out that the answer is affirmative. To see this, first note that if D solves
Consensus in the weak model then it surely solves Consensus in the strong model.
By our result, D can be transformed to ?W in the strong model. It remains to
show that D can be transformed to ?W in the weak model. This is not obvious,
since it is conceivable that the extra properties of the strong model are crucial in
the transformation of D to ?14'. Fortunately, the transformation presented in this
chapter actually works even in the weak model!
To see this, it is sufficient to make sure that the communication component of
the transformation (cf. Figure 4.9 in Section 4.5.5) constructs graphs that satisfy
the properties listed in Lemma 60, even if we run it in the weak model. It is not
difficult to verify that this is indeed so. The proof is virtually the same, except for
the fact that we must distinguish the time tin which a process p queries its failure
detector and the time t'in which p adds the value it saw into O?. In our proof we
assume that t = t'; in the weak model we would have t ? t'. Similar comments
apply to all actions within a step that are no longer assumed to occur at the same
instant of time. These changes make the proofs slightly more cumbersome, since
we must introduce notation for all the different times in which relevant actions
within a step take place, but the reasoning remains essentially the same.9
Thus, our result is not merely a fortuitous consequence of some whimsical
choice of model. We view the robustness of the result across different models of
asynchrony as further testimony to the significance of the failure detector ?)`V.
9Another problem that must be confronted is that in the proofs of Lemmata 59 and 60 we
often refer to the "first graph" in which a vertex or edge is present. In the strong model there is
no difficulty with this, since processes cannot execute steps simultaneously. In the weak model,
we have to justify that it makes sense to speak of the "first" graph to contain a vertex or edge, in
spite of the fact that certain actions can be executed at the same time. The fact that a message
can be received only after it was sent is needed here.
92
4.6.2 Weak Consensus
[FLP85j actually showed that even the Weak Consensus problem cannot be solved
(deterministically) in an asynchronous system. Weak Consensus is like Consensus
except that the validity property is replaced by the following, weaker, property:
Non-triviality: There is a run of the protocol in which correct processes decide 0,
and a run in which correct processes decide 1.
Unlike validity, this property does not explicitly prescribe conditions under which
the correct processes must decide 0 or 1 it merely states that it is possible for
them to reach each of these decisions. It is natural to ask whether our result holds
for this weaker problem as well. That is, we would like to know if the following
holds:
Theorem: For all environments ?, if a failure detector D can be used to solve
Weak Consensus in ?, then D ?E
Under the above definition of non-triviality, this is not quite right. But as we shall
argue, the problem really lies with the definition! Under a slightly stronger defini-
tion, which is more appropriate for our model that incorporates failure detectors,
the theorem is actually true.
Intuitively, the problem with the above definition of non-triviality in our model
of failure detectors is that it is possible for the decision of correct processes to
depend entirely on the the values returned by the failure detector. Consider, for
example, a failure detector D so that for each failure pattern F, D(F) ?H0, H1?,
where for all processes p and times t, and for all i F fO, l?, Hj(p, t) i. In other
words, in any given run, this failure detector returns the same binary value to all
processes at all times, independent of the run's failure pattern. It is trivial to use
this failure detector to solve Weak Consensus: A process merely queries its failure
detector and decides the value returned! It is easy to see that D $? ?, for any
93
environment t: D provides absolutely no information about which processes are
correct or faulty.10
At this point, the reader may justifiably object that D is "cheating" it is
really not a failure detector, but a mechanism that non-deterministically chooses
the decision value. One possible way of fixing this problem would be to make our
definition of failure detector less general than it presently is. We could then try to
prove the theorem for this restricted definition of failure detectors. This approach,
however, is fraught with the danger of restricting the definition too much and ruling
out legitimate failure detectors in addition to bogus ones, like D. Intuitively, the
failure detector is supposed to provide some information about faulty processes.
As this information may be encoded in a complex way, we should not arbitrarily
rule out such encodings because, in doing so, we may be inadvertently ruling out
useful failure detectors.
Instead of modifying our definition of failure detector, we strengthen the non-
triviality property to require that the failure detector values seen by the processes
do not, by themselves, determine the decision value. To formalise this, let R be a
run of a Consensus algorithm, and (pi,rni,d1),(p2,m2,d2),..., be the schedule of
I?. We denote by fd(J?) the sequence [pi,d1][p2,d2j..., i.e., the sequence of failure
detector values seen by the processes in R. Consider the relation on runs, where
R R' if and only if fd(R) fd(J?'). It is immediate that is an equivalence
relation. We now redefine the non-triviality property in our model (where processes
have access to failure detectors) as follows:
Non?trivia1ity: In every equivalence class of the relation --H, there is a run of
the protocol in which correct processes decide 0, and a run in which correct
processes decide 1.
This captures the idea that the decision value cannot be ascertained merely on the
basis of the failure detector values seen by the processes. It must also depend on
10In fact, D cannot be used to solve Consensus.
94
other aspects of the run (such as the initial values, the particular messages sent,
or other features).
If we define Weak Consensus using this version of non-triviality, then the The-
orem stated above is, in fact, true. We briefly sketch the modifications of our proof
needed to obtain this strengthening of Theorem 36. The only use of the validity
property is in the proof of Lemma 52 which states that the root of T0 is 0-valent
and the root of T? is 1-valent. This, in turn, is used in the proof of Lemma 53,
which states that a critical index exists.
To prove the stronger theorem, we concentrate on the forest induced by all
initial configurations not just ?0,,,,, 1n, Thus, the forest now will have 2?
trees, rather than only n + 1. Consider the n initial values of processes in an initial
configuration as an n-bit vector, and fix any n-bit Gray code.11 Let 10,..., I2?--H1
be the initial configurations listed in the order specified by the Gray code, and T?
be the tree Tk, for all i Ei .0...., 2? --H 1J. We use the same definition for a critical
index as we had before: Index i c .0...', 2? --H 1? is critical if the root of T? is
bivalent or the root of T? is 1-valent while the root of yi--H1 is 0-valent. The only
difference is that we now take subtraction to be modulo 2?, so that when i 0,
i --H 1 = --H1 = 2? --H 1. We can now prove an analogue to Lemma 53.
Lemma: There is a critical index i, 0 < i ? 2? --H 1.
PROOF: First, we claim that that the forest contains both nodes tagged 0 and
nodes tagged 1. To see this, let S be a node in some tree of the forest. By
Lemma 47, S has a tag; without loss of generality, assume that S has tag 0.
Consider an infinite path that extends S. By Lemma 42 and the fact that S is
tagged 0, there is a run R of the Weak Consensus algorithm in which a correct
process decides 0. By non-triviality, there is a run 1?' = R so that correct processes
decide 1 in R'. Let S' be the infinite schedule and 1? be the initial configuration of
11An n-bit Gray code is a sequence of all possible n-bit vectors where successive vectors, as
well as the first and last vectors, differ only in the value of one position. It is well?known that
such codes exist for all n > 1.
95
run i?'. Using the definition of the =--H relation and the construction of the induced
forest, it is easy to show that every finite prefix of S' is a node of Te. Since correct
processes decide 1 in I?', all these nodes are tagged 1.
Since there are both nodes tagged 0 and nodes tagged 1, by Lemma 49, there are
both roots tagged 0 and roots tagged 1. If the root of some T? is tagged both 0 and
1, it is bivalent and we are done. Otherwise, the roots of all trees are monovalent,
and there are both 0- and 1-valent roots. Thus, there exist 0 ? i,j < 2? --H 1 so
that the root of T? is 0-valent and the root of T? is 1-valent. By considering the
sequence T?, yi+1,?? , Ti, (where addition is modulo 2n) it is easy to see that the
root of some yk, k # i, that appears in that sequence is 1-valent, while the root
of yk--Hl is 0-valent. By definition, k is a critical index.
The rest of the proof remains unchanged.
4.6.3 Failure detectors with infinite range of output
values
The failure detectors in [RB9lJ and Chapter 3 only output lists of processes sus-
pected to have crashed. Since the set of processes is finite, the range of possible
output values of these failure detectors is also finite. In this chapter our model
allows for failure detectors with arbitrary ranges of output values, including the
possibility of infinite ranges! We illustrate the significance of this generality by de-
scribing a natural class of failure detectors whose range of output values is infinite
(though each value output is finite).
Example: One apparent weakness with our formulation of failure detection is that
a brief change in the value output by a failure detector module may go unnoticed.
For example, process p's module of the given failure detector, D, may output d1
at time t1, d2 at a later time t2 and d1 again at time ?? after t2. If due to the
asynchrony of the system p does not take a step between time t1 and t3, p may
never notice that its failure detector module briefly output d2. A natural way
96
of overcoming this problem is to replace D with failure detector D' that has the
following property: D' maintains the same list of suspects as D but when queried,
D' returns the entire history of its list of suspects up to the present time. In this
manner, correct processes are guaranteed to notice every change in D"s list of
suspects. As the system continues executing, the values output by D' grow in size.
This means that D' has an infinite range of output values.
llowever, since D is a function of F, the failure pattern encountered, D' is also a
function of F, and can be described by our model. Thus, the result in this chapter
applies to D', a natural failure detector with infinite range of output values.
4.6.4 Open problems
In our model, each process "polls" its failure detector module each time it takes
a step. One can conceive of an alternative model in which the failure detector
module can "interrupt" a process; for example, it could issue an interrupt every
time there is a change in its list of suspects. We do not know if our results apply
to such a model.
In our model, the specification of a failure detector, (i.e., its allowable outputs)
depends only on failure patterns. In other words, the set of failure detector outputs
that are allowed for a given execution depends only on the identity of the processes
that crash and on the timing of these crashes in that execution.12 We have not
studied more general failure detectors whose specification also depends on other
aspects of executions, such as the timing and content of messages, the state of the
processes, etc. A major problem that needs to be resolved before this issue can be
pursued is to identify which aspects of an execution it would be "fair" to allow the
failure detector to depend on. Clearly, the failure detector cannot be allowed to
depend on the entire state of the system, for then it would be too powerful and,
therefore, of no practical value, as it would be impossible to implement it.
`2Recall that we allow the implementation of a failure detector to use any other aspect of the
given execution, such as the local time and/or order in which messages are sent and received.
97
In this thesis we have focused on failure detectors for systems with crash failures
only. Extending our results to other types of failures, such as ornission (cf. [MSF87,
Had84,PT86]) and arbitrary failures (cf. [PSL8D]), remains a goal for future work.
We showed that ?VY is weaker than any failure detector that can be used to
solve Consensus. Since ?W can be used to solve Consensus in any environment
in which n > 2J, it is the weakest failure detector for solving Consensus in such
an environment. In the Appendix we show that ?W cannot be used to solve
Consensus in environments in which n < 2f. It is still not known (a) whether
a weakest failure detector for solving Consensus in such environments exists, and
(b) provided it does, what its properties are.
Chapter 5
Related work
5.1 Partial synchrony
Fischer, Lynch and Paterson showed that Consensus cannot be solved in an asyn-
chronous system subject to crash failures [FLP85]. The fundamental reason why
Consensus cannot be solved in completely asynchronous systems is the fact that,
in such systems, it is impossible to reliably distinguish a process that has crashed
from one that is merely very slow. In other words, Consensus is unsolvable be-
cause accurate failure detection is impossible. On the other hand, it is well-known
that Consensus is solvable (deterministically) in completely synchronous systems
that is, systems where clocks are perfectly synchronised, all processes take steps
at the same rate and each message arrives at its destination a fixed and known
amount of time after it is sent. In such a system we can use timeouts to imple-
ment a "perfect" failure detector i.e., one in which no process is ever wrongly
suspected, and every faulty process is eventually suspected. Thus, the ability to
solve Consensus in a given system is intimately related to the failure detection
capabilities of that system. This realisation led us to augment the asynchronous
model of computation with unreliable failure detectors as described in this thesis.
A different tack on circumventing the unsolvability of Consensus is pursued in
[DDS87J and [DLS88]. The approach of those papers is based on the observation
98
99
that between the completely synchronous and completely asynchronous models
of distributed systems there lie a variety of intermediate "partially synchronous"
models.
In particular, [DDS87] defines a space of 32 models by considering five key pa-
rameters, each of which admits a "favourable" and an "unfavourable" setting. For
instance, one of the parameters is whether the maximum message delay is bounded
and known (favourable setting) or unbounded (unfavourable setting). Each of the
32 models corresponds to a particular setting of the 5 parameters. [DDS87] iden-
tifies four "minimal" models in which Consensus is solvable. These are minimal
in the sense that the weakening of any parameter from favourable to unfavourable
would yield a model of partial synchrony where Consensus is unsolvable. Thus,
within the space of the models considered, [DDS87] delineates precisely the bound-
ary between solvability and unsolvability of Consensus, and provides an answer to
the question "What is the least amount of synchrony sufficient to solve Consen-
sus?"
[DLS88] considers the following two models of partial synchrony. The first
model assumes that there are bounds on relative process speeds and on message
transmission times, but these bounds are not known. The second model assumes
that these bounds are known, but they hold only after some unknown time.
In each one of these two models (with crash failures), it is easy to implement
an Eventually Perfect Failure Detector ?P. In fact, we can implement ?? in an
even weaker model of partial synchrony: one in which there are bounds on message
transmission times and relative process speeds, but these bounds are not known
and they hold only after some unknown time. Since ?? is stronger than ??4',
by Corollaries 20 and 32, this implementation immediately gives Consensus and
Atomic Broadcast solutions for this model of partial synchrony and, a fortiori, for
the two models of [DL888]. The implementation of ?? is given in Figure 5.1, and
proven below.
loo
Every process p executes the following:
0UtPUtp
for all q ? II ???(q) denotes the du?tion of p `s time-out interva? for qi
Ap(q) H default time-out interval
cobegin
Task 1: repeat periodically
send "p-is-alive" message to all
Task 2: repeat periodically
for all q E II
if q ? OUtPUtp and p did not receive "q-is-alive" in the last Ap(q) seconds
0UtPutp H 0fltPfltp U ?q?
t? times-out on q: it now suspects q has crashedi
Task 3: when receive "q-is-alive" for some q
if q ? output? t? knows that it prematurely timed-out on
output? H output? --H ?q? p repents on q, and?
Ap(q)			??(q) + 1			?2 p increases its time-out period for q1
coend
Figure 5.1: A time-out based implementation of ?? in some models of partial
synclirony.
101
Each process p periodically sends a "p-is-alive" message to all the processes. If
p does not receive a "q-is-alive" message from some process q for A?(q) units of
time, p adds q to its list of suspects. If p receives "q-is-alive" from some process q
that it currently suspects, p knows that its previous time-out on q was premature.
In this case, p removes q from its list of suspects and increases the length of the
time-out.
Theorem 66: Consider a system in which, after some time t, some bounds on
relative process speeds and on message transmission times hold (we do not assume
that t or the value of these bounds are known). The algorithm in Figure 5.1 im-
plements an Eventually Perfect Failure Detector ?? in this system.
PROOF: (sketch) We first show that strong completeness holds, i.e., eventually
every process that crashes is permanently suspected by every correct process. Sup-
pose a process q crashes. Clearly, q eventually stops sending "q-is-alive" messages,
and there is a time after which no correct process receives such a message. Thus,
there is a time t' after which: (1) all correct processes time-out on q (Task 2), and
(2) they do not receive any message from qafter this time-out. From the algorithm,
it is clear that after time t', all correct processes will permanently suspect q. Thus,
strong completeness is satisfied.
We now show that eventual strong accuracy is satisfied. That is, for any correct
processes p and q, there is a time after which p will not suspect q. There are two
possible cases:
1. Process p times-out on q finitely often (in Task 2). Since q is correct and
keeps sending "q-is-alive" messages forever, eventually p receives one such
message after its last time-out on q. At this point, q is permanently removed
from p's list of suspects (Task 3).
2. Process p times-out on q infinitely often (in Task 2). Note that p times-out
on q (and so p adds q to OUtPUtp) only if q is not already in 0UtPUtp. Thus,
102
q is added to and removed from 0UtPUtp infinitely often, Process q is only
removed from QUtpUtp in Task 3, and every time this occurs the time-out
period A?(q) is increased. Since this occurs infinitely often, Ap(q) grows
unbounded. Thus, eventually (1) the bounds on relative process speeds and
on message transmission times hold, and (2) A?(q) is larger than the correct
time-out based on these bounds. After this point, p cannot time-out on q
any more a contradiction to our assumption that p times-out on q infinitely
often. Thus Case 2 cannot occur.			E]
Thus, failure detectors can be viewed as a more abstract and modular way of
incorporating partial synchrony assumptions into the model of computation. In-
stead of focusing on the operational j?atures of partial synchrony (such as the five
parameters considered in [DDS87J), we can consider the axiomatic properties that
failure detectors must have in order to solve Consensus. The problem of imple-
menting a given failure detector in a specific model of partial synchrony becomes
a separate issue; this separation affords greater modularity.
Studying failure detectors rather than various models of partial synchrony has
other advantages as well. By showing that Consensus is solvable using some specific
failure detector we thereby show that Consensus is solvable in all systems in which
that failure detector can be implemented. An algoritlim that relies on the axiomatic
properties of a given failure detector is more general, more modular, and simpler to
understand than one that relies directly on specific operational features of partial
synchrony (that can be used to implement the given failure detector).
From this more abstract point of view, the question "What is the least amount
of synchrony sufficient to solve Consensus?" translates to "What is the weakest
failure detector sufficient to solve Consensus?". In contrast to [DD587], which
identified a set of minimal models of partial synchrony in which Consensus is
solvable, we are able to exhibit a single minimum failure detector that can be used
to solve Consensus. The technical device that made this possible is the notion of
103
reduction between failure detectors. We suspect that a corresponding notion of
reduction between models of partial synchrony, although possible, would be more
complex. This is because there are models which are not comparable in general
(in the sense that there are tasks that are possible in one but not in the other and
vice versa), although they are comparable as far as failure detection is concerned
which is all that matters for solving Consensus! In this connection, it is useful
to recall our earlier observation, that the same failure detector can be implemented
in different (indeed, incomparable) models of partial synchrony.
5.2 The application of failure detection in
shared memory systems
Loui and Abu-Amara showed that in an asynchronous shared memory system with
atomic read/write registers, Consensus cannot be solved even if at most one process
may crash [LA87J. This raises the following natural question: can we circumvent
this impossibility result using unreliable failure detectors? In a recent work, Lo
shows that this is indeed possible [Lo93]. In particular, he shows that using a
Strong Failure Detector and atomic registers, one can solve Consensus for any
number of failures. He also shows that for systems with a majority of correct
processes, it is sufficient to use an Eventually Strong Failure Detector and atomic
registers.
5.3 The Isis toolkit
Isis is a programming toolkit for building fault-tolerant distributed systems [BJ87,
BCJ+90]. Although Isis employs the asynchronous model of distributed computing,
it also provides several powerful primitives, including Atomic Broadcast. Roughly
speaking, Isis has the following internal architecture. The lowest layer of Isis can
be modeled as an asynchronous system and a failure detector. Higher layers of Isis
104
use the lowest layer to implement several primitives including Atomic Broadcast.
In this section, we compare our model with the lowest layer of the Isis system.
?`fth our approach, even if a correct process p is repeatedly suspected to have
crashed by the other processes, it is still required to behave like every other correct
process in the system. For example, with Atomic Broadcast, p is still required to
A-deliver the same messages, in the same order, as all the other correct processes.
Furthermore, p is not prevented from A-broadcasting messages, and these messages
must eventually be A-delivered by all correct processes (including those processes
whose local failure detector modules permanently suspect p to have crashed). In
summary, application programs that use unreliable failure detection are aware that
the information they get from the failure detector may be incorrect: they only
take this information as an imperfect "hint" about which processes have really
crashed. Furthermore, processes are never discriminated against" if they are
falsely suspected to have crashed.
Isis takes an alternative approach based on the assumption that failure detectors
rarely make mistakes [R?B91]. In those cases in which a correct process p is falsely
suspected by the failure detector, p is effectively forced "to crash" (via a g?oup
membership protocol that removes p from all the groups that it belongs to). An
application using such a failure detector cannot distinguish between a faulty process
that really crashed, and a correct one that was forced to do so. Essentially, the Isis
failure detector forces the system to conform to its view. From the application's
point of view, this failure detector looks "perfect": it never makes visible mistakes.
For the Isis approach to work, the low-level time-outs used to detect crashes
must be set very conservatively: Premature time-outs are costly (each results in
the removal of a process), and too many of them can lead to system shutdown.1 In
contrast, with our approach, premature time-outs (e.g., failure detector mistakes)
are not so deleterious: they can only delay an application. In other words, pre-
1For example, the time?out period in the current version of Isis is greater than 10 seconds.
105
mature time-outs can affect the liveness but not the sajety of an application. For
example, consider the Atomic I3roadcast algorithm that uses ?W. If the failure
detector "malfunctions", some messages may be delayed, but no message is ever
delivered out of order, and no correct process is removed. If the failure detector
stops malfunctioning, outstanding messages are eventually delivered. Thus, we
can set time-out periods more aggressively than Isis: in practice, we would set
our failure detector time-out periods closer to the average case, while Isis must set
time-outs to the worst-case.
As we have seen, the approach taken in the lowest layer of the Isis system, is
fundamentally different from our approach. However, both approaches achieve the
same result: by augmenting the asynchronous model of computation with a failure
detection mechanism, they make it practical. Thus it would be interesting to study
whether the higher layers of Isis can be built based on our approach. This is left
as a subject for future research.
5.4 Other work
Several works in fault-tolerant computing used time-outs primarily or exclusively
for the purpose of failure detection. An example of this approach is given by an
algorithm in [ADLS9i], which, as pointed out by the authors, "can be viewed as
an asynchronous algorithm that uses a fault detection (e.g., timeout) mechanism."
Appendix A
A hierarchy of failure detectors
and bounds on fault-tolerance
In the preceding chapters, we introduced the concept of unreliable failure detectors
that could make mistakes, and showed how to use them to solve Consensus despite
such mistakes. Informally, a mistake occurs when a correct process is erroneously
added to the list of processes that are suspected to have crashed. In this Appendix,
we formalise this concept and study a related property that we call ?epentance. In-
formally, if a process p learns that its failure detector module Dp made a mistake,
repentance requires Dp to take corrective action. Based on mistakes and repen-
tance, we define a hierarchy of failure detector specifications that will be used to
unify some of our results, and to refine the lower bound on fault-tolerance given in
Section 3.5.3. This infinite hierarchy consists of a continuum of repentant failure
detectors ordered by the maximum number of mistakes that each one can make.
A.1 Mistakes and repentance
We now define a rnzstake. Let R (F, H, I, S, T? be any run using a failure
detector D. D makes a mistake in ? at time t on process p about process q if at
time t, p begins to suspect that q has crashed even though q ? F(t). Formally:
106
107
FQ),q E H(p,t)] and ?t' ? t,Vt11 t' < tll ? t : q ? H(p,ttt)]
Such a mistake is denoted by the tuple (?, p, q, t?. The set of mistakes made by D
in R is denoted by M(R).
Note that only the erroneous addition of q into Dp is counted as a mistake on p.
The continuous retention of q into Dp does not count as additional mistakes. Thus,
a failure detector can make multiple mistakes on a process p about another process
q only by repeatedly adding and then removing q from the set Dp. In practice,
mistakes are caused by premature time-outs.
We define the following four types of accuracy properties for a failure detector
D based on the mistakes made by D:
o+ Strongly k--H mistaken: D makes at most k mistakes. Formally, D is strongly
k--Hmistaken if:
VRusingD: M(J?) ?k
o+ Weakly k--Hmistaken: There is a correct process p such that D makes at most
k mistakes about p. Formally, D is weakly k--Hmistaken if:
VR = (F, H, 1, 5, T? using D, a? ? correct(F)
f(R,Qp,t? : (R,Qp,t? E M(R)iI ? k
o+ Strongly finitely mistaken: D makes a finite number of mistakes. Formally,
D is strongly finitely mistaken if:
VR using D: M(1?) is finite.
In this case, it is clear that there is a time t after which D stops making
mistakes.
108
o+ Weakly finitely mistaken: There is a correct process p such that D makes a
finite number of mistakes about p. Formalty, D is weakly finitely mistaken
if:
VR = (F, H, I, S, T? using D, ?p c correcl(F):
?(R,q,p,t? : (R,Qp,t? ? M(R)? is finite.
In this case, there is a time t after which D stops making mistakes about p.
For most values of k, the properties mentioned above are not powerful enough
to be useful. For example, suppose every process permanently suspects every other
process. In this case, the failure detector makes at most (n --H 1)2 mistakes, but it
is clearly useless since it does not provide any information.
The core of this problem is that such failure detectors are not forced to reverse
a mistake, even when a mistake becomes "obvious" (say, after a process q replies
to an inquiry that was sent to q after q was suspected to have crashed). However,
we can impose a natural requirement to circumvent this problem. Consider the
following scenario. The failure detector module at process p erroneously adds q
to Dp at time t. Subsequently, p sends a message to q and receives a reply. This
reply is a proof that q had not crashed at time t. Thus, p knows that its failure
detector module made a mistake about q. It is reasonable to require that, given
such irrefutable evidence of a mistake, the failure detector module at p takes the
corrective action of removing q from Dp. In general, we can require the following
property:
o+ Repentance: If a correct process p eventually knows that q ? F(t), then at
some time after t, q ? Dp. Formally, D is repentant if:
VR= (F,H,I,S,T? usingD,Vt,Vp,q? II:
(R, t') # K?(q ? F(t))] ? [?t11 > t : q ? lI(p, t")]
109
The knowledge theoretic operator Kp can be defined formally [llM9oj. Informally
(R, t) # ? iff in run I? at time t, predicate ? holds. We say (1?, t) Np (R', t') iff
the run R at time t and the run fi' at time t' are indistinguishable to p. Finally,
# Kp(?) w(1?',t') ? (1?,t) : (1?',t') ? ?. For a detailed treatment
of Knowledge Theory as applied to distributed systems, the reader should refer to
the seminal work done in [MDll86,HM9OJ.
Recall that in Section 3.1.2 we defined a failure detector to be a function that
maps each failure pattern to a set of failure detector histories. Thus, the specifica-
tion of a failure detector depends solely on the failure pattern actually encountered.
In contrast, the definition of repentance depends on the knowledge (about mis-
takes) at each process. This in turn depends on the algorithm being executed, and
the communication pattern actually encountered. Thus, repentant failure detec-
tors cannot be specified solely in terms of the failure pattern actually encountered.
Nevertheless, repentance is an important property that we would like many failure
detectors to satisfy.
In the rest of this Appendix, we informally define a hierarchy of repentant failure
detectors that differ by their accuracy (i.e., the maximum number of mistakes they
can make). As we just noted, such failure detectors cannot be specified solely in
terms of the failure pattern actually encountered, and thus they do not fit the
formal definition of failure detectors given in Section 3.1.2.
A.2 A hierarchy of repetant failure detectors
We now define an infinite hierarchy of repentant failure detectors. Every failure
detector in this hierarchy satisfies weak completeness, repentance, and one of the
four types of accuracy that we defined in the previous section. We name these
failure detectors after the accuracy property that they satisfy:
0 8?(k) denotes a Strong1? k-Mistaken faziure detector,
110
=N Q (strongest)Consensus solvable for all f ? n
Consensus solvable iff f ? n
Consensus solvable iff f ? n --H 1
N 8			W			S?([-?2J --H 1)Consensus solvable iff f ? [-n21 +2
Consensus solvable
for all f ? n			S?([??2i)Consensus solvable iff f ?			+ 1
Consensus solvable iff			8? N			N?
f?[Th21			-			?)/V (weakest)
Figure A.1: The hierarchy of repentant failure detectors ordered by reducibility.
This figure also shows the maximum number of faulty processes for which Consen-
sus can be solved using each failure detector in this hierarchy.
111
o+ S? denotes a Stron9ly Finitely Mistaken failure detector
o+ VV?(k) denotes a Weakly k-Mistaken failure detector, and
o+ )`V? denotes a Weakly Finitely Mistaken failure detector
Clearly, S?(O) ? 8?(1) ? ... S?(k) ? S?(k t 1) ? ... ? S?. A similar
order holds for the 14'?s. Consider a system of n processes of which at most
f may crash. In this system, there are at least n --H f correct processes. Since
--H f) --H 1) makes fewer mistakes than the number of correct processes, there
is at least one correct process that it never suspects. Thus, S?((n --H f) --H 1) is
weakly 0-mistaken, and S?((n --H f) --H 1) ? W?(o). Furthermore, it is clear that
? VV?. This infinite hierarchy of failure detectors, ordered by reducibility, is
illustrated in Figure A.1 (where an edge denotes the ? relation).
Each of the eight failure detectors that we considered in Section 3.1.5 is equiv-
alent to some failure detector in this hierarchy. In particular, it is easy to show
that:
Observation 67:
=N Q N
.8 N= ? N
o+ ??N=?QNs?Thd
. =N ??
For example, it is easy to see that the reduction algorithm in Figure 3.3 transforms
W? into ?W. Other conversions are similar or straightforward and are therefore
omitted. Note that ? and ?W are the strongest and weakest failure detectors in
this hierarchy, respectively. From Corollaries 15 and 31, and Observation 67 we
have:
112
Corollary 68: Given )/VF(O), Consensus and Atomic Broadcast are solvable in
asynchronous systems with f ?
Similarly, from Corollaries 20 and 32, and Observation 67 we have:
Corollary 69: Given W?, Consensus and Atomic Broadcast are solvable in asyn-
chronous systems with f ?
A.3 Tight bounds on fault-tolerance
Since Consensus and Atomic Broadcast are equivalent in asynchronous systems
with any number of faulty processes (Theorem 30), we can focus on establishing
fault-tolerance bounds for Consensus. In Section 3.5, we showed that failure detec-
tors with perpetualaccuracy (i.e.,?, Q, 8, or 14') can be used to solve Consensus in
asynchronous systems with an? number of failures. In contrast, with failure detec-
tors with eventual accuracy (i.e., ??, ?Q, ?S, or ?W), Consensus can be solved
if and only if a majority of the processes are correct. We now refine this result by
considering each failure detector D in our infinite hierarchy of failure detectors, and
determining how many correct processes are necessary to solve Consensus using D.
The results are illustrated in Figure A.1.
There are two cases depending on whether we assume that the system has a
majority of correct processes or not. Since ?14', the weakest failure detector in
the hierarchy, can be used to solve Consensus when a majority of the processes are
correct, we have:
Observation 70: If f ? ??2 then Consensus can be solved using any failure detec-
tor in the hierarchy of Figure A.1.
We now consider the solvability of Consensus in systems that do not have a
majority of correct processes. For these systems, we determine the maximum m
for which Consensus is solvable using S?(m) or 14'?(m). We first show that
113
Consensus is solvable using S?(m) if and only if ni, the number of mistakes, is
less than or equal to n --H f, the number of correct processes. We then show that
Consensus is solvable using VY?(m) if and only if m 0.
Theorem 71: Suppose f > ??2' If m > n --H f then there is a Strongly m-Mistaken
failure detector 8?(m) such that there is no algorithm A which solves Consensus
using Sf(m) in asynchronous systems.
PROOF: [sketch] We describe the behaviour of a Strongly m-Mistaken failure de-
tector S?(m) such that with every algorithm A, there is a run RA of A using
S?(m) that does not satisfy the specification of Consensus. Since 1 ? n --H f ?
we can partition the processes into three sets Ilo, Hi and 11crashed, such that 11o and
Hi are non-empty sets containing n --H f processes each, and 11crashed is a (possibly
empty) set containing the remaining n --H 2(n --H f) processes. For the rest of this
proof, we will only consider runs in which all processes in 11crashed crash in the
beginning of the run. Let qo ? 11o and qi C Hi. Consider any Consensus algorithm
A, and the following two runs of A using Sf(m):
o+ Run J?o = (F0, H0, I, s0, T0?: All processes in Ho propose 0, and all processes
in Hi U 11crashed propose 1. All processes in 11o are correct in F0, while all
the f processes in Hi U 11crashed crash in F0 at the beginning of the run,
i.e., Vt ? T : F0(t) = Hi U 11crashed' Process qo permanently suspects every
process in Hi U 11crashed, i.e., Vt e I : H0(q0, t) = Hi U 11crashed = F0(t).
No other process suspects any process, i.e., Vt ? T,Vq ? : H0(q,t) =
In this run, it is clear that S?(rn) satisfies the specification of a Strongly
m-Mistaken failure detector.
o+ Run Ri = (Fi,Hi,I,Si,Ti?: As in R0, all processes in 11o propose 0, and
all processes in Hi U11crashed propose 1. All processes in Hi are correct in Fi,
while all the f processes in 11o U 11crashed crash in Fi at the beginning of the
run, i.e., Vt ? I : Fi(t) = 11o U 11crashed' Process qi permanently suspects
114
every process in 11o U 11crashed, and no other process suspects any process.
Clearly, 8?(m) satisfies the specification of a Strongly m-Mistaken failure
detector in this run.
Assume, without loss of generality, that both 1?0 and 1?1 satisfy the specification
of Consensus. Let to be the time at which qo decides in 1?0, and let ti be the time at
which qi decides in 1?1. There are three possible cases in each case we construct
a run 1?A (FA, HA, 1A, 5A, TA? of algorithm A using S?(m) such that 8?(m)
satisfies the specification of a Strongly m-Mistaken failure detector, but 1?A violates
the specification of Consensus.
1. In 1?0, qo decides 1. Let 1?A (F0, H0, 1A, ?0, T0? be a run identical to 1?o
except that all processes in lliU11crashed propose 0. Since in F0 the processes
in lli U 11crashed crash right from the beginning of the run, 1?0 and 1?A are
indistinguishable to qo. Thus, qo decides 1 in 1?A (as it did in 1?o), thereby
violating the uniform validity condition of Consensus.
2. In Ri, qi decides 0. This case is symmetric to Case 1.
3. In 1?0, qo decides 0, and in Ri, qi decides 1. Construct 1?A (PA, HA, I, 5A, TA?
as follows. As before, all processes in 11o propose 0, all processes in lli U
11crashed propose 1, and all processes in 11crashed crash in ?A at the beginning
of the run. All messages from processes in 11o to those in 11i and vice-versa,
are delayed until time to t ti. Until time to, 1?A is identical to 1?o, except
that the processes in 11i do not crash, they are only "very slow" and do not
take any steps before time to. Thus, until time to, qo cannot distinguish be-
tween 1?o and 1?A, and it decides 0 at time to in 1?A (as it did in 1?o). Note
that by time to, the failure detector 8?(?71) made n --H f mistakes in 1??: qo
erroneously suspected that all processes in 11i crashed (while they were only
slow).
From time to, the construction of 1?A continues as follows.
115
(a) At time t0, all processes in 11o, except qo, crash in FA.
(b) From time to to time to t ti, qi suspects all processes in Ilo U 11crashed,
i.e., Vt,t0 ? t ? t0+t1 : H?(qi,t) IIoUllcrashed, and no other process
suspects any process. By suspecting all the processes in 11o, including
qo, the failure detector makes one mistake on process qi (about qo).
Thus, by time to + ti, 8?(m) has made a total of (n --H f) + 1 mistakes
in 1?A Since m > n --H f, S?(m) has made at most m mistakes in 1?A
until time to + ti.
(c) At time to, processes in fli "wake up." From time to to time to +ti they
execute exactly as they did in 1?1 from time 0 to time t1 (they cannot
perceive this real-time shift of t0). Thus, at time to + ti in run 1?A, ?i
decides 1 (as it did at time ti in 1?i) So qo and qi decide differently in
1?A, and this violates the agreement condition of Consensus.
(d) From time to + ti onwards the run 1?A continues as follows. No more
processes crash and every correct process suspects exactly all the pro-
cesses that have crashed. Thus, 8?(m) satisfies weak completeness,
repentance, and makes no further mistakes.
By (b) and (d), S?(m) satisfies the specification of a Strongly m-Mistaken
failure detector in run 1?A From (c), 1?A, a run of A that uses
violates the specification of Consensus.			Li
We now show that the above lower bound is tight: Given S?(m), Consensus
can be solved in asynchronous systems with m <n --H
Theorem 72: If m ? n --H f then Consensus can be solved in asynchronous systems
using any Strongly m-Mistaken failure detector S?(m).
PROOF: Suppose m ? n --H f. Since m, the number of mistakes made by S?(m),
is less than the number of correct processes, there is at least one correct process
116
that S?(m) never suspects. Thus, 8?(m) satisfies weak accuracy. By definition,
S?(m) also satisfies weak completeness. So S?(m) is a Weak Failure Detector
and can be used to solve Consensus (Corollary 15)
Suppose m = n --H f. Even though S?(m) can now make a mistake on every
correct process, it can still be used to solve Consensus (even if a majority of the
processes are faulty). The algorithm uses rotating coordinators, and is similar to
the one for ?W in Figure 3.6. Because of this similarity, we omit the details from
this Appendix.
From the above two theorems:
Corollary 73: Suppose f > ??2. Consensus can be solved in asynclironous systems
using any S?(rn) if and only if m < n --H
We now turn our attention to Weakly k-Mistaken failure detectors.
Theorem 74: Suppose f > ?2 If m > 0 then there is a Weakly m-Mistaken
failure detector W?(m) such that there is no algorithm A which solves Consensus
using W?(m) in asynchronous systems.
PROOF: In Theorem 71, we described a failure detector that cannot be used to
solve Consensus in asynchronous systems with f > -n2 It is easy to verify that this
failure detector makes at most one mistake about each correct process, and thus
it is a Weakly ni-Mistaken failure detector. E]
From Corollary 68 and the above theorem, we have:
Corollary 75: Suppose f > -n2 Consensus can be solved in asynchronous systems
using any W?(m) if and only if m = 0.
Bibliography
[ABD+87]
[ADKM91]
[ADLS91]
[BCJ+90]
[BGP89]
[BGT9Di
Hagit Attiya, Amotz Bar-Noy, Danny Dolev, Daphne Koller, David
Peleg, and Riidiger Reiscliuk. Achievable cases in an asynchronous en-
vironment. In Proceedings of the Twenty-Eighth Symposium on Foun-
dations of Computer Science, pages 337--H346. IEEE Computer Society
Press, October 1987.
Y. Amir, D. Dolev, 5. Kramer, and D. Malki. Transis: A commu-
nication sub-system for high availability. Technical Report CS91-13,
Computer Science Department, The Hebrew University of Jerusalem,
November 1991.
Hagit Attiya, Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer.
Bounds on the time to reach agreement in the presence of timing un-
certainity. In Proceedings of the Twenty third A CM Symposium on
Theory of Computing, May 1991.
Kenneth P. Birman, Robert Cooper, Thomas A. Joseph, Kenneth P.
Kane, and Frank Bernhard Schmuck. 1515 - A Distributed Program-
ming Environment, June 1990.
Piotr Berman, Juan A. Garay, and Kenneth J. Perry. Towards opti-
mal distributed consensus. In Proceedings of the Thirtieth Symposium
on Foundations of Computer Science, pages 410--H415. IEEE Computer
Society Press, October 1989.
Navin Budhiraja, Ajei Gopal, and Sam Toueg. Early-stopping dis-
tributed bidding and applications. In Proceedings of the Fourth
International IYorkshop on Distributed Algorithms. Springer-Verlag,
September 1990. In press.
Kenneth P. Birman and Thomas A. Joseph. Reliable communication
in the presence of failures, ACM Transactions on Computer Systems,
5(1):47--H76, February 1987.
[BJ87j
117
118
[BMz88] Ofer Biran, Shiomo Moran, and Shmuel Zaks. A combinatorial charac-
terization of the distributed tasks that are solvable in the presence of
one faulty processor. In Proceedings of the Seventh ACM Symposium
on Principles of Distributed Computing, pages 263--H275, August 1988.
[BW87]			M.
[CASD85]
[CD89]
[CDD9o]
[CM841
[Cri87]
[CT9o]
[DDS87]
Bridgiand and R. Watro. Fault-tolerant decision making in totally
asynchronous distributed systems. In Proceedings of the Sixth ACM
Symposium on Principles of Distributed Computing, 1987.
Flaviu Cristian, Houtan Aghili, II. Raymond Strong, and Danny Dolev.
Atomic broadcast: From simple message diffusion to Byzantine agree-
ment. In Proceedings of the Fifteenth International Symposium on
Fault-Tolerant Computing, pages 200--H206, June 1985. A revised ver-
sion appears as IBM Research Laboratory Technical Report RJ5244
(April 1989).
Benny Chor and Cynthia Dwork. Randomization in byzantine agree-
ment. Advances in Computer Research, 5:443--H497,1989.
and Jon Dehn. Fault-tolerance
Technical Report RJ 7424, IBM
Flaviu Cristian, Robert D. Dancey,
in the advanced automation system.
Research Laboratory, April 1990.
J. Chang and N. Maxemchuk. Reliable broadcast protocols. ACM
Transactions on Computer Systems, 2(3):251--H273, August 1984.
Flaviu Cristian. Issues in the design of highly available computing ser-
vices. In Annual Symposium of the Canadian Information Processing
Society, pages 9--H16, July 1987. Also IBM Research Report RJ5856,
July 1987.
Tushar Deepak Chandra and Sam Toueg. Time and message efficient
reliable broadcasts. In Proceedings of the Fourth International Work-
shop on Distributed Algorithms. Springer-Verlag, September 1990. In
press.
Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. On the minimal
synchronism needed for distributed consensus. Journal of the ACM,
34(1):77--H97, January 1987
Danny Dolev, Nancy A. Lynch, Shlomit 5. Pinter, Eugene W. Stark,
and William E. Weihl. Reaching approximate agreement in the pres-
ence of faults. Journal of the ACM, 33(3):499--H516, July 1986.
[DLP+86j
119
[DLS88]
[Fis83]
[FLP85i
[GSTC90J
[llad84j
[11M90]
Cynthia Dwork, Nancy A. Lynch, and Larry Stockmeyer. Consensus in
the presence of partial synchrony. Journal of the ACM, 35(2):288--H323,
April 1988.
Michael J. Fischer. The consensus problem in unreliable distributed
systems (a brief survey). Technical Report 273, Department of Com-
puter Science, Yale University, June 1983.
Michael J. Fischer, Nancy A. Lynch, and Michael 5. Paterson. Impos-
sibility of distributed consensus with one faulty process. Journal of the
ACM, 32(2):374--H382, April 1985.
Ajei Copal, Ray Strong, Sam Toueg, and Flaviu Cristian. Early-
delivery atomic broadcast. In Proceedings of the Ninth ACM Sympo-
sium on Principles of Distributed Computing, pages 297--H310, August
1990.
Vassos lladzilacos. Issues of Fault Tolerance in Concurrent Computa-
tions. Ph.D. dissertation, Harvard University, June 1984. Department
of Computer Science Technical Report 11-84.
Joseph Y. Halpern and Yoram Moses. Knowledge and common knowl-
edge in a distributed environment. Journal of the ACM? 37(3):549--H587,
July 1990.
[LA87] M.C. Loui and Abu-Amara. Memory requirements for agreement
among unreliable asynchronous processes. Advances in computing re-
search, 4:163--H183,1987.
[Lam78j Leslie Lamport. The implementation of reliable distributed multipro-
cess systems. Computer Networks, 2:95--H114,1978.
[LF82j
[Lo93]
[LSP82]
Leslie Lamport and Michael Fischer. Byzantine generals and transac-
tion commit protocols. Technical Report 62, SRI International, April
1982.
Wai Kau Lo. Using failure detectors to solve consensus in asynchronous
shared-memory systems. Masters dissertation, University of Toronto,
January 1993.
Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine
generals problem. ACM ?ransactions on Programming Languages and
Systems, 4(3):382--H401, July 1982.
Yoram Moses, Danny Dolev, and Joseph Y. Halpern. Cheating hus-
bands and other stories: a case study of knowledge, action, and com-
munication. Distributed Computing, 1(3): 167--H176, 1986.
[MDH86]
120
[MSF87]
[Mu187]
[PBS89]
[PGM89]
C. Mohan, R. Strong, and 5. Finkeistein. Methods for distributed
transaction commit and recovery using Byzantine agreement within
clusters of processors. In Proceedings of the Sith ACM Symposium on
Principles of Distributed Computing, 1987.
Sape J. Mullender, editor. The Amoeba distributed operating system:
Selected papers 1981 - 1987. Centre for Mathematics and Computer
Science, 1987.
L. L. Peterson, N. C. Bucholz, and Richard D. Schlichting. Preserving
and using context information in interprocess communication. In ACM
rPransactions on computer systems 7,3, pages 217--H246, August 1989.
Frank Pittelli and Hector Garcia-Molina. Reliable scheduling in a tmr
database system. ACM Transactions on Computer Systems, 7(1):25--H
60, February 1989.
[Pow91] D. Powell, editor. Delta-4: A Generic Architecture for Dependable
Distributed Computing. Springer-Verlag, 1991
[P5L80] M. Pease, R. Shostak, and Leslie Lamport. Reaching agreement in the
presence of faults. Journal of the ACM, 27(2):228--H234, April 1980.
[PT86j
[RB91j
[Rei82]
[5ch90]
Kenneth J. Perry and Sam Toueg. Distributed agreement in the pres-
ence of processor and communication faults. IEEE Transactions on
Software Engineering, 12(3):477--H482, March 1986.
Aleta Ricciardi and Ken Birman. Using process groups to implement
failure detection in asynchronous environments. In Proceedings of the
Tenth A CM Symposium on Principles of Distributed Computing, pages
341--H351. ACM Press, August 1991.
Riidiger Reischuk. A new solution for the Byzantine general's problem.
Technical Report RJ 3673, IBM Research Laboratory, November 1982.
Fred B. Schneider. Implementing fault-tolerant services using the state
machine approach: A tutorial. ACM Computing Surveys, 22(4):299--H
319, December 1990.
John II. Wensley, Leslie Lamport, Jack Goldberg, Milton W. Green,
Karl N. Levitt, P.M. Melliar-Smith, Robert E. Shostak, and Charles B.
Weinstock. SIFT: Design and analysis of a fault-tolerant computer for
aircraft control. Proceedings of the IEEE, 66(10):1240--H1255, October
1978.
[WLG+78]
