yes
yes
BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1353
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: The Primary-Backup Approach: Lower and Upper Bounds
AUTHOR:: Budhiraja, Navin 
DATE:: June 1993
PAGES:: 194
COPYRIGHT:: Navin Budhiraja 1993 - All Rights Reserved
ABSTRACT::
The most widely used approach to building replicated, fault-tolerant services 
is the primary-backup approach. In this approach, the state of the service is 
replicated across multiple servers, with one server designated as the primary 
and the rest as backups. Clients send requests only to the primary. However, 
in case the primary fails, one of the backups takes over as the new primary. 
Ever since it was introduced in 1976 by Alsberg and Day, the primary-backup 
approach has become the basis for building many practical fault-tolerant 
services. However, despite the widespread use, the approach has not been 
studied systematically, and little is known of the fundamental costs and 
tradeoffs of using the approach under various kinds of failures. Thus, there 
is a gap between theory and practice. In order to close this gap, this thesis 
analyzes the primary-backup approach, both from the theoretical perspective of 
specification, lower bounds and upper bounds, as well as from the practical 
viewpoint of performance tradeoffs in protocols. We identify three key cost 
metrics of primary-backup protocols--degree of replication, blocking time and 
failover time--and then show lower and upper bounds on these metrics for a 
hierarchy of failure models. We then implement an important subclass of our 
primary-backup protocols, called 0-blocking protocols, and give performance 
figures. In addition to leading to the development of new, more efficient 
protocols, we believe that the work in this thesis has resulted in a better 
understanding of the properties of existing primary-backup protocols.
END:: CORNELLCS//TR93-1353
BODY::
The Primary-Backup Approach:
Lower and Upper Bounds
Navin Budhiraja
Ph.D Thesis
93-1353
June1993
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
THE PMMARY-BACNUP APPROACH:
LOWER AXD UPPER BOUXDS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Navin Budhiraja
August 1993
Qc Navin Budhiraja 1993
ALL RIGHTS RESERVED
THE PRIMARY-BACKUP APPROACH:
LOWER AND UPPER BOUNDS
Navin Budhiraja, Ph.D.
Cornell University 1993
The most widely used approach to building replicated, fault-tolerant services
is the primary-backup approach. In this approach, the state of the service is
replicated across multiple servers, with one server designated as the primary, and
the rest as backups. Clients send requests only to the primary. However, in case
the primary fails, one of the backups takes over as the new primary. Ever since
it was introduced in 1976 by Alsberg and Day, the primary-backup approach has
become the basis for building many practical fault-tolerant services. However,
despite the widespread use, the approach has not been studied systematically, and
little is known of the fundamental costs and tradeoffs of using the approach under
various kinds of failures. Thus there is a gap between theory and practice. In order
to close this gap, this thesis analyzes the primary-backup approach, both from the
theoretical perspective of specification, lower bounds and upper bounds, as well
as from the practical viewpoint of performance tradeoffs in protocols. We identify
three key cost metrics of primary--Hbackup protocols--Hdegree of replication, blocking
time and failover time--Hand then show lower and upper bounds on these metrics
for a hierarchy of failure models. ?`e then implement an important subclass of
our primary--Hbackup protocols, called 0--Hblocking protocols, and give performance
figures. In addition to leading to the development of new, more efficient protocols,
we believe that the work in this thesis has resulted in a better understanding of
the properties of existing primary-backup protocols.
Biographical Sketch
Navin Budhiraja was born in Bhilai, Madhya Pradesh, India on January 17,1966.
Though he spent most of his childhood playing cricket in Bhilai, he also attended
Higher Secondary School, and graduated in 1984. Subsequently, he joined the
Computer Science Department at Indian Institute of Technology, Kanpur, from
where he graduated with a bachelor of technology degree in 1988. In Fall 1988, he
enrolled in the Computer Science Department at Cornell University, Ithaca, N.Y.
as a graduate student. He received a masters of science degree in May 1991, and
was awarded a doctor of philosophy degree in August 1993.
iii
Acknowledgements
First of all, I would like to thank my advisor, Keith Marzullo, without whom this
work would have not been possible, He provided me with direction and encour-
agement that has proved invaluable during the course of this work. I would also
like to thank Fred Schneider and Sam Toueg, both of whom provided considerable
assistance needed to bring this work to completion. Among other things, Fred's
insight on building fault--Htolerant services has been extremely helpful. And Sam,
in addition to introducing me to the basics of Italian cuisine, also introduced me
to research. I am also indebted to other members of my committee: Larry Blume
and Devika Subramanium. The presentation in this thesis has benefited greatly
from the comments of Keith, Fred, Devika and Sam.
Thanks also go to Prasad Jayanti for helping with the ideas behind Chapter 2. I
have also benefited from various discussions with other members of the distributed
systems group here at Cornell, specifically Tushar Chandra, Ajei Gopal, Mike
Reiter, Pat Stephenson and Mark Wood.
The financial support provided by the Defence Advanced Research Projects
Agency under Nasa Ames grant number NAG 2-593, Contract NOOl4O-87-C-89O4
is also acknowledged. In addition, this work has also been supported by an I.B.M.
Graduate Fellowship.
iv
Lastly, I am indebted to a beloved group of people without whom I would
never have reached this stage in my life. These are my parents Chaman and Sneh
Budhiraja, my sister Sabina and my wife Sonal. NIy family has been a source of
pleasure, support and encouragement all through.
v
Table of Contents
2
3
4
5
Introduction
1.1 Contributions of the Thesis
1.2			Related Work			. . .
1.3			System Model
Impossibility of Building Asynchronous, Fault-Tolerant Services
2.1			Specification of the History system			.
2.2			Impossibility Result
2.2.1			Formal proof of correctness			.			. . .
Specification of the Primary--HBackup Approach
3.1			Spedfic?tion			. . .
3.2			Formal statement of Pb4 .			. .
Lower Bounds
4.1 Bounds on Replication
4.2 Bounds on Blocking.
4.3 Bounds on Failover Times
4.4			Summary			. . .
4.4.1 Protocol having D> F
4.4.2 Protocol violating PbS
Optimal Primary-Backup Protocols--HPart I
5.1 Protocols for the Clients and the Bofo Server
5.2			The Primary-Backup Protocol Schema			. .
5.2.1			Proof of Correctness			. .
5.2.2 Message losses between the clients and the servers
5.3			Implementation for the various failure models
5.3.1			Crash Failures . . . . . .
5.3.2			Crash+link failures.
5.3.3			Send--HOmission failures			.			. .
5.3.4			Receive?-Omission failures . 			. .
5.3.5			General-Omission failures
vi
3
6
9
12
14
16
18
20
20
22
25
26
33
43
48
48
51
54
54
55
60
75
76
76
79
84
89
95
6
7
8
Optimal Primary--HBackup Proto cols--HPart II
6.1 Send--HOmission failures .
6.1.1 Proof of correctness
6.2 General--HOmission Failures
6.2.1 Proof of correctness
6.3 Receive--HOmission failures
6.3.1 Proof of Correctness
Implementation Results
7.1			Failure Assumptions . . . .
7.2			Description of the implementation . .			. . .
7.3			Results .
7.3.1 Communication using point--Hto--Hpoint messages
7.3.2 Communication using hardware broadcasts
7.4 Extending the results to runs with failures . . .
7.5 Discussion
Survey of Existing Primary--HBackup Protocols
8.1			The Alsberg--HDay Protocol			. .			. .
8.2			The Tandem Protocol . .			.			.
8.3 The Viewstamped Replication Protocol
8.4			ECHO .			.			.
8.5			HA-NFS			. .			. .
8.6			HARP .
8.7			Summary			.			. . .
9 Conclusions
9.1 Future Directions .
Bibliography
vii
loo
103
107
117
118
131
137
156
157
158
160
160
164
168
170
171
172
173
174
176
177
179
180
184
187
189
List of Tables
4.1			Summary of lower bound results
8.1			Summary of survey results. .
9.1 Summary of lower bound and upper bound results
viii
o+ . . 49
181
186
List of Figures
1.1
1.2
1.3
1.4
2.1
2.2
2.3
2.4
The primary-backup approach
A backup takes over as the primary
The state--Hmachine approach
Server s1 has failed			. .
The function RE?UEST			.			. .			.
Protocol executed by server s . .
Protocol run by client Cj with initial value Vj (symmetric for c?)
Consensus protocol executed by each server, in addition to execut-
ing the protocol corresponding to Ii			.			. .
4.1 Protocol having D> I'
4.2 Protocol with f = 2 and Failover Time < 2?
5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
5.9
5.10
5.11
6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.8
2
2
8
14
15
17
17
49
52
Protocol run by a canonical server P
Protocol run by client i interacting with server a
Protocol run by server sj to emulate server P
Procedures primary and backup .			. .
Procedures for crash failures . . . .			.
Procedures for crash+link failures-i .			. .
Procedures for crash+link failures-ii			.			. .
Procedures for send?omission failures-i			.
Procedures for send?mission failures-ii
Procedures for receive?mission failures
Procedures for general?mission failures
55
56
56
58
77
80
81
85
86
90
96
Protocol run by server sj for send?mission failures (fls			= 2, f = 1)			103
Procedures for send?mission faiUres(n8 = 2, f = 1)-i			104
Procedures for send?omission failures(n? = 2, f = 1)-ii105
Procedures for send?mission faiUres(n3 = 2, f = 1)-iii 106
Protocol run by server Sj for general?omission failures(n3 = 3, f = 1)117
Procedures for general?mission failures(ns			= 3, f =			1)-i			119
Procedures for general?mission failures(n?			= 3, f =			1)-ii			. .			120
Procedures for general?omission faiUres(n?			= 3, f =			1)-iii			. . .			121
ix
6.9
6.10
6.11
6.12
7.1
7.2
7.3
7.4
7.5
7.6
Protocol run by server sj for receive--Homission failures(n? = 2, f = 1)131
Procedures for receive?mission faihres(n? = 2, J =			1)-i			132
Procedures for receive?mission faiUres(n? = 2, f =			1)-ji			. .			. 133
Procedures for receive?omission failures(n? = 2, f =			1)-iji			134
Avg. response time vs. degree of replication. Number of clients =
1 and compute time = 10 ms . . . 162
Avg. response time vs. degree of replication. Number of clients =
2 and compute time = 10 ms163
Avg. response time vs. degree of replication. Number of clients =
2 and compute time = 30 ms. 			164
Avg. response time vs. degree of replication with query requests.
Number of clients = 2 and compute time = 10 ms165
Avg. response time vs. degree of replication. Number of clients =
1 and compute time = 10 ms166
Avg. response time vs. degree of replication. Number of clients =
2 and compute time = 10 ms167
Avg. response time vs. degree of replication. Number of clients =
2 and compute time = 2 ms			. .			. . .			. .			. . .			168
x
7.7
Chapter 1
Introduction
The most widely used technique for building fault--Htolerant client--Hserver systems
is the primary--Hbackup approach [AD76] shown in Figure 1.1. In this approach,
any service is implemented using an ensemble of servers. The state of the service
is replicated across the servers, with one server designated as the primary (server
si in the figure), and the remaining servers designated as backups. Clients send
requests to the primary, which performs the requested service, and then responds
to the clients. However, in order to keep the backups up-to-date with any changes
that occur in the state of the service, the primary informs the backups of any state
updates that occur. In case the primary fails, one of the backups (server ?2 in
Figure 1.2) takes over as the primary, and clients send subsequent requests to this
new primary.
Ever since it was introduced in 1976, the primary-backup approach has become
the basis for building many practical and research systems. In particular, this ap-
proach has been used in the construction of a fault-tolerant version of UNIX'
1UNIX is a registered trademark of AT&T.
A
53			sn
2
M
Th			mm
M			_			[-i
Clients
Servers
Figure 1.1: The primary--Hbackup approach
ffi;
\y yffiMZ
7
53
Clients
M			M
M
Servers
Figure 1.2: A backup takes over as the primary
3
[BBG+89], fault--Htoleraiit parallel programs [BAA+92], highly available transac-
tion processing systems [BarSi] and highly available file systems [MHS89,BEM9l,
LGC+91]. However, despite this widespread use in extant systems, the primary-
backup approach has not attracted much systematic study, and little is known of
the fundamental costs and tradeoffs of using this approach under various kinds of
failures. For example, even though a variety of primary-backup protocols have
been mentioned in the literature, it is not known whether these protocols are opti-
mal in the number of servers they use, in the amount of time they take to respond
to a client request, and so on. Thus, there is a sizable gap between theory and
practice. In order to close this gap, this thesis studies the primary--Hbackup ap-
proach, both from a theoretical viewpoint of specification, lower bounds and upper
bounds, and from a practical viewpoint of performance tradeoffs in protocols. In
addition to leading to a better understanding of existing primary-backup protocols
by identifying which of these protocols are optimal, our work has also led to the
development of new, optimal primary--Hbackup protocols.
1.1 Contributions of the Thesis
Consider the following three key cost metrics of any primary--Hbackup protocol.2
The first is the degree of replication or the number of servers used to imple-
ment the protocol. For example, a protocol that uses 2f + 1 servers to tolerate
f server failures has a degree of replication of 2f + 1. The second metric is the
blocking time or the worst-case period between a request and its response in any
failure--Hfree execution of the protocol. For example, many existing primary--Hbackup
2See Chapter 4 for formal definitions
4
protocols work as follows. In order to ensure that the states of the backups are
up-to-date with the state of the primary, the primary sends to the backups any
state update that occurs due to the receipt of a request, waits for the backups to
acknowledge, and only then sends a response to a client. Therefore1 assuming that
all computation takes negligible time, the blocking time of these protocols equals
the amount of time it takes to send a message and receive the corresponding ac-
knowledgement. Finally, the last metric is the failover time or the worst-case
period of time during which there is no primary. Any primary--Hbackup protocol
will have such a period because the current primary may fail, and consequently, a
new primary will have to be elected.
Clearly, one is interested in knowing the values of these metrics for any primary--H
backup protocol. Not only do these metrics provide a basis for evaluating the
protocol, they also provide a basis for comparing the protocol with other primary-
backup protocols. However, one is also interested in answering a more fundamental
question:
Given that no more than f components can fail, what are the smallest possi-
ble values of the the degree of replication, the blocking time and the failover
time?
An answer to the above question defines, in practical terms, what proper-
ties an optimal primary-backup protocol should have. Unfortunately, despite the
widespread use of the primary-backup approach, the above question had not been
addressed in the literature. We, therefore, do so in this thesis.
We answer this question by giving lower bounds for the three cost metrics that
we identify, where a tower bound for a metric gives the smallest possible value of
5
that metric for any possible pnmary-backup protocol. In turns out that the values
of these lower bounds also depend on the underlying failure assumptions, and so
we give the lower bounds for a commonly used hierarchy of failure models.
The existence of a lower bound, however, does not imply the existence of a
protocol that achieves this bound. For example, degree of replication has a trivial
lower bound of 0 (since the number of servers required to implement any primary-
backup protocol cannot be negative). However, clearly no protocol can achieve
this lower bound, because there has to be at least one server to send responses
to the client requests. Therefore, in this thesis, we also demonstrate that most of
our lower bounds are tight by showing that these lower bounds can be achieved in
practice. We do this by giving optimal protocols for the various failure models i.e.
protocols for which the degree of replication, the blocking time, and the failover
time are equal to the values specified by the corresponding lower bounds.
Before we can derive lower bounds and upper bounds for the primary-backup
approach, we need to precisely state what we mean by this approach i.e. give a
specification for the primary--Hbackup approach. The specification that we give in
this thesis is not arbitrary, and we arrive at it by abstracting from existing protocols
the "essence" of the approach. Not only is our specification weak enough to admit
protocols that one would consider to be primary--Hbackup, it is also strong enough to
disallow protocols that one would not consider to be primary--Hbackup. Since this
is the first ever precise specification for primary--Hbackup, the specification itself
is an important contribution of this thesis because it provides a concrete point
from which the primary--Hbackup approach can be compared to other approaches
for structuring fault--Htolerant services.
6
The lower bound and upper bound results in this thesis have led to a better
understanding of existing primary-backup protocols by identifying which of these
protocols are optimal with respect to the metrics that we define. However, the
contributions of this thesis are not entirely of a theoretical nature. Our theoreti-
cal work has led to the development of new, optimal primary--Hbackup protocols in
failure models where protocols were previously unknown. In order to compare the
performance of these new protocols against existing protocols, and to better under-
stand the applicability of our assumptions to practical systems, we also implement
and analyze a subset of our protocols. The protocols that we implement form a
very interesting subclass of the primary--Hbackup protocols because these protocols
are theoretically the most general class of protocols that can achieve the smallest
possible response time.
1.2 Related Work
As we mentioned earlier, the primary--Hbackup approach is a widely used tech-
nique for building replicated, fault--Htolerant services. This approach, however, is
not the only approach for building fault-tolerant services. Another important ap-
proach is the state-machine approach[Lam78,Sch90]. Like the primary--Hbackup
approach, the state--Hmachine approach replicates the state of the service across
all the servers implementing the service. However, unlike the primary-backup ap-
proach, any client request is atomicatly broadcast [CM84,CASD85,BJ87,BGT90] to
each server (see Figure 1.3). Informally, atomic broadcast ensures that all (correct)
servers receive the same set of requests, and they receive them in the same order.
On receiving such a request, each server performs the requested service, updates
7
______ I
?1			52			53			sn
Clients
Servers
Figure 1.3: The state--Hmachine approach
its local state if necessary, and then responds to the client.3 Since a request is
sent to each server, the failure of a server does not disrupt the service because
the clients continue of get responses from the other servers (see Figure 1.4). The
dashed arrow in Figure 1.4 indicates that the crashed server 51 does not receive the
request. Among others, this approach has been used in the design of process control
applications [?VLG+78], distributed synchronization [Sch80] and highly available
distributed services [LLS6,MS88,Sie92].
The state--Hmachine approach has attracted a great deal of theoretical study
[Sch90], and a lot is known of the costs and tradeoffs of using this approach under
various models of failures. For example, the fundamental primitives used in con-
structing state--Hmachines is atomic--Hbroadcast. Extensive amount of research has
been done in order to understand the costs of atomic--Hbroadcast for various models
of failures [PSL8o,DRS9o,GSTC9o,CDS9o].
3In case of crash failures, if the atomic broadcast protocol prevents inconsistency [GT9l], then
the client accepts any response that it receives. For other kinds of failures, the client accepts the
majority response.
s
?1			53			sn
Figure 1.4: Server 51 has failed
Clients
Servers
Furthermore, the state--Hmachine approach is the preferred mode of replication
in many time--Hcritical applications. This is because failures of individual servers
can be immediately masked from the clients--Hclients can continue to get responses
from the other servers. In addition, this approach needs to place no restrictions on
the behavior of the faulty servers and clients.
In contrast to the state--Hmachine approach, with the primary--Hbackup approach,
a request may not generate any response. This may occur if the request was sent
to a primary that had crashed. Even though it is possible to mask this failure by
resending the request until the client receives a response, clients now may see a
long delay before they receive this response. Furthermore, in contrast to the state--H
machine approach, the primary--Hbackup approach only works for benign failures.
However, the above two characteristics i.e. a longer delay in the unlikely case when
failures occur, and a restriction on the kinds of failures that can be tolerated, do not
preclude its use in many important applications, and the primary--Hbackup approach
is the one that is more extensively used in practice [Bar8l,OLS8,BBG+89,MHS89,
9
SBs9,CDD9o,BENI9t,LGG+911. This is because it requires less processing (only
at the primary, rather than at each server as in the state-machine approach), and
is less costly (because it does not require the implementation of expensive atomic--H
broadcast protocols). Furthermore, the primary-backup approach can be used to
build non--Hdeterministic services, whereas the state-machine approach requires the
servers to be deterministic.
The final section in this chapter describes the system model that we will use
in showing our results, and then briefly discusses the structure of the rest of the
thesis.
1.3 System Model
Two broad classes of primary--Hbackup protocols have been considered in the lit-
erature: asynchronous and synchronou& In asynchronous protocols [OL88], no
assumptions are made about the speeds of the processes, or the delays incurred in
delivering messages from one process to another; in synchronous protocols [AD76,
Bar8t,BEM91], both process speeds and message delivery delays are known and
bounded. An intermediate class of protocols in which the process speeds and the
message delays are unbounded, but processes have clocks with bounded speeds
have also been studied in the literature [NIHSS9,LGG+91]. However, for our pur-
poses, we will classify these protocols along with the asynchronous protocols. In
this thesis, we only concentrate on synchronous primary--Hbackup protocols, because
as we show in Chapter 2, it is impossible to guarantee both safety (the protocol
gives correct responses) [Lam85] and tiveness (the protocol eventually gives a re-
10
sponse) [AS86] in asynchronous protocols.4
We consider a distributed system consisting of ?s servers and ?c clients. The
servers have local clocks which are assumed to be perfectly synchronized with real
time.5 Clients and servers communicate through a completely connected, point-to-
point, FIFO network. Furthermore, if processes (clients or servers) pi and p? are
connected by a (non-faulty) link, then a message sent by p? to p? at time t arrives
at p? at some time t' Ei (t..t + ?. We call ? the maximum message delivery delay.
For simplicity, we assume that every request elicits a response. ?`e are only
interested in identifying the costs inherent in primary-backup protocols, and so we
also assume that it takes no time for a server to compute a response. However, our
lower bounds are not invalidated for servers that require a substantial amount of
time to compute a response. We consider the following hierarchy of failures:
Crash failures: A server may fail by halting prematurely. Until it halts, it behaves
correctly [LF821.
Crash+Link failures: A server may crash or a link may lose messages (but links
do not delay, duplicate or corrupt messages).
Receive-Omission failures: A server may fail not only by crashing, but also by
omitting to receive some of the messages directed to it over a non-faulty link [PT86].
Send-Omission failures: A server may fail not only by crashing, but also by
omitting to send some of the messages over a non-faulty link [Had84].
4The asyn7hronous protocols mentioned in the literature guarantee only safety. However, it is
possible to construct safe and live implementations in an asynchronous system augmented with
a faiture--Hdetector [BJ87,SS83j. We do not consider these in this thesis.
5Clearly, our lower bounds hold even if server clocks are only approximately synchro-
nized [LMs84]. Furthermore, by using the techniques in [NT87], our protocols can be aut?
matically extended to the case in which clocks are only approximately synchronized.
11
General-Omission failures: A server may exhibit send-omission or receive-omission
failures [PT86i.
Failures are counted by the number of failing components (either servers or
links). A protocol tolerates f failures of a given model if it works correctly despite
faulty behavior (as prescribed by that model) of up to f components.
The rest of the thesis is structured as follows. Chapter 2 shows that it is impos-
sible to construct primary--Hbackup protocols in asynchronous systems. Chapter 3
gives a precise specification of the primary--Hbackup approach, Chapter 4 gives our
lower bound results, and Chapters 5 and 6 give our upper bound results. We
divide our upper bound results into two separate chapters because our protocols
can be divided into two broad classes, and we address these two classes of proto-
cols separately. Chapter 7 describes our implementation, and Chapter 8 gives a
brief survey of existing primary--Hbackup protocols and compares the costs of these
protocols with the results in this thesis. We conclude in Chapter 9.
Chapter 2
Impossibility of Building
Asynchronous, Fault-Tolerant
Services
had mentioned in Chapter 1 that this thesis only concentrates on synchronous
primary--Hbackup protocols because it is impossible to guarantee both safety and
liveness in asynchronous protocols. In this chapter, we show this impossibility
result.1 In fact, the result we show is more general--Hwe show that the impossibility
holds for all protocols to building fault--Htolerant services, not just primary--Hbackup
protocols.
Consider the following client--Hserver system. A client c can send a request
by calling a function RE?UEST(c,op,[ar91... argn]), where op is the request and
....... arg? are the (possibly empty) list of arguments to op. A response (if any)
1A similar result for a shared memory environment was shown by [ller9l].
12
13
returned by this request should satisfy the following two properties:2 safrt?the
response is "consistent" with the responses returned earlier; and liveness--Hif a
correct client calls the function, then the function returns a response within a
finite (but possibly unbounded) amount of time.
The implementation of RE?UEST can depend on the way the client-server system
is structured. For example, if the system is structured using the state machine
approach, then RE?UEST can atomically broadcast the request to all the servers
that implement the system. If the system is structured using the the primary--H
backup approach, then RE?UEST might need to send more than one copy of the
request, perhaps to different primaries, in order to ensure that at least one of these
copies eventually generates a response to the request. In either case, RE?UEST can
compute the final response from the replies received, and return this response to
the client.
Clearly, we would like to construct such a RE?UEST function for any practical
client--Hserver system, i. e. we would like it to satisfy the above safety and liveness
properties. In fact, such systems ensuring both safety and liveness can be built in a
synchronous environment with bounded number of failures [AD 76, Bar81,BENl9 1 
Sch90]. However, we show in this chapter that in an asynchronous environment
with the possibility of failures, such a system cannot be built even for some simple
safety properties.3 Furthermore, our result holds even for partially asynchronous
environments, i.e. environments in which process speeds are bounded but mes-
sage delays are unbounded. We prove our result by showing that a safe and live
implementation of a simple client--Hserver system (called the History system) can
2More formal definitions appear later in the chapter.
3There are, however, safety properties for which implementations exist.
14
be used to solve the consensus problem [Fis83]. The result then follows from the
impossibility of consensus in asynchronous environments [FLP85, D DS87].
The rest of the chapter is organized as follows. Section 2.1 gives a specification
of the History system, and Section 2.2 shows the impossibility of constructing this
system in an asynchronous environment with the possibility of failures.
2.1 Specification of the History system
Informally, a History system IL behaves as follows: on receiving a client request, it
just returns the sequence of requests that have been received until then. Clearly,
the semantics of IL are extremely useful as any other deterministic system can
be implemented using IL given an initial state and the sequence of requests that
have been received, any response can be computed depending on the semantics of
the system.
More precisely, in IL, a client c can initiate a request by calling the function
REqUEST(c, op), where op is the next request (assume, for simplicity, that the ar-
gument list is empty). Consider the following (canonical) implementation of this
History system using a single non--Hfaulty server 5 (see Figures 2.1 and 2.2).
function RE?UEST(c, op)
send <c, op> to server 5
when received res from s
return res
Fi ure 2.1: The function RE?UEST
In both the figures, the statement "when G: S" causes the program to block
until the condition G is true, at which point statement S is immediately executed.
15
sequence			e
do forever
when received <c, op>:
sequence sequence 0 op
send sequence to client c
od
Fi?ure 2.2: Protocol executed bv server 5
For example, the condition "received res from 5" causes the function to block until
a response sent by 5 is received from the underlying message system. In addition,
when this response is received, the value of res is set to the response received from
5.
The above implementation guarantees that if a correct client initiates a request,
then the client will eventually receive a response since 5 is non--Hfaulty. We now
require any History system ? to "behave" like the one implemented 011 this non--H
faulty server 5. More formally, let the run of any History system consist of two
kinds of events--H send(c, op, t) if client c initiates REqUEST(c, op) at time t, and
recv(c, op, res, t) if client c receives a response res at time t to the above request.
We assume that a client initiates a new request only after the response to the
earlier request (if any) has been received. Let ::?? be the set of all infinite runs of
an history system implemented on 5, and let ?? be the set of infinite runs of any
other History system TL. Then our criterion for correctness of TL is:
Consistency The History system It is equivalent to the one implemented on the
non--Hfaulty server 5, i.e. ::? c
Notice that consistency captures both safety and liveness.
16
2.2 Impossibility Result
We now show that it is impossible to construct TL in an asynchronous message
passing environment even if at most one of the clients or the servers can fail by
crashing. We assume that between any two processes, there is a reliable FIFO link.
In order to prove our impossibility result, we need to first define the consensus
problem. Consider a system consisting of n > 1 processes, where each process
might be correct or fautty. Each process p has an initial value Vp, and p may
irrevocably decide on a value based on the initial values. Consensus must satisfy
the following three properties:
Termination: Every correct process eventually decides some value.
Validity: If a process decides v, then v is the initial value of some process.
Agreement: No two correct processes decide differently.
It has been shown in [FLP85,DDS87j that there is no deterministic solution to
consensus in an asynchronous environment that is subject to even a single crash
failure. Suppose (for contradiction) now that ?i can be constructed using ?s > 1
servers. We show how to derive a consensus protocol using ? for the n = + 2
process system consisting of the ?s servers and 2 clients (say c?and cj)that tolerates
a single crash failure. The impossibility of consensus now shows that such a system
Ii cannot exist, proving our result.
The two client processes run the consensus protocol in Figure 2.3 (for some
implementation of RE?UEST that ? provides) and each of the fls servers, in ad-
dition of running the protocol corresponding to ?, run the consensus protocol in
Figure 2.4.
17
res			RE?UEST(ci, vi)
if res = Vj then
decide Vj
send (decide, vi) to all servers
else
Let res be of the form v 0 Vj
decide v
send (decide, v) to all servers
Figure 2.3: Protocol run by client Cj with initial v4ue v? (symmetric for ?)
when received (decide, v)
decide v
Figure 2.4: Consensus protocol executed by each server, in addition to executing
the rotocol corres ondin to ? ___________________________________________
A client first sends a request corresponding to its initial value to the service 7t,
and waits for a response. Since ? returns the sequence of requests that have been
received, client Ci (symmetric for cj) can receive only two possible responses--Hv?
or v? 0 v?, where Vj is the initial value of Cj. In the first case, Ci decides on v? and
in the other case it decides on v5 (intuitively, the decision value corresponds to
the initial value of the client who "first" sent its request). Client Cj then sends its
decision value to the servers, and the servers decide on the value they receive.
We now informally argue that the above protocol satisfies Termination, Validity
and Agreement and thus solves consensus. The formal proof appears in the next
section.
Since, there can be only a single failure, at least one client (say ci) is correct
and therefore ci eventually decides. It will send its decision value to all the correct
servers, and so all the correct servers also decide. Thus, Termination holds. Validity
holds trivially since each process can only decide on Vj or Vj. Agreement can be
violated only if the the two clients send different decision values to the servers.
However, this cannot happen because if client Cj receives the response Vj, then cj
can only receive the response Vj 0 Vj, and vice versa.
2.2.1 Formal proof of correctness
?Ve now formally show that the protocol in Figures 2.3 and 2.4 satisfies Termina-
tion, Validity and Agreement and thus solves consensus.
Lemma 2.2.1 The only responses that client Cj (symmetric for cj) can receive are
Vj and Vj 0 Vj.
Proof: Follows from Consistency.
Theorem 1 The protocol in Figures 2.3 and 2.1 satisfies Termination.
E
Proof: From Consistency, a correct client eventually gets a response to REqUEST.
Therefore, from Lemma 2.2.1 and Figure 2.3, a correct client decides.
There must exist a correct client (say ci) since there can be at most one fail-
ure. Since c? decides, it must send (decide, --H) to all the correct servers, who will,
therefore, also eventually decide (from Figure 2.4). Thus, Termination holds. 0
Theorem 2 The protocol in Figures 2.3 and 2.1 satisfies Validity.
0
Proof: Follows from Lemma 2.2.1 and Figures 2.3 and 2.4.
19
Theorem 3 The protocol in Figures 2.3 and 2.4 satisfies Agreement.
Proof: We first show that the two clients cannot decide differently. If Vj = Vj,
then the result holds from Lemma 2.2.1 and Figure 2.3. Therefore, assume that
Vj $ Vj.
If client Cj (symmetric for cj) decides Vj, then from Lemma 2.2.1 and Figure 2.3,
it must have received the response Vj. From Consistency and Lemma 2.2.1, Cj can
only receive the response Vj OVj, and therefore, can only decide on Vj. On the other
hand, if Cj decides Vj, then it must have received the response Vj 0 Vj. In this case,
can only receive the response t'j, and therefore, can only decide on
We next show that the correct servers decide as any correct client (say ci). If
Cj decides on v (v E (vi,vj1), Cj must send (decide, v) to all the servers. By the
above argument, Cj can only decide (if it ever decides) on v, and therefore cannot
send (decide, v'), ?? $ v to the servers. Agreement now follows. E
Thus, we have the following impossibility result.
Theorem 4 There is no implementation of the History system that tolerates even
a single crash failure in an asynchronous environment.
Chapter 3
Specification of the
Primary--HBackup Approach
As we mentioned in Chapter 1, before we can discuss our lower bound and upper
bound results, we need to give a precise specification of the primary--Hbackup ap-
proach. Our specification consists of four properties--HPbl-Pb4. Section 3.1 gives
a formal statement of the properties Pbl-Pb3, but only an informal (and more
intuitive) statement of Pb4. The description in Section 3.1 is sufficient to describe
our lower bounds, and we need a formal description of Pb4 (given in Section 3.2)
only when we discuss upper bounds. The reader may, therefore, choose to omit
Section 3.2 in the first reading, and come back to it later before reading Chapter 5.
3.1 Specification
The lower bound and upper bounds results presented later in this thesis apply to
any protocol that satisfies the following four properties. The first property states
that no more than one server can be the primary at any time.
20
21
Pbl: There exists a local predicate Prmy? on the state of each server s. At any
time, there is at most one server s whose state satisfies Prmy?.
For brevity, whenever we say that `?s is the primary (at time t)" we mean that the
state of s satisfies Prmy? (at time t). Note that we do not specify any constraints
on Prmy? except that each server can locally determine whether it is the primary.
The second property states that a client sends requests only to the server it
believes to be the primary.
Pb2: Each client z maintains a server identity Dest? such that to make a request,
client i sends a message (only) to Dest?.
Pb2 therefore implies that the request can be received by at most one server. Of
course, if client i gets notified that Dest? has failed, then it can update Dest? and
direct subsequent requests to another server--Hthe server pointed to by the new
value of Dest?.
For the next property, we model the communications network by assuming that
client requests are enqueued in a message queue of a server. This property then
states that only the primary can enqueue client requests.
Pb3: If a client request arrives at a server that is not the primary, then that request
is not enqueued (and is therefore not processed by the server).
It may appear that Pb3 and Pbl eliminate the need for Pb2. However, this
is not the case because the primary can change while the client is sending the
(multiple) copies of the request, resulting in the request getting enqueued (and
therefore received) at more than one site. Some of our lower bounds will not hold
in this case.
22
Properties Pbl--HPb3 specify a protocol for client interactions with a service,
but not the semantics of the service. For example, the properties do not rule out
a primary that ignores all requests. We eliminate such trivial implementations as
follows.
A request sent to a primary--Hbackup service can be lost if it is sent to a faulty
primary. Periods during which requests are lost, however, are bounded by the
time required for a backup to take over as the new primary. Such behavior is an
instance of what we call bofo (bounded outage finitely often). Informally, we say
that an outage occurs if some client sends a request but does not receive a response.
A (k, A)-bofo server is one for which all outages can be grouped into at most k
periods, each period having duration of at most A.i The final property of the
primary--Hbackup protocols is that they implement a (k, A)--Hbofo server for some
values of k and A.
Pb4: There exist fixed and bounded values k and A such that the service behaves
like a single (k, A)--Hbofo server.
A more formal statement of Pb4 appears in Section 3.2. Given Pb4, the RE?UEST
function described in Chapter 2 can be implemented by retrying a request until
a response is received. Since k and A are both bounded, such a response will
eventually be received.
3.2 Formal statement of Pb4
We now give a more formal statement of Pb4. We model execution of any client-
server system by a run, which is a sequence of timestamped events involving clients,
`Therefore, as well as being finite, the number of such periods of service outage is also bounded
(by k).
23
servers, and the message queues. These events include sending messages, enqueuing
messages, receiving messages, and modeling computation at processes. Further-
more, define the response time of a system as follows: if a client sends a request and
receives the corresponding response, then the interval between sending the request
and receiving the response is never larger than the response time.
Property Pb4 requires that a primary--Hbackup service behaves like a (k, ?)--Hbofo
server for some values of k and ?. In order to do this, we will define "behaves like"
in terms of indistinguishability as seen by dients--Hto the clients, the responses
from the service appear as if they were sent by some (k, A)-bofo server. One con-
sequence of this definition of indistinguishability is that it requires the clients to see
a response time of 2? (if a response is received) because this would be the response
time of a bofo server in a system with message delivery time of ??2 However, the
response time of a primary--Hbackup service can be greater than 2? because the pri-
mary may need to communicate with the backups before it can respond to a client.
Thus, it would seem that the clients would be able to distinguish the service from a
bofo server. The way we account for this is by adding another parameter, denoted
p, to the definition of a bofo server, and requiring the primary--Hbackup service to
have like a (p, k, A)--Hbofo server. Informally, this means that to the clients, the ser-
vice behaves like a bofo server running is a system with message delivery time of
p. Clierits expect the service to have a response time of 2p, and by defining p to be
large enough, we can make the response time of the primary-backup service equal
to 2p. In fact, our results can be optimized so that the clients expect the service
to have a response time of p + C + 6, where C is some constant that equals the
2For simplicity, we assume in this paper that any bofo server running in a system with message
delivery time of 6 has a response time of 26. However, our results can be generalized to servers
whose response time is some other function of 6.
24
blocking time of the primary--Hbackup protocol. This would, however, require that
we model the bofo server to be running in an asymmetric system--Hthe message
delay from the client to the bofo server is p, and from the bofo server to the client
is C + ?. In order to simplify the presentation, we do not address this optimization.
We now define a (p, k, ??bofo server more precisely. An outage of a bofo server
P occurs at time t iff a client sends a request to P at time t --H p that is not enqueued
by P by time t, or if P attempted to send a response at time t but omitted to do
so. P is a (p, k, ?)--Hbofo server iff in all runs of this server, the message delivery
time is p, and all outages of P can be grouped in at most k intervals, with each
interval having a duration no longer than ?.
Let ?p be the set of runs of any (p, k, ?)--Hbofo server P for some values of p, k
and ?, and ?PB be the set of runs of a primary-backup service PB. Let ap E ?p
and aPB E ?PB be any two runs in the respective sets. ?`e say that for some
client i, ap ?? ap? iff
1. client ? sends a request m at time t in ap iff client ? sends the same request
m at time tin apB and
2. client i receives a response r at time t in ap iff client i receives the same
response r at time tin aPB.
We can now formally state Pb4.
Pb4: Let ?p be the set of runs of any (p, k, ?)-bofo server P for some values of
p, k and A, and ?pB be the set of runs of the primary--Hbackup service PB.
Then ?P : Vap? : apB E ?pB : ?ap : ap E ?p : Vi : ap NNj ap?
Chapter 4
Lower Bounds
In this chapter, we give the lower bounds for the degree of replication, blocking
time and failover time for the primary-backup protocol specification in Chapter 3.
Recall that a run of a client--Hserver system is a sequence of timestamped events
involving clients, servers, and message queues, We say that two runs ai and a2 of
the system are indistinguishable to a process p if the same sequence of events (with
the same timestamps) occur at p in both al and a2. We assume that servers are
deterministic: if two runs ai and a2 are indistinguishable to server s and s has the
same initial state in both runs, then at any time t the state of 5 at tin a1 is the
same as the state of s at tin a2. We make this assumption in order to simplify the
discussion our results hold for non-deterministic servers as well.
We also assume that the clients can send any request at any time. If we impose
restrictions on the behavior of the clients, then we can derive protocols that violate
the lower bounds in this thesis.
Define ? to be the potential causality relation [Lam78] on server events ei and
e2. Thus ? is the transitive closure of the following relation ?: e1 ? e2 iff both
25
26
et and e? occur at the same server 5 and el occurs before e?, or el is a send event
and e? is the corresponding receive event.
?Ve assume that clients can send two kinds of requests--Hupdate and non-update.
Informally, an update request m changes the state of the service in such a way that
the responses to subsequent requests depend on m. On the other hand, if m is
a non--Hupdate request, then responses to subsequent requests need not depend on
rn. More formally? let m be a request with associated response r, and e(m) and
e(r) be the events in the run associated with the receipt of m and the sending of r
respectively. If m is an update request, then for all request/response pairs m'/r',
where m was sent after r, e(m) ? e(r'). ?`e will assume that update requests
exist.
4.1 Bounds on Replication
This section gives the lower bounds on the degree of replication i. e. the relationship
between n? and f. The first bound is obvious.
Theorem 5 Any primary--Hbackup protocol tolerating f crash failures or send-
omission failures requires n? > f + 1.
The following lemma is used in the rest of the theorems in this section.
Lemma 4.1.1 Consider any protocol that satisfies Pb4. Suppose two disjoint and
nonempty sets of servers A and B can be found that meet the following three prop-
erties:
1. There exists a run aa containing R > 2k[?/d\ requests where d is the min-
imum time between the sending of any two client requests (d > 0). Further-
27
more, in this run the servers in A do not crash and all other servers crash
at time 0.
2. There exists a run ab containing R requests. Furthermore, in this run the
servers in B do not crash and all other servers crash at time 0.
3. There exists a run ?ab containing R requests. Furthermore, the servers in A
and B do not crash, ?ab is indistinguishable from aa to all servers in A, and
?a6 is indistinguishable from a6 to all servers in B.
Then, at least one of the above runs violates Pb2.
Proof: Suppose for contradiction that the lemma is false and runs ?a, ?b and
aab all satisfy Pb2.
By Pb4, in aa at least R --H k [?/d\ of the requests must have been received by
servers in A. This is because the number of requests that cannot have responses
must fall into at most k intervals of length at most ?, and each interval of ? can
contain at most [?/(f\ requests.
Similarly, in a?, at least R --H k [?/ff\ of the requests must have been received
by servers in B. Finally, since aab is indistinguishable from 0a to servers in A, they
must execute the same number of receive events in both runs. The same holds for
the servers in B. By Pb2, each request is sent to at most one server and so at least
2(R --H k [?/d\) requests must have been sent in Oab. Since only R requests were
sent, we must have R > 2(R --H k[A/ff\), or R < 2k [AId], which contradicts the
assumption that R >2k [A/ff\.			E]
Theorems 6 and 7 depend on two parameters of primary--Hbackup protocols. Let
F be the maximum time that can elapse between any two successive requests from
2S
non-faulty clients. Let D bound the time it takes for a client to learn the identity
of a new server, so if some server 5 becomes the primary at time to and remains
the primary through time t > to + D when a correct client Cj sends a request, then
Dest? = 5 at time t. We assume that if 1' is bounded then D < F, which implies
that the service must be able to detect the failure of a primary and disseminate the
new primary's identity to the clients without using messages from the clients. The
condition D < F is necessary in order to prove our lower bounds, because as we
show in Section 4.4.1, one can construct a protocol that violates our lower bound
if D > F.
With both send-omission failures and crash+link failures, messages may fail to
reach their destinations. The following theorem shows that crash+link failures are
more expensive to tolerate, because they require more replication.
Theorem 6 Suppose there is at most one link between any two servers. Then any
primary--Hbackup protocol tolerating f crash?hnk failures and having D < F requires
> f +2.
Proof: For contradiction, assume the existence of a protocol P with n? < f +2.
We will show that P has three runs aa, ab and aab that satisfy the conditions of
Lemma 4.1.1. From the lemma, at least one of these runs violates Pb2, which
implies that P cannot be a primary--Hbackup protocol.
Let A be a set containing the one server 5a and let B be the set of remaining
servers. Since IA = 1 and IBI = n --H 1 < f, A and B can become disconnected by
link failures.
We first construct the run aab in which no server crashes, postulating that the
links between the servers in A and B are faulty and do not deliver any messages.
29
As required by Lemma 4.1.1, clients send a total of R > 2k[?/J\ requests. Let
o < d < V --H D be the minimum interval between any two such requests. ?Te
postulate that a request will be sent at time t iff no request has been sent during
the interval [t --H d. .t) and one of the following rules hold.
1. A server 5 is the primary during the interval [t --H D..tj. This request arrives
immediately and is enqueued (at 5, by Pb3 and the definition of D).
2. There is no primary at time t. This request arrives immediately and by Pb3
will never be enqueued at any server.
3. A server s is the primary at time t but another server s? is the primary
immediately after time t. If this request is sent to 5, then it arrives after t,
and if it is sent to any other server, then it arrives immediately. In both cases,
it arrives at a server that is not the primary, and so will not be enqueued
(again by Pb3).
Note that, by construction, the maximum interval between any two client re-
quests is D + d. This interval occurs when a server 5 becomes the primary just
before d after a client message is sent, and 5 remains the primary for at least D.
Hence, the client will be able to send R requests within time R(D + d). This
completes the construction of ?ab.
We now construct aa and a?, recalling that in aa all of the servers except 5a
crash at time 0, and in ab server 5a crashes at time 0. The clients send the same
requests and at the same times in aa and in a? as in aab. Furthermore, we construct
aa and ab in such a way that these requests will arrive at the servers according to
the same rules used in constructing aab.
30
Since 5a does not receive any messages from servers in B in either ?a6 or aa,
these two runs are indistinguishable to 5a as long as it receives the same client
requests at the same times in both runs. NVe show that this is the case by showing
that the same client requests are enqueued, at the same time, at 5a in both the
runs.
Assume that this is true for the first ? --H 1 client requests that are enqueued and
assume that the jth client request m is enqueued at time t. Since the jth request is
enqueued only at time t, the two runs are indistinguishable to 5a until time t.
NVe first show that if rn is enqueued at 5a in ?a6, then it is enqueued at the
same time in aa as well. Since Tn was enqueued at 5a, ? must have been sent by
rule 1. By rule 1, 5a must have been the primary through [t --H D..tl in ?a6 and
therefore, by indistinguishability, in aa as well. By the definition of D, m would
have been enquened at 5a at time tin ?a as well.
Assume now that m was enqueued at sa at time t in aa and m was sent at
time t' < t. Suppose m was sent by rule 2. Then, by construction, there was
no primary at time t'in ?ab, and therefore, by indistinguishability, in ?a as well.
This implies (by Pb3) that m couldn't have been enqueued at ?a, a contradiction.
Now suppose that m was sent by rule 3. First assume that 5 = 5a, where 5 is
the server as defined in rule 3. Then by the construction in rule 3, m arrives
immediately after t' in aa. In ?ab, 5a was not the primary immediately after ??,
and so by indistinguishability, in aa as well. This implies that m couldn't have
been enqueued at 5a, a contradiction. So 5 $ 5a Then, by construction in rule 3,
m arrives immediately. Since 5a was not the primary at time ?? in ?ab (s was the
primary), by indistinguishability, it is not the primary in aa as well. This again
31
implies that m couldn't have been enqueued at 5a, a contradiction.
Therefore, the only possible rule by which m could have been sent is rule 1.
Since m was enqueued at ?a, ?a is the primary at time tin aa (by Pb3). By
indistinguishability, 5a is the primary in ?a6 as well. However, since m was sent by
rule 1, 5a must have been the primary through [t --H D..t] in aab. By the definition
of D, m would have been enqueued at 5a at time t1 in aab as well and we are done.
We can similarly prove that a6 and aaa are indistinguishable to the servers in
B. Thus, by Lemma 4.1.1 P cannot be a primary-backup protocol.
The next theorem states that additional replication is required in order to
tolerate receive-omission failures. The proof is similar to that of Theorem 6, and
so we will omit the details which are common to the two proofs.
Theorem 7 Any primary--Hbackup protocol tolerating f receive?omission failures
and having D < E requires ns > [132J.
Proof: For contradiction, assume the existence of a protocol P with n? < ?132J.
We will show that P has three runs aa, ab and aab that satisfy the conditions of
Lemma 4.1.1. From the lemma, at least one of these runs violates Pb2, which
implies that P cannot be a primary--Hbackup protocol.
Let A and B be two disjoint sets containing up to servers each, and C be
a set containing the rest of the servers.
We first construct the run ?ab in which no server crashes, postulating that the
servers in A and B are faulty by receive?mission (we can do this since lAUBI <f).
The servers in A do not receive any messages from servers in B u C and servers in
B do not receive any messages from servers in Auc. As required by Lemma4.l.l,
32
clients send a total of R > 2k[?/d] requests in 0'ab, and these requests are sent
according to the same three rules as in Theorem 6. This finishes the construction
of 0'ab'
?Ve now construct 0'a and 0'b, recalling that in 0'a all of the servers except those
in A crash at time 0, and in ab all of the servers except those in B crash at time
0. Again, the construction of these two runs is exactly the same as in Theorem 6,
and so we omit the details,
Since 5a E A does not receive any messages from servers in B u c in either 0'ab
or 0'a, these two runs are indistinguishable to 5a as long as it receives the same
client requests at the same times in both runs. ?`e can show that this is the case
by exactly the same argument as in Theorem 6. Similarly for ?6 ? B. Thus, by
Lemma 4.1.1 P cannot be a primary-backup protocol and we are done. E
The next lower bound holds independent of the relation between D and F.
Theorem 8 Any primary--Hbackup protocol tolerating f generat-omission failures
requires n? > 2f.
Proof: Assume for contradiction that there is a protocol for n? < 2f. Partition
the servers into two disjoint sets A and B of size at most f each. We will construct
two runs ai and 0'2. In each run, one set of servers will be faulty and the other set
will be non-faulty.
a1: The servers in A are faulty and fail to communicate with all servers in
B, but behave correctly otherwise. Clients send update requests until the first
response is sent (this must happen, by Pb4). Assume that the first response r to
an update request m is sent at time t. Say that this response is sent by server 5.
33
a2: The same as ai up to time t, but if s is in B, then in ?2 it is the servers in B
that are faulty and fail to communicate with all servers in A rather than the servers
in A that are faulty. In either case, r is sent by a faulty server. Furthermore, no
server can distinguish al from a2 through time t and therefore, the first response
r is sent at time tin ?2 as well.
Let all of the faulty servers in a? crash immediately after r is sent and have
I.
clients continue to send requests until another response r is sent. This response
must have been sent by a non-faulty server which implies that ?(e(rn) ?
However this violates the fact that m is an update request. c
4.2 Bounds on Blocking
This section gives the lower bounds on blocking times. Informally, a blocking
primary-backup protocol is one in which the primary must, after receiving a request
m, either receive a message from another server or simply wait an interval before
it can respond to m. More formally:
Definition 4.2.1 C--Hblocking protocol: Consider a failure-free run of a primary-
backup protocol. Let the time that a request is received be trn, and the time that
the corresponding response is sent be tr. We say that this protocol is C--Hblocking
if tr - tm <C.
For example, any primary-backup protocol in which the primary sends informa-
tion about a request to the backups and waits for acknowledgement before sending
the response to the client will be at least 2?--Hblocking.
34
?Ve now give the lower bounds on C for the various models of failures. However
before we do so, we prove the following lemma.
Lemma 4.2.1 Consider a primary-backup system in which D < F and the n?
servers can be divided into two disjoint sent 5 and 5,. Assume that two runs ?
and a' exist with the following properties:
o+ The two runs are identical until time t1.
o+ In a, the servers in 5' crash at time t1.
o+ In a?, the servers ?n 5 do not get any messages from servers in 5, that were
sent after time t1.
o+ Let t2 be some time greater than t1. In the interval [t1..t2J, clients send
the same requests (at the same time) in both a and a'. Furthermore, these
requests are sent to the same servers in both the runs, and they arrive at the
same times in both the runs.
o+ After time t2, the clients send requests in a until a response occurs. These
requests ares sent at intervals of at least d where 0 < d < F --H D as follows.
A request is sent at time t f no request has been sent in [t --H d. .t) and one of
the following rules hold.
1. A servers E 5 is the primary during the interval [t--HD..t1. This request
arrives immediately and is enqueued (at 5, by Pb3 and the definition of
2. There is no primary in 5 at time t. This request arrives immediately
and by Pb3 will never be enqueued at any server in B.
35
3. A server 3 E S is the primary at time t but another server 3' Ei S is
the primary immediately after time t. ff the request is sent to 3, then it
arr'ves after t, and if it is sent to any other server it arrives immediately,
In both cases, it arrives at a server that is not the primary, and so will
not be enqueued (again, by Pb3).
o+ After time t2, the clients send requests in o" at the same times as they were
sent in a. Furthermore, these requests arrive according to the same three
rules above.
Then, the runs a and a' are indistinguishable to the servers in 5.
Proof: Since 3b Ei 5 does not receive any messages from servers in 5, that were
sent after t1 in either a or a', these two runs are indistinguishable to 5b as long as
it receives the same client requests at the same times in both runs. We show that
this is the case by showing that the same client requests are enqueued, at the same
time, at 5b in both the runs.
Assume that this is true for the first i --H 1 client requests that are enqueued and
assume that the jth client request rn is enqueued at time t. Since the ith request is
enqueued only at time t, the two runs are indistinguishable to 5b until time t.
We first show that if m is enqueued at s?in a, then it is enqueued at the same
time in ?? as well. This is easily true for any request sent until time t2, because
these requests are sent to the same servers in both the runs. So we assume that rn
was sent after t2. Since Tn was enqueued at 56, ? must have been sent by rule 1 in
the lemma. By rule 1, sb must have been the primary through [t --H D..tj in a and
therefore, by indistinguishability, in a1 as well. By the definition of D, m would
have been enqueued at ?b at time t in ?? as well.
36
Assume now that m was enqueued at s6 at time tin a' and m was sent at time
< t. Suppose m was sent by rule 2. Then, by construction, there was no primary
in S at time t' in a, and therefore, by indistinguishability, in a' as well. This implies
(by Pb3) that m couldn't have been enqueued at ?6, a contradiction. Now suppose
that m was sent by rule :3. First assume that a = 5b, where a is the server as defined
in rule 3. Then by the construction in rule 3, m arrives immediately after t' in a'
In a, 5b was not the primary immediately after t', and so by indistinguishability, in
a' as well. This implies that m couldn't have been enqueued at 5b, a contradiction.
So a $ 5b Then, by construction in rule 3, rn arrives immediately. Since ?6 was
not the primary at time t'in a (a was the primary), by indistinguishability, it is not
the primary in a' as well. This again implies that m couldn't have been enqueued
at a?, a contradiction.
Therefore, the only possible rule by which m could have been sent is rule 1.
Since m was enquened at 5b, ?b is the primary at time t in a' (by Pb3). By indis-
tinguishability, 5b is the primary in a as well. However, since m was sent by rule
1, 5b must have been the primary through [t --H D..t] in a. By the definition of D,
m would have been enqueued at 5b at time t' in a as well and we are done. E
Crash failures and crash+link failures have the trivial lower bound of C = 0.
We, therefore, start with the lower bound for receive?mission failures.
Theorem 9 Any primary--Hbackup protocol tolerating f receive-omission failures
with f> 1, fls < 2f and D < F is C-blocking for some C > 2?.
Proof: For contradiction, suppose there is a primary--Hbackup protocol for ?s ?
2f and f > 1 that is C--Hblocking where C <2?. Partition the servers into two sets
37
A and B where IA = f and IBI = n --H f < f. We construct three runs. In all
three runs, assume that all server messages take 6 to arrive.
a1: There are no failures and all client requests take 6 to arrive. Moreover,
clients send update requests until some update request m evokes a response r. Let
rn be received at time tm by server 5a e A and r be sent at time tr by a server
5a' ? A (5a' could be the same as sa). Notice that since the protocol is C--Hblocking
where C < 26, tr --H tm < 26. Also since, by construction, all requests take 6 to
arrive, all client requests sent after time tm + 6 will be received after time tr.
a2: Identical to al until 5a receives m at time tm. At this point in a?, all servers
in A are assumed to crash and clients are assumed to send no request during the
interval [tm + 6..tr]. Finally, after time tr clients are assumed to repeatedly send
requests according to?the three rules in Lemma 4.2.1 with B = S and A = 5'
As required by Lemma 4.2.1, there will be a response (say r') in a?, and by
construction it must be from a request sent by rule 1.
a3: The same as a?, except that the servers in A do not crash at time tm.
Instead, the servers in B commit receive failures on all messages sent after tm by
servers in A. Clients send requests at the same times as in a? which arrive using
the same rules as a?.
Now, consider these three runs. By construction, the runs are identical up to
time trn. Since all server messages take 6 to arrive, clients cannot distinguish ai
and a? through tm +6, and so clients send the same requests to the same servers in
both ai and a? through tm +6. Similarly, since all server messages take 6 to arrive,
the servers in B cannot distinguish between al and a? through trn +6. Therefore,
since tr --H tm < 26, 5a (the server that received request m at time tm in ai) and
38
5a' (the server that sent response r at time tr in at) cannot distinguish between
at and a? through time tr, and ?? 5a' sends response r in a? as well. We will now
show that servers in B cannot distinguish a2 and a3, and so response r also occurs
in a3. However, ?(e(m) ? e(r')) which will violate the assumption that m is an
update request.
However, a? and a? can be shown indistinguishable to the servers in B by using
Lemma 4.2.1 with 5 = R, 5' = A, t1 = tm, t2 = tT, a = a? and a' = a?. All the
conditions of the lemma can be seen to be true, and we are done. E
In run a? of the above proof, a correct primary (5a in set A) becomes the
backup, while a faulty server from set B becomes the primary in 5a'5 place. We
show below that it is always possible to construct such a run.
Theorem 10 Any primary--Hbackup protocol tolerating f receive-omission failures
with ?s < 2f and D < F has a run in which a correct primary cedes to a faulty
backup.
Proof: Partition the servers into two sets A and B where IA = f and IBI =
n --H f < f. We construct three runs. In all three runs, assume that all server
messages take 6 to arrive.
at: There are no failures and all client requests take 6 to arrive. Moreover,
clients send requests until some request m evokes a response r. Let m be received
at time tm by server 5a C A.
a2: Identical to at until 5a receives m at time tm. At this point in a?, all servers
in A are assumed to crash. After time tm, clients are assumed to repeatedly send
requests at intervals of at least d where 0 < d < F --H D and these requests are sent
39
according to the three rules in Lemma 4.2.1 with 3 = B and 3' = A.
As required by the lemma, there will be a response (say r') to some request ?`
in a?. Assume that ?? was received by some server ?6 E B.
a3: The same as a?, except that the servers in A do not crash at time tm.
Instead, the servers in B commit receive failures on all messages sent after tm by
servers in A. Clients send requests at the same times as in a? which arrive using
the same rules as a?.
Now, consider these three runs. By construction, the runs are identical up to
time tm. Furthermore it can be seen that all the conditions in Lemma 4.2.1 are
satisfied for 3 = B, 3? = A, t1 = tm, t2 = tm, a = a? and a' = a?. There-
fore, a? and a3 are indistinguishable to ?6 and so response occurs in a? as well.
However, this implies that in a? there was a primary 5a in A before time tm, and
there was a primary 56 Ei B after tm. Since 5a is correct and ?6 is faulty, are done. C
The above theorem shows a disconcerting property: there does not exist a
primary--Hbackup protocol that tolerates receive-omission failures with fls < 2f in
which a primary cedes only when it fails. Moreover, this lower bound is tight--Hin
Chapter 5, we give a receive-omission primary--Hbackup protocol with n? = 2f + 1
in which a primary cedes only when it fails.
And, if f = 1, then the following theorem holds: its proof is similar to the proof
of Theorem 9, except that 5a =
Theorem 11 Any primary--Hbackup protocol tolerating receive-omission failures with
f = 1 and n3 < 2f and having D < V is C-blocking for some C > 6.
40
Proof: For contradiction, suppose there is a primary--Hbackup protocol for ?s = 2
and f = 1 that is C--Hblocking where C < ?. Partition the servers into two sets A
and R where A = 1 and IBI = 1. We construct three runs. In all three runs,
assume that all server messages take ? to arrive. Clearly, in all these runs, if a
server p receives a request and a server q sends the corresponding response, then
p = q as otherwise the assumption C < ? will be violated (the message from p to
q will take ? to arrive).
al: There are no failures and all client requests take 6 to arrive. Nioreover,
clients send update requests until some update request m evokes a response r. Let
rn be received at time tm by server ?a E A and r be sent at time tr by Sa. Notice
that since the protocol is C--Hblocking where C < 6, tr --H tm < 6. Also since, by
construction, all requests take 6 to arrive, all client requests sent after time tm will
be received after time tr.
a2: Identical to a1 until 5a receives m at time tm. At this point in a2, the
server in A (which is 3a) is assumed to crash and clients are assumed to send no
request during the interval [tm. .tr]. Finally, after time tr clients are assumed to
repeatedly send requests according to the three rules in Lemma 4.2.1 with B = S
and A = 5'
As required by Lemma 4.2.1, there will be a response (say r') in a?, and by
construction it must be from a request sent by rule 1.
a3: The same as a2, except that the servers in A do not crash at time tm.
Instead, the server in B commits receive failures on all messages sent after tm by
the server in A. Clients send requests at the same times as in a? which arrive using
the same rules as a?.
41
Now, consider these three runs. By construction, the runs are identical up to
time trn. Since all messages sent until time tr take ? to arrive and tr --H tm < ?, 5a
(the server that received request rn at time t? in ai) cannot distinguish between
al and a? through time tr, and so 5a sends response r in a? as well. We will now
again show that the server in B cannot distinguish a2 and a3, and so response
also occurs in a?. However, ?(e(m) ? e(r')) which will violate the assumption that
rn is an update request.
However, a2 and a? can be shown indistinguishable to the server in B by using
Lemma 4.2.1 with S = B, 3' = A, t1 = trn, t2 = tr, a = a? and a' = a?. All the
conditions of the lemma can be seen to be true, and we are done. 0
Primary--Hbackup protocols tolerating send-omission or generalomission failures
exhibit the same blocking properties as those tolerating receive-omission failures,
except that the restriction D < F is no longer necessary. Here we prove just the
results for send?mission failures. The results for general?mission failures then
follow.
Theorem 12 Any primary--Hbackup protocol tolerating f send-omission failures
with f > 1 is C--Hblocking for some C > 2?.
Proof: For contradiction, suppose there is a primary--Hbackup protocol that is
C-blocking where C < 26. We consider the following two runs in which all server
messages take 6 to arrive.
al: There are no failures and all client requests take 6 to arrive. Moreover,
clients send update requests until some update request m evokes a response r.
Let m be received at time tm by server 5a and r be sent at time tT by a server
44
then one can construct protocols that violate our lower bounds. The execution of
one such protocol is given in Section 4.4.2.
Theorem 14 Any primary--Hbackup protocol tolerating f crash failures must have
a failover time of at least f6.
Proof: The proof is by induction on
Base case f = 0: trivially true since a failover time cannot be smaller than zero.
Induction case f > 0: Suppose the theorem holds for at most f --H 1 failures,
but (for a proof by contradiction) there is a protocol P for which the theorem is
false when there are f failures. From the induction hypothesis, there is a run a
with at most f --H 1 failures and an interval [t0..t1] at least (f --H l)? during which
there is no primary. Let 5i be the server that becomes the primary at t1. Consider
the two runs a1 and a? that extend a as follows:
a1: Assume ?1 crashes at time t1. By assumption, there exists a new primary
(say 52) at time t2 < t1 + ?. Since 51 crashes at time ti, ?2 does not receive any
messages from ?1 that were sent after time t1.
a2: Assume that ?1 is correct, there are no other crashes at or after ti and all
messages sent by ?1 after time t1 take ? to arrive.
Since ?2 cannot distinguish al from a? through time t2, ?2 becomes the primary
at time t2 in a?. By Pb5, however, s? remains the primary at time t2 in a?. This
violates Pbl, and so P is not a primary--Hbackup protocol. c
Failover times for all other failure models have a larger lower bound:
45
Theorem 15 Any primary--Hbackup protocol tolerating f crash#link failures has a
failover time of at least 2f?.
Proof: The proof is by induction on
Base case f = 0: trivially true.
Induction case f > 0: Suppose the theorem holds for at most f --H 1 failures,
but (for a proof by contradiction) there is a protocol P for which the theorem is
false when there are f failures.
From the induction hypothesis, there is a run a with at most f --H t failures and
an interval [t0..t1] at least 2(f --H l)? during which there is no primary. Let si be
the server that becomes the primary at t1. Consider the three runs a1, a2 and, a?
that extend a as follows:
ai: Assume that si crashes at time ti and all messages sent after ti take ?
to arrive. Furthermore, the crash of si is the only failure at or after t1. By
assumption, there exists a new primary (say 52) at time t2 < ti + 2?. Since ?i
crashes at time ti, ?2 does not receive any messages from ?i that were sent after
time t1. Furthermore, since all messages take 6 to arrive, any message that was
sent at or after after ti + 6 can be received by ?2 only after time t2.
a2: Assume that ?i is correct, there are no other failures at or after ti, and all
messages sent after time ti take 6 to arrive. Since there are no failures at or after
time ti, by PbS pi continues to be the primary through time t2.
a3: The same as a? except that the link between si and ?2 is faulty and does
not deliver any message sent by si to ?2 after time t1.
46
By construction, 52 cannot distinguish ?i from a? through time t2, and so 5?
becomes the primary at time t2 in ?. Similarly, 51 cannot distinguish a? from a?
through time t2 and so ?i remains the primary until time t2 in a?. This violates
Pbi, and so P is not a primary--Hbackup protocol.
The proofs of Theorems 16 and 17 are similar to Theorem 15.
Theorem 16 Any primary--Hbackup protocol tolerating f receive-omission failures
has a failover time of at least 2f6
Proof: The proof is by induction on f
Base case f = 0: trivially true.
Induction case f > 0: Suppose the theorem holds for at most f --H 1 failures,
but (for a proof by contradiction) there is a protocol P for which the theorem is
false when there are f failures.
From the induction hypothesis, there is a run a with at most f --H l failures and
an interval [t0..t1] at least 2(f --H l)? during which there is no primary. Let si be
the server that becomes the primary at t1. Consider the three runs a1, a? and, a3
that extend a as follows:
ai: Assume that s? crashes at time ti and all messages sent after ti take 6
to arrive. Furthermore, the crash of s? is the only failure at or after ti. By
assumption, there exists a new primary (say 52) at time t2 ? t1 + 26. Since s?
crashes at time ti, ?2 does not receive any messages from s? that were sent after
time t1. Furthermore, since all messages take 6 to arrive, any message that was
sent at or after t1 + 6 can be received by ?2 only after time t2.
47
?2: Assume that 51 is correct, there are no other failures at or after t1, and all
messages sent after time ti take ? to arrive. Since there are no failures at or after
time t1, by Pb5 51 continues to be the primary through time t2.
a3: The same as a2 except that ?2 is faulty and does not receive any message
sent by ?1 after time t1.
By construction, ?2 cannot distinguish a? from a? through time t2, and 50 ?2
becomes the primary at time t2 in a?. Similarly, 51 cannot distinguish a? from a?
through time t2 and 50 51 remains the primary until time t2 in a?. This violates
Pbl, and so P is not a primary-backup protocol. E
Theorem 17 Any primary--Hbackup protocol tolerating f send-omission failures has
a failover time of at least 2f?.
Proof: The proof is by induction on
Base case f = 0: trivially true.
Induction case f > 0: Suppose the theorem holds for at most f --H 1 failures,
but (for a proof by contradiction) there is a protocol P for which the theorem is
false when there are f failures.
From the induction hypothesis, there is a run a with at most f --H 1 failures and
an interval [t0..t1] at least 2(f --H 1)? during which there is no primary. Let ?1 be
the server that becomes the primary at t1. Consider the three runs ?1, a2 and, a?
that extend a as follows:
ai: Assume that ?1 crashes at time t1 and all messages sent after t1 take ?
to arrive. Furthermore, the crash of ?1 is the only failure at or after t1. By
48
assumption, there exists a new primary (say 52) at time t2 < ti + 2?. Since si
crashes at time ti, 52 does not receive any messages from pi that were sent after
time t1. Furthermore, since all messages take ? to arrive, any message that was
sent at or after t1 + 6 can be received by ?2 only after time t2
a2: Assume that si is correct, there are no other failures at or after ti, and all
messages sent after time ti take 6 to arrive, Since there are no failures at or after
time t1, by Pb5 ?i continues to be the primary through time t2.
a3: The same as a2 except that si is faulty and does not send any messages to
?2 after time t1.
By construction, ?2 cannot distinguish al from a3 through time t2, and 50 52
becomes the primary at time t2 in a?. Similarly, 51 cannot distinguish a? from a?
through time t2 and so ?i remains the primary until time t2 in a3. This violates
Pbl, and so P is not a primary-backup protocol. E
Clearly from Theorem 16 or Theorem 17, the lower bound on failover time for
generaI?mission failures is also 2f6.
4.4 Summary
Table 4.1 summarizes all of our lower bound results.
4.4.1 Protocol having D> F
We now describe a protocol that has D > F, and violates our crash+link lower
bound for degree of replication. For this protocol, f = 1 and ?s = 2, rather than
= 3 as required by Theorem 6 (refer to Figure 4.1).
There are two servers si and ?2, with si being the initial primary. When 5i
receives a request from a client c, it computes the response, updates its state,
49
Table 4.1: Summary of lower bound results
Failure			Degree of			Blocking			Failover
model			replication			time			time
crash			fls > f			0
crash+link			?s > f + 1			0			2f6
?			if n8 = 2f and f = 1
?s> [?j t			2?			if n8 ? 2f and f> it			2f?
general			?			if f = 1
omission fls > 2f			2?			iff> 1			2f?
D <F assumed.
c
si
M r			2?+?+F
A
y
1'
y
y			______________
2
?3
t4
H ? m- F
Figure 4.1: Protocol having D> F
50
sends a state update message to ?2, and then sends the response to the client. It is
possible that due to link failures, ?2 does not get a state update message. However,
as we will see, in this case ?2 will never process any requests from the client.
Server ?i also sends periodic ("I am alive", r) messages to ?2 at time ?r
for ? = 0, 1, 2,... (dashed arrow in the figure). Assume, for this protocol that
r > 2?. Server 5i then waits for 52 to acknowledge these "I am alive" messages.
Furthermore, ?1 processes no client requests during the period it is waiting for s2
to acknowledge these messages.
If s? does not get an acknowledgement to an ("I am alive",??) message (in the
figure, ti = ir) by time t2 = ir + 2?, then ?i is correct because f = 1 and either s?
has crashed, or the link between si and ?2 has failed. However, in case the link has
failed, then ?2 can also assume that si has failed. Therefore, in order to inform ?2
that ?i has not crashed, si uses the client requests to inform ?i of this fact. Server
51 first becomes the backup, and then sends a (Dest = 52) message to the client
(message 1 in the figure), informing the client to send further requests to ?2 As
we will see in the next paragraph, ?2 will receive client requests and conclude that
si has not crashed. At time t4 = (I + l)r + 26 + 6 + F, after giving enough time to
?2 to find out that 51 has not crashed, St agains becomes the primary and sends a
(Dest = si) message to the client (message 3 in the figure). Since si is correct, it
now continues to be the primary.
We now describe what ?2 does in case it does not get a ("I am alive",?r) message
by time ir + 6. Server ?2 becomes the primary at time t2 = ir + 26. Clearly, St
will not get an acknowledgement to the ("I am alive",?r) message (as shown in the
figure). However, it is possible that ?t did not get an acknowledgement to the ("I
am alive?,(? --H l)r) message as well because the acknowledgement that 52 sent got
lost due to a link failure. In either case, it can be seen from the description in the
previous paragraph that si would have become the the backup by time tr + 26,
and cannot become the primary again before time ir + 26 + 6 + F. Thus if s?
51
has not crashed, then Pb1 will not be violated as long as ?2 does the remain the
primary beyond ir +26+6+ F. We will show that this is the case. After becoming
the primary at time iT + 26, ?2 waits until time t3 = ir + 26 + 6 + V to receive a
client request. Again, from the previous paragraph, ?2 will receive such a request
(message 2 in the figure) if ?i has not crashed because si would have informed the
client by time ir + 26 to send future requests to ?2 If 52 receives such a request,
then it concludes that si has not crashed, and immediately becomes the backup.
On the other hand, if ?2 does not receive such a request, then si must have
crashed. In this case, 52 is correct, and so it continues to be the primary. It sends
a (Destj = 52) message to the client so that further requests are sent to ?2 Thus
D = 6 + V + 6 i.e. D > V. Furthermore, since si is faulty, the link between
51 and ?2 must be correct, which implies that ?2 must have received all the state
update messages that ?i would have sent before it crashed. Thus, the state of ?2
is consistent with the state of the previous primary, and so it can process requests
from the client.
4.4.2 Protocol violating Pb5
As we had mentioned earlier, PbS is necessary in order to prove our lower bounds.
We show this by giving a crash failure protocol for ?s = 3 and f = 2 that does not
satisfy PbS and has a failover time of less than 2f6 (refer to Figure 4.2).
The three servers are 51, ?2 and 53 with si being the initial primary. Fur-
thermore, si will remain the primary until it crashes. When ?i receives a request
from a client c, it computes the response, updates its state, sends state update
messages to 52 and 53, and then sends the response to the client without waiting
for acknowledgements from the backups. For crash failures, a backup can detect
the failure of a primary by the lack of an expected message from the primary. For
example, if the clocks of the servers are perfectly synchronized, then the primary
can send a ("I am alive", r) message to the backups at time ?r for ? = 0,1,2
52
c
\\ ti
my
A
53
7$
x
\			t2			?
iY4			\\\
D
7			7
4 2			I
MM?y%X Mmm??i1y774
______At
ff
6
Figure 4?2: Protocol with f = 2 and Failover Time < 26
If a backup does not receive this message by time ? +6, then the primary crashed
at some time tz in the range (? --H 1)T < tx <T. The backups detect the failure of
the primary by using this failure detection technique. The periodic "I am alive"
messages are shown as dashed arrows in the figure.
If at some time ?r + 6, 52 has not received an ("I am alive",?r) message from
si, then ?2 knows that ?i crashed at some time t1: (? --H 1)T < ti < ?r and 50 ?2
becomes the primary (time t2 in the figure). Furthermore, if 53 has not crashed,
then it will also detect si's failure by not receiving an ("I am alive",?r) message
from si where ? --H 1 < ? ? + 1. In the figure, ?i has crashed just after sending
the "I am alive" messages, receiving a client request, and sending the state update
message to 53 but not ?2
The new primary 52 remains the primary only through the interval [?r + & .?r +
6 + ?), for some ? > 0, even if no more failures occur. This violates PbS. During
this interval, ?2 processes no requests that it receives from the clients. However,
53
it remembers these requests (message 1 in the figure) and forwards them to be
processed by 53 (message 2 in the figure).
At time (? + 1)r + 6 + ? (time t3 in the figure), 53 becomes the primary if it
has not already crashed. Since ? + 6 + ? < (? + 1)r + 6 + ?, 52 is no longer the
primary at this time as required by Pb1. 53 first sends its state to ?2 (message 3
in the figure) in order to make the states of ?2 and 53 equal. The new primary 53
then starts processing client requests, including those forwarded by ?2 As before,
it sends a state update message first to ?2 (message 4 in the figure) and then to
the clients (message 5 in the figure).
Finally, after t2 + ?, ?2 updates its state on receiving the state and state update
messages from 53. Sometime later, if ?2 stops receiving `?I am alive" messages
from 53, ?2 again becomes the primary. It first processes any requests that it
had forwarded earlier to 53 but 53 did not process before crashing. Thereafter, ?2
simply processes new requests from the clients since there are no backups.
The longest period during which there will be no primary occurs when si crashes
at time t1 = (? --H 1)r, ?2 crashes at time t2 = ?r + 6, and 53 becomes the primary
at time t3 = ?r + 6 + e, giving a failover time of T + 6 + e. If r + ? < 6 then the
failover time is less than 26.
Chapter 5
Optimal Primary-Backup
Protocols--HPart I
In the following two chapters, we show which of our lower bounds derived in Chap-
ter 4 are tight. The protocols (we call these ?-blocking protocols) showing that the
lower bounds on blocking time for send?mission, receive??mission and general--H
omission failures are tight when f = 1 are quite different from the protocols that
show the other lower bounds to be tight. We, therefore, address these 6--Hblocking
protocols in the next chapter (Chapter 6) and concentrate on the remaining pro-
tocols here.
5.1 Protocols for the Clients and the Bofo
Server
Property Pb4 requires that the primary-backup service behaves like some bofo
server. Figure 5.1 gives a canonical server P, and our primary--Hbackup protocols
implements a (p, k, ?)A)ofo version of P for some values of p, k and ?. Figure 5.2
54
gives the protocol that a client? uses to communicate with the bofo version of P.
In both the figures, the statement "`when 0: S" causes the program to block until
the condition 0 is true, at which point statement 8' is immediately executed. For
example, the condition "received m from client c" causes a process to block until
it gets a message m from the underlying message system. In addition, when this
particular when terminates, the variable ccontains the identifier of the client that
sent the request.
initialize()
inform--Hclients("Dest=
do forever
when received request from client c:
response fl(state,request)
state = state0response
send response to client c
od
procedure initialize()
state := e
procedure inform--Hclients(ic)
send (ic) to all clients
Fi?ure 5.1: Protocol run bv a canonical server P
5.2 The Primary--HBackup Protocol Schema
We first make the simplying assumption that the links between the clients and
the servers are non--Hfaulty and there are no omission failures between the clients
and the servers (i.e. only the links between servers can be faulty when assuming
crash+link failures, and omission failures can occur only between servers when
assuming omission failures). We show later how this assumption can be removed.
56
cobegin
I do forever
when received ?Dest = s": Dest?
od
I do forever
od
coend
if want to send request
send request to Des4
if not received response within 2p then
recover() /? call some recovery procedure, which might retry ?/
else
Fi?ure 5.2: Protocol run by client i interactin? with server
Our primary--Hbackup service consists of ?s servers ....... snsj, each of which
runs the protocol in ?igure 5.3. The protocol run by the clients is the same as
the protocol for the bofo--Hserver which is shown in Figure 5.2, with p = ncC + 6.
NVe need this large value of p because, as we will see, our implementation of the
primary--Hbackup service has a response time larger than C + 26 whenever C > 0.
Also, note that this is just one possible value of p. For example, it is possible
to define p such that it is independent of the number of clients, if we make the
assumption that at any time, there are only a bounded number of requests in the
message queue of the primary.
initialize(i)
cobegin
I if i = 0 then primary(i) else backup(i)
I I delivery--Hprocess(i)
II failur?detector(?)
coend
Fi ure 5.3: Protocol run by?seryer Sj to emulate server P
57
The procedures primary and backup (shown in Figure 5.4) are the same
for all the failure models, On the other hand, the implementation of the pro-
cedures delivery--Hprocess, failure--Hdetector, initialize and broadcast (used
in Figure 5.4), depend on the particular failure model being assumed. However.
we ensure that these different implementations always satisfy a set of properties,
listed below as BO Bi 1. ?Ve extracted these properties in order to make our
proofs modular. In particular we prove that, independent of the failure model, the
protocol in Figures 5.3 and 5.4 satisfies Pbl-Pb5, provided the four procedures
satisfy BO--HBil.
In the properties BO--HB 11, the values of the constants d, C and r depend on
the failure model. Intuitively, d corresponds to the amount of time that can elapse
from the moment a message is broadcast to the moment it is dequeued by the
receiver, C corresponds to the blocking time and r corresponds to the periodicity
of probing in the underlying failure detector.
?Vhen we say that a server "halts", we mean that either the server has crashed
or has stopped executing the protocol by executing a stop. The array of booleans
Fau1ty? indicates which servers sk believes have halted: Fa?Ity?[sj1 being true
means that s? believes sj has halted. ?Ve define a broadcast by a server to be
successfulif the server does not halt during the corresponding execution of broad-
cast. Finally, whenever, we refer to "dequene in the following properties, we
mean dequeueing from the queue Rqueue?.
The properties can be subdivided according to the procedures they relate to:
58
procedure primary(j)
inform--Hclients("Dest =
broadcast((myiastiog, 55,last(statej)),j)
/* broadcast(iN4,j) sends M to all servers, the sender being 55 ?/
do forever
when received request from client c:
response := fl(statej,request)
state5 := state5 0 response
broadcast((log, 55,response),j)
send response to client c
od
procedure backup(k)
do forever
((tag,s5,r),j) := Deq(Rqueue?)
/? The messages that are broadcast are enqueued in Rqueue?
/? Assume that dequeueing an empty queue
does not return any sensible value of tag ?/
/* synchronizing with the new primary ?/
iftag = mylastlogthen
ifr ? state? then
ifr = last(state?) then skip
else state? := state? \ last(state?)
else state? := state?0 r
/* logging response from primary ?/
iftag = log then state? := state?0 r
/* becoming the primary *1
ifVi < k : Fau1ty?[si? then primary(k)
od
Figure 5.4: Procedures primary and backup
59
Properties of broadcast and delivery--Hprocess:
BO: If s? dequeues 6, then 6 must have been broadcasted before; and s? dequenes
6 at most once.
Bi: If s? initiates a broadcast 6' after it initiates broadcast 6, then no server
queues 61 before it dequeues 6.
B2: If Sj initiates a broadcast 6 at time t, then no server dequeues 6 after time
+ d.
B3: No successful broadcast takes longer than C to complete. Furthermore,
o+ if C = 0 and a? initiates a broadcast at time t and only halts (if it ever
halts) after time t, then the broadcast is successful.
o+ if C> 0 and 5j initiates a broadcast at time t and does not halt by time
t + C, then the broadcast is successful.
Properties of failur?detector:
B4: If Fau1tyj[s?1 becomes true, then it remains true, as long as Sj does not halt.
B5: The value of Fau1tyj[s?1 can only change at times t = ir + d for some integer
> 0.
B6: If Fau1tyj[a?] = tr??e at time t then ?k has halted by time t.
B7: If sk halts by time t, and Sj does not halt by time t+r+d, then Fau1tyj[s?? =
trt?e by time t + r + d.
Properties of broadcast and delivery--Hprocess interacting with failur?detec-
tor:
60
138: No correct server halts in procedures initialize, broadcast, delivery--Hprocess
or failure--Hdetector.
139: If 5? initiates a successful broadcast at time t, then for all non--Hhalted servers
5k? k > j, Faulty?[sj] = false through time [rtl? + d.
1310: If s? initiates a successful broadcast b, then for every non--Hhalted server 5?:
(Faulty?[sj] = true) ? (5k has dequeued b).
1311: If 5? initiates a broadcast b at time t and 5k? k > j initiates a broadcast b'
then either no server dequeues b after b', or Faulty?[sji = false through time
+ d.
5.2.1 Proofof Correctness
We now show that the protocol in Figures 5.3 and 5.4 satisfies Pbl--HPb5 if the
procedures initialize, broadcast, delivery--Hprocess and failure--Hdetector sat-
isfy BO Bli. We will use the notation t k to denote the time 5k becomes the
primary and ? k to denote the time 5k halts. Note that, from Figure 5.4, t k is
also the time when Vi < k : Faulty?[sj] = true first holds.
Lemma 5.2.1 Let sj receive a client request m at time t.
o+ if C = 0 and sj only halts (f it ever halts) afler time t, then sj sends the
response to m by time t.
o+ if C > 0 and sj does not halt by time t + C, then sj sends the response to m
by time t + C.
0
Proof: Follows from Figure 5.4 and 133.
61
Lemma 5.2.2 At any time, there can be at most one request from any client i in
the queue of any server
Proof: Assume, for contradiction, that t2 is the first time that there are two
requests (say mi and m?) from some client i in the queue of sj. Assume mi was
sent at time t1 and so it was enqueued by ti + 6. Furthermore, since m2 is the
first duplicate request, there are at most Ttc --H 1 requests ahead of m1 in s?'s queue.
Therefore, from Figure 5.4 and B3, 5? will dequeue m1 by time ti + 6 + (nc --H l)C.
Therefore t? < ti+6+(n?--H1)C. However, since 2p = 2(ncC+6), from Figure 5.2,
client i cannot send m? before timeti+2(n?C+6) i.e. t9 > ti+2(ncC+6), which
is a contradiction.			c
Lemma 5.2.3 If a request m is sent by a client at time tm to server Sj, and sj
rema'ns the primary through the interval [trn..tm + p], then m is received by sj by
time t,n + p.
Proof: Since the message delivery time is at most 6, m is enqueued at Sj by time
tm + 6. From Lemma 5.2.2, there can be at most n? --H 1 requests ahead of m in 5j's
queue. Therefore, from Figure 5.4 and B3, m is received by time tm+(nc--H 1)C+6.
Since p = ncC + 6, the lemma holds.			E
Lemma 5.2.4 If a client sends a request at time t and a response is sent, then
the response will arrive at the client by time t + 2p.
c
Proof: Follows from Figure 5.4, Lemma 5.2.3 and B3.
62
Lemma 5.2.5 iVo correct server ever halts.
Proof: Follows from Figure 5.4 and B8.
c
Lemma 5.2.6 Jf sj initiates a broadcast b and 5k, k > j initiates broadcast b',
then no server dequeues b afler b'-
Proof: If Faulty?[sj] = true before time t + d, then the lemma follows trivially
from Bil. If Faulty?[sj] = false through time t + d, then from Figure 5.4, 5k
cannot become the primary before time t + d, and hence does not broadcast b'
before t + d. The lemma follows from BO and B2. 0
Lemma 5.2.7 If sj initiates a successful broadcast b, then for each non--Hhalted
server 5k, ? < k: 5k dequeues b or 5k never becomes the prnmary.
Proof: Suppose, for contradiction, that b was initiated at time t', 5k does
not dequene b but becomes the primary at time ? k. From Figure 5.4, for all
i < k, Faulty?[sjj = true at time ? k. Therefore, from B4, B5 and Figure 5.4,
t k = lr + d for some 1 > 0. Furthermore, from B9, t k > [L;1 r + d. Therefore,
there exists a t > t' such that 1 = [?rtl However, this violates BlO for the time t
because FaultYk[551 = true at time t k = [rtlr + d and b has not been dequened
by time t k = [rt]r + d.			0
Lemma 5.2.8 If a response r is sent to the client, then there was a successful
broadcast of((iog,s5,r),j) for some 55.
63
Proof: Follows from Figure 5.4.
Theorem 18 The protocol in Figures 5.4 satisfies Pbl.
Proof: De?ne the predicate Prmy?? at time t to be: (s5 has not halted by
time t) A (Vk < j : Faultyj[s?] = true at time t). Note that Prmy?? can be lo-
cally determined by 55. Consider Prmy?? and Prmy?1, where i ? j. Now, Prmy??
? (Faultyj[si] = true) and from B6, 5j has halted. Thus Frmy?? ? ?Prmy??. E
Theorem 19 The protocol in Figures 5.4 satisfies Pb2.
Proof: Follows from Figure 5.2.
Theorem 20 The protocol in Figures 5.4 satisfies Pb3.
c
Proof: A client only sends to a server that has executed inform-clients and a
server 55 executes inform--Hclients only after it becomes the primary. E
Lemma 5.2.9 Let 55 be the first server that does a successful broadcast after ? i
Let 5?k be the kth server (ties broken arbitrarily) to halt between ? i and ? j such
thati<??<j. Then, ??k?<?i+k(r+d+C).
Proof: ?Ve prove the result by contradiction. Let ?iv be the first server that
violates the lemma i.e. Vu : 0 ? u < v : i? < ? i + u(r + d + C) and
? ? i + v(r + d + C). Let 5ji be a non--Hhalted server with the smallest id
64
at time ? ?v?l such that i ? i' < I (there is such a server, in particular the server
that halts at time ? iv). By B7 and Figure 5.4, s?1 must become the primary by time
t = ? iv?i+r+d < I i+(v--H1)(r+d+C)+r+d = I i+v(r+d+C)--HC < I iv?C.
Since ?? # j, s?i does not initiate a successful broadcast. Hence, by B3 s?? must
halt by time t + C I ?v, which is a contradiction since the ?th server does not
halt before time I iv. E
Lemma 5.2.10 Let sj be the first primary that initiates a successful broadcast
afler Ii. Then ? j < li + (I --H i)(r + d) + (j --H i --H 1)C.
Proof: From Figure 5.4, all servers 5h, h < j must have halted by time ?
Let 5 = fsu: i < u <j A 5? halts between I i and tit and let 5 = k. Let ?v
be the last server to halt in 5 (with ties broken arbitrarily). From Lemma 5.2.9,
?v must halt by time I v I i + (I --H i --H 1)(T + d + C). Therefore, by B7, s?
must become the primary by time I i + (I --H i --H 1)(r + d + C) + ? + d or by time
Ii+(j--Hi)(r+d)+(5--Hi--H1)C.			c
Lemma 5.2.11 Let m be a request that is sent to s? such that m does not have a
response. Then rn must have been sent at time t > max(t I, I I --H C --H p) and if
received, was received in the interval [max(t I, I I --H C), I I]
Proof: Clearly, m couldn't have been sent before t I because sj executes
inform--Hclients only after becoming the primary. Suppose that m was sent at
time t, where t I < t < I I --H C --H p. By Lemma 5.2.3, m must have been re-
ceived by time t + p and by Lemma 5.2.1 responded to by time t + p + C < I I, a
contradiction, since m does not have a response.
65
Similarly, suppose that m was received at time t, where ? j ? t < ? j --H
Again, by Lemma 5.2.1 a response to m would be sent by time t + p + C ? ?j, a
contradiction.			ED
Let x(a[tl) denote the final value of variable x in the run a[t] where a[t] is the
prefix of a through time t. ?Vhen a is clear from context, we will abbreviate x(a[t])
as
Define the following over finite sequences:
a E b=--H a is a prefix of bA Ibi --H lal ? 1.
a E b= a is aprefixofbA Ibi --H Ia = 1.
prefix(b) = a if IbI > 1 and a E b. pre:'ix(b) is undefined otherwise. Therefore,
prei'ix(b) is b with the last value removed.
Define the following over some run a, possibly infinite.
sI,(sj) = (si initiated a successful broadcast).
prevst(s?) = 5k, h < i if sf(sh) and there does not exist a ji, h < i' < i such that
st(sji). prevs?(sj) is undefined otherwise. Intuitivdy prevsf(sj) is the last
server before Sj that initiated a successful broadcast.
Pred(si) = (s?: h <i and 5h becomes the pnm&y?.
Succ(sj) = fsj: j > i and Sj becomes the pfl'mary?.
Lemma 5.2.12 Let st(sj) and Sj = prevsf(s5). Then no broadcast initiated by
5h E Pred(s?) is dequetted afler t
66
Proof: By Property B9 and Figure 5.4, t j > [Ail7 + d.			The lemma now
follows from Property B2.			E
Lemma 5.2.13 Let sf(sj), 5j = prevs:'(sj) and 5k E Succ(sj). Then at most
one broadcast b = ((log, 5j, last(Fina4')), i) by 5j can be dequeued by Sk at time
t> ti
Proof: Let 5j initiate its last successful broadcast at time ti. From B9 and
Figure 5.4, $ j > r + d and so from B2, no successful broadcasts by Sj can
be dequeued after time $ j. The first broadcast that sj initiates is successful, so
by BO, the only broadcasts by 5j that can be dequeued after $ j are of the form
((log, 5j, response)), i). From Figure 5.4, Sj can make at most one unsuccessful
broadcast of the form ((log, 5j, last(Finali)), i). The lemma now follows from
BO.			c
Let Jnit? = statej($ i) and Fina1? = statej(? i). In the following lemma, if 5j
is undefined, then let $ i = ? i = 0 and Jnit? = Finalj =
Lemma 5.2.14 Let sf(sj), 5j = prevsf(sj) and 5k ? Succ(sj) u fsj?. Then
1. state?($ j) E Fina4.
2. If there exists a broadcast initiated bys? that is dequened bys? at time t > $ j,
then state?($ 1) E Final?.
3. If there exists a broadcast b initiated by 5h ? Pred(sj) \ fsij that is de-
queued by s? at time t > $ j, then b = (mylastlog, 5h, last(Fina1?)) or
b = (mylastlog, 5h, last(pref ix(Fina4))).
67
4. If Initj E Final?, then the response last(Finali) was not sent to the client.
.j. If Final? = Initi, then mit1 = Initi
Proof: Let n = : s?' E Pred(sj) A Sf(sii)tI. We prove the result by
induction on n.
Base Case: n = 0. In this case, there is no server 5j and so Initi = Final? =
At time 0, all states are e. Since n = 0, all servers in Pred(sj) must have halted
during their first broadcast and so all broadcasts initiated before t j must be of
the form ((mylastlog, --H, e), --H). From Figure 5.4, this satisfies (1) (2), (3), (4),
(5).
Induction Case: n > 0 and we assume that the lemma is true when 5j became
the primary. Let b' be the last broadcast by 5h ? Pred(s?) to be dequeued after
$ i. If n > 1, then let Final??1 = F??ulprevs?(si)? else let Final??1 =
Observation 1: We now show that if s? E Succ(sj) dequeues b', then statej E
Fina4?1 after s? dequeues b' and updates its state. If n = 1, then b' is of the form
((mylastiog, 5h, e), h) and the argument is similar to the Base case. So assume
that n > 1. If h = i --H 1, then the result follows from Figure 5.4, Lemma 5.2.13
and the induction hypothesis for (2). On the other hand, if h $ i --H 1, then the
result follows from Figure 5.4 and the induction hypothesis of (1) and (3).
Consider now the first (successful) broadcast b = ((mylastlog, 5j, r), i) that 5j
makes at time $ i. By Lemma 5.2.7 and B2, any server s? ? Succ(sj) must dequeue
b by time [L$lr+d.
Observation 2: We next show that state1 = Initi after si dequeues b and
updates its state. By Lemma 5.2.6, b cannot be dequened before b'. However, from
68
Observation 1, state1 E F?nat??1 after b' is dequeued. The result now follows from
Figure 5.4 and the induction hypothesis for (1).
Let F?na1? = Initiori...rN (the sequence r1 . . . r? could be empty). Let i'I =
((log, 5j, r1),i)... ((log, sj, ry), i). Let the sequence r = ..... r? be those requests
that have been sent to clients, and let R = ((log, Sj, ri), i)... ((log, 5j, rs), i).
From Figure 5.4 R E Al and from Lemma 5.2.8, all broadcasts R must have been
successful.
Assume that si+1 E Succ(si) is any server that becomes the primary after Sj
and before si. (the argument is similar if no such server exists). By assumption,
si+1 does not initiate a successful broadcast. Also, if the last broadcast in R was
initiated at time tR by Sj, then by Property B9, t (i + l) > [t?]T + d. Therefore,
again by Lemma5.2.7, BO Bi and B2, boR must have been dequeued by sj+?, sj and
E Succ(s1) by time t (i + 1). From Observation 2, dequeueing li makes state??=
Initi where ii E (? + l,j,lJ. Therefore, state?i(t (i + 1)) = Ifliti 0 r? . . . ?y--H? or
stateii(t (i + 1)) = In?tj ...... ry. (2) now follows, because from Lemma 5.2.6, if
the last broadcast by Sj is dequened after ? j, then any broadcast by Succ(si) can
only be dequeued after ?
When ?i+1 becomes the primary, it can initiate only one broadcast (by as-
sumption) of &` = (mylastlog, Sj+?, last(statei+1($ (i + 1)))) which now along
with Lemma 5.2.12 satisfies (3). If 5?1, E fi, 1) ever delivers b' at some time
t < ? j, we still have state?t(t) = In?tj 0 r1 ... rN?1 or statej(t) = Initi 0 r1 ...
(from Figure 5.4). This satisfies (1) and (5).
(4) is satisfied for the following reason. From the argument in the previous
paragraph, In?tj E Fina1? only if either sj+? or s? did not dequene last(Finatj),
69
which implies that the broadcast of ((log, 5j, last(Finai?)), i) must have been Un-
successful. The result now follows from Lemma 5.2.8.			E
In the following lemma, we will construct a run ap of the bofo server P and
a set of corresponding failure intervals. For brevity, we will refer to the jth failure
interval that we construd in ap as ff(ap, i). Recall that a message sent by p?
to pj gets enqueued (if it ever gets enqueued) in the message queue queue?4 at
process pj.
Lemma 5.2.15 Let a be any run of the primary backup system in which si(sj)
holds. Then
1. there is a finite run ap of a (p, 5, r + d + C + p + ?) bofo server P such that
for all clients k : ap k a[$ j?
2. tj?ff(ap,j) andjff(ap,j)i<r+d+C
3. statep(ap[t 5]) = Initj
I For all clients k : queue?,p(ap[? 5]) =
5. Response r is sent to client k in ap iff r is sent to client k in a[t 51 Further-
more, if r is sent at time t in ap, then r is sent at time t' in a[t 51? where
t <ti <t + c.
Proof: Let n = If?i' : s?' Ei Pred(sj) A s?(?i')tI We prove the result by
induction on n.
Base Case: n = 0. sj is the first primary that initiates a successful broadcast
at time t 5. If 5 = 0, then $ 5 = 0 and the lemma trivially holds.
70
If 5 $ 0, then by B3, ?o must have halted by time C and so ? 0 < C. Since
5 > 1, from Lemma5.2.lO, t 5 <--H ? 0+5(T+d)+U--H1)C <--H C+5(r+d)+(5--H1)C =
5(r + d + C). Divide the time from 0 to ? 5 into 5 failure intervals, each of size
at most T + d + C. Construct the run ap until ? 5 such that the same requests
are sent in ap and a[$ 5]. All requests sent in the interval [o..t 5 --H p] are never
enqueued in ap and all requests sent in (t 5 --H $51 haven't yet been enqueued.
Thus (4) is satisfied because all queues are empty and (5) is satisfied because no
responses are sent until time $ 5 in either a or ap. All outages corresponding to
these requests will lie in one of the above defined failure intervals. Thus (1) and (2)
are satisfied. Also, from the above construction and Figure 5.1, statep($ 5) =
Since by Lemma 5.2.14, Initj = ?, we have satisfied (3) as well,
Induction Case: n > 0. By the induction hypothesis, the lemma holds
through time $ i where Sj = prevsf(sj). Therefore, there exists a finite run a?1
corresponding to a[$ i]. ?`e now extend the run a1? to ap.
In order to simplify notation, we divide the proof into two cases.
Case 1: ?i--H$i<??+p.
1. The jth failure interval is extended by [$ ?..? i]. From Lemma 5.2.10, $ 5 --H
Ii < (5--Hi)(r+d)+(5--H?--H1)C <--H (5--Hi)(r+d+C). Divide the time from
I i to $ 5 into 5 --H i failure intervals, each of size at most T + d + C.
2. Construct the run ap until $ 5 such that the same requests are sent in ap
and a[$ 5]. All client requests sent in [mar(0, $ i --H p)..$ 5--Hp] to Sj1,?1 <
that are not enqueued in ap' are never enquened in ap, and all requests sent
in ($5 --H p..$ 5] to s?', < ? that are not enqueued in a'? haven't yet been
enquened in ap. We can do since all outages will lie in one of the failure
71
intervals constructed in step 1.
Define Al such that Fina4 = Initjoill. From Lemma5.2.14 Initj = InitiQAf
or Initj = Jnitj 0 R, where R E i'J. Let ?\4 be the requests corresponding
to the responses in Al
\&`e next construct ?p so that the same responses are received by the clients in
both ? and apin the interval [t i..t j]. We do so as follows. Let rn ? ?\4, and
let rbe a response to m such that ris sent to the client in a. Construct ap so
that mis received at the same times in both aand ap. We can do this because
from Lemma 5.2.3, m is received in a within p of being sent. ?`e can further
ensure that response r is sent in ap as well, because statep(ap'[t i]) = Initi,
for all clients k : queue??p(ap'[$ ?1) = = ?U?U?k,sj($ i) (both from the
induction hypotliesis) and, as argued above, no requests sent to s?', i' < i will
be enqueued after ? i in ap. Thus both Sj and P receive the same sequence
of reques?s, at the same times, and so go through the same state changes. In
ap, the response r is sent at the same time mis received (note from Figure 5.1
that this has to be true of all runs of P) whereas in a[t i], r might be sent as
much as C time units later by Sj. This satisfies (5). However, we construct
ap so that r is received at the same time by the client in a and ap. We
can do this because r is received in a by the client within C + ? after m is
received, and p > C + ?. Finally, if a response sent before ? i is received by
a client after ? i in a, then that response is received in ap as well. This can
be done from the induction hypothesis for (5).
We now finish the construction of ap. First assume that M # ?. If Initj =
Initi o R, then assume that no further requests are enqueued in ap after
72
the request corresponding to last(R) is enqueued. ?`e can do this because
from Lemma 5.2.14, last(Ai) wasn't sent to the client. The corresponding
outages will lie in the failure intervals defined in step 1. Thus we have
?tatep(ap[t ji) = statep(a'p[? i]) 0 R = Jnitj satisfying (3). (1) and (2)
also follow from the construction. If Initj = Jnitj 0 Al, then again assume
that no request is enqueued after the request corresponding to last(i'l) (say
m'). If ?? does not have a response in a, then construct ap so that P receives
m', changes state but omits to respond to ?? \Ve can do this because this
outage will fall in the failure interval j.
If Al = c, then (1), (2), (3) and (5) can similarly be shown by assuming that
no requests are enqueued in ap in the interval [t i..t ji.
3. Finally, we show that the queues are empty. The queues are empty at time
? z. By the above construction of ap, no requests sent to s?', i' < i are
enqueued after $ i in ap. Furthermore, the requests sent after time $ i to
5 i < ?` <j, other than m' in step 2, that are received in a[$ ji are received
in ap as well. Also, by construction in step 2, ?? is not enqueued at P
at time $ j. Finally, again by construction, all requests sent in [$ i..$ j] to
si', i < j? < j that are not received in a[$ j' are never enqueued or haven't
yet been enqueued in ap. Thus Vk: queue?,p(ap[$ j]) =
Case 2: ??--H$i>?+p.
1. Extend the ith failure interval by [$ i..$ ? t ? + p]. Let k = max($ i + ? ? i --H
O --H C). Now since $ j --H (k + p) < $ j --H + C < (j --H i)(r + d + C) (from
Lemma5.2.lO), divide the interval [k+0..$ il intoj--H? failure intervals, each
of size at most r + d + C
73
2. From the implementation of inform--Hclients, no request can be sent to server
5j1, < i after time ? i + ?. Therefore, construct ap so that all requests sent
in [max(O, t i --H p)..? + ? to 5j'? ?1 < that haven't yet been enqueued by
? i are never enqueued in ap. The outages will lie in the failure interval i.
3. Similar to Case 1, we can show that the same responses are sent in a and
ap. The only difference is the following. ? in step 1, k = ? i --H p --H C, then
by Lemma 5.2.11, the outage corresponding to the request rn' in Case 1, will
lie in one of the above j --H i intervals. On the other hand, if k = $ i + ?, then
the outage will lie either in interval i, or one of the j --H i intervals.
4. Similar to Case 1, we can show that the queues are empty.
E
Lemma 5.2.16 For any run apB of the primary-backup system there exists a
run ap of the (p, f, T + d + C + p + ?) bofo server P such that for all clients
ap k apB.
Proof Outline: Consider the run apB of the primary-backup system. Let i be
such that sf(sj) and no server s?, j > i ever makes a successful broadcast in aPB
Since by Lemma 5.2.5, the number of servers that halt are bounded by f, such an
i has to exist and i < f. Therefore, from Lemma 5.2.15, there is a run ap? of a
(p, f, r + d + C + p + ?) bofo server such that for all clients k : &p ?? ap?[t i]. It
is then easy to extend a?' to ap, by an argument similar to one in Lemma5.2.15. E
Theorem 21 The protocol in Figures ?.l satisfies Pb4.
74
Proof: Follows directly from Lemma 5.2.16
Theorem 22 The protocol in Figures 5.4 satisfies PbS.
Proof: From Lemma 5.2.5, a correct server never halts. The result now follows
because a correct primary always remains the primary.			c
We now prove a result about the failover time of the system.
Lemma 5.2.17 For any time t, if at most F servers have halted by time t, then
for all t1,t2:t1 < t? ? t such that t2 --H ti > F(d + r), there is a primary in the
interval [t1..t2j.
Proof: We prove the result by induction on F
Base Case: F = 0. Trivial, since from Figure 5.4, so will always be the
primary.
Induction Case: F > 1. Assume that the lemma holds as long as at most
F --H 1 servers halt. Consider any run a. If less than F servers halt in a, then
the lemma follows from the induction hypothesis. Therefore, assume that at least
F servers halt in a. Let the Fth server halt at time t' and let this server be 5?.
If t' > t, then the lemma follows from the induction hypothesis. So assume that
t'<t. Lett'2=ti+(F--Hl)(d+r). Notethatt'2+d+r<t2.
If t'2 < t', then by the induction hypothesis, there is a primary in the interval
[t1..t'2]. Since [t1..t'2] is a prefix of [t1..t2j, there is a primary in the interval [t1..t2?
as well.
On the other hand, if ti2 > t', then let sj be a non--Hhalted server with the
smallest id at time t'. There will exist such a server because by Lemma 5.2.5,
fls > f > F. Sincet' <t'2 and ti2+d+7 < t2, we must have that t' +d+r < t2.
Also, since at most F failures occur until time t, s5 cannot halt through time
t2 < t, which implies that sj cannot halt through time t' + d + 7. Thus, from
B7, Vi : Faultyj[sij = true by time t' + 7 + d and from Figure 5.4, sj will be the
primary by time t? + d + 7 ? t'9 + d + ? = t1 + F(d + 7) < t2. Hence there is a
primary in the interval [t1..t2].			E
5.2.2 Message losses between the clients and the servers
We now consider the case when there can be message lost between the clients
and the servers due to both failures of the links between clients and the servers,
and due to omission failures of the servers. The primary procedure in Figure 5.4
will not tolerate such failures. For example, a non--Hfaulty primary might omit to
receive all requests from a client, thereby violating Pb4. Similarly, inform--Hclients
might omit to inform some of the clients. However, it is relatively easy to mask such
failures when clients are non-faulty.' Assume that there is an upper bound (say 0)
between any two requests from a client and that requests carry sequence numbers.
If the primary does not receive any requests from a client during an interval of
length 0 or if the primary receives some request with a sequence number gap, then
the primary halts. Similarly, the primary can detect that a response was lost by
having clients acknowledge responses. If such an acknowledgement is not received,
then again the primary halts. Properties Pbl-Pb5 can again be shown to be true
if we make the above modification in Figures 5.2 and 5.4.
`The protocols described in this paper can also be extended to tolerate crash failures of clients.
We do not discuss these extensions in this paper.
76
5.3 Implementation for the various failure
models
In this section, we show how to implement the procedures initialize, broadcast1
delivery--Hprocess and failure--Hdetector for the various failure models. For each
set, we also show these procedures satisfy BO?Bll.
5.3.1 Crash Failures
The five procedures for crash failures are given in Figure 5.5. The notation A =? B
defines the symbol A to be the same as the symbol B. ?Vhenever we say that a
server "delivered m", we mean that the procedure deliver has been called with
parameter M having the value m. The procedure Enq adds an element to the
head of a queue and the procedure Deq dequeues an element from the tail of the
queue. Furthermore, in properties BO--HBli, d = 6 and C = 0.
Lemma 5.3.1 The procedures in Figure 5.5 satisfies BO-Bli
Proof:
BO: From Figure 5.5, any message that is enqueued in deliver must have been
broadcast earlier. Furthermore, each message can be enqueued (and hence
dequeued) at most once because each message can be received at most once.
Bi: From FIFO properties of the communications links and the Enq and Deq
operations on Rqueue?.
B2: From the fact that message delivery takes at most 6 and none of the pr?cedures
are blocking.
77
procedure initialize(k)
state?			Rqueue?			e
Vi : Faufty?[si] false
procedure broadcast(??, k)
send 1W to all servers
procedure deliver (M, k)
Let NI be of the form (tag, --H, --H)
if tag E flog,mylastlogj then Enq(Rqueue?, (NJ, k))
procedure delivery--Hprocess(k)
do forever
when received iW: deliver(iW, k)
od
procedure failure--Hdetector(k)
Aj? =? (alive,sj,1r)
cobegin
for 1			0 to oo
when current-time = ir: send A1k to all servers
I for 1 0 to oc
when current--time = tr + d:
Vj : if not delivered A)1 then Faulty?[sj?			true
coend
Fi?ure 5.5: Procedures for crash failures
B3: From the implementation of broadcast.
B4: Follows from Figures 5.4 and 5.5, since they are set to false only in procedure
initialize.
B5: Follows from Figures 5.4 and 5.5 and the second when statement in the
failure--Hdetector.
B6: Suppose ?k sets Faulty?[sj1 = true at lr + d < t. This implies that 5? did not
send (alive, Sj, ir) to ?k and therefore 5? must have halted by time lr < t.
78
B7: Let (1 --H 1)7 < t < 17. Therefore there was no (alive, 5k, 17) message sent
and so from Figure 5.5, sj will set Faultyj[s?] = true by time 17 + d =
(1 --H 1)7 + 7 + d < t + 7 + d.
B8: Follows t'rom Figure 5.5 because the code does not contain any stop statement.
B9: Let 17 < t ? (1+ 1)7 for some 1 > --H1, which implies that = 1 + 1. By
assumption, 55 initiates a successful broadcast at time t and so must have
sent (alive,s5, 1'7) for all 1' < I which in turn were delivered at all non--Hhalted
servers by 1'7 + ?. Thus, from Figure 5.5 no server can set Faulty?[sj] = true
before time (l+1)7+?= %rt]7+6
BlO: If b was initiated at time t' < t then b must have been dequeued by all
non--Hhalted servers at time t' + ? ? t + ? < [?t]7 + 6.
Bil: Let 17 < t < (1 + 1)7 for some 1 > --H1, which implies that = 1 + 1.
By assumption, 55 initiates a broadcast at time t and so must have sent
(alive,s5, 1'7) for all 1' < 1 which in turn were delivered at 5k by l'7+?. Thus,
from Figure 5.5 5k cannot set Faulty?[sj] = true before time (1 + 1)7 + ?
El
Theorem 23 The lower bounds for crash failures shown in Table 4.1 are all ti9ht.
Proof: Properties BO--HB11 are met for n? > f which is optimal. The pro-
cedure broadcast does not block and so C = 0 which is optimal. Finally, since
Lemma 5.2.17 holds for all 7> 0, the failover time is also optimal because d = 6. 0
79
5.3.2 Crash+link failures
The procedures in Section 5.3.1 do not satisfv all of BO--Hl3ll when links are not
assumed to be failure--Hfree. For example, if 8? sends a message to 5k then the
message might not reach 5k due to a link failure, leading to a violation of B6 or BlO.
?Ve therefore replace the implementation of the procedures initialize, broadcast,
delivery--Hprocess and failure--Hdetector in Figure 5.5 with the implementation in
Figure 5.6. We define d = 2?, C = 0 and assume that n? > f+l. The procedures in
Figure 5.5 use the routines fifo--Hbroadcast and fifo--Hdeliver in Figure 5.7. In the
rest of the paper, whenever we say that a server "fifo--Hbroadcasts rn" ("fifo--Hdelivers
m"), then we mean that the procedure flfo--Hbroadcast (fifo--Hdeliver) was called
with the parameter m. The procedures fifo--Hbroadcast and fifo--Hdeliver convert
an intermittent link failures into a permanent link failure: if 5? fif?broadcasts a
message m to sk, then for every subsequent message m' that Sj fifo--Hbroadcasts to
5k, 5k will not fifo--Hdeliver m' unless it first fifo--Hdelivers m.
Lemma 5.3.2 If sj flfo-broadcasts m and then fifo--Hbroadcasts Tn', then 5k cannot
fifo-deliver m' before it fifo--Hdelivers m.
Proof:			Follows from Figure 5.7.			E
Lemma 5.3.3 If s? flfo-broadcasts m at time t and does not halt during the fifo-
broadcast, and f 5k does not fifo-deliver Tn, then either 5k has halted by time t + 6
or the link from sj to sk is faulty.
Proof: ?Ve assume that 5k has not halted by time t + 6 and show that the link
from sj to 5k must be faulty. Let m' be a message that sj fif?broadcasts at time
80
procedure initiaiize(k)
state?			1?quetiek			Dqueue?
Vi : Faultyk[si] false
last-sent?			Vj :expected?[j]			0
procedure broadcast(iN4, k)
time :=current--Htime
fifo?broadcast(Init, Al, 5k, time)
procedure deiivery-process(k)
cobegin
II fifo-deiivery-process(k)
do forever
(tag, jkI, --H, t) :=Deq(Dqueue?)
if tag = init then fifo-hroadcast (echo, iki, sk,t)
if tag = echo and not delivered M before then deliver (M, k)
od
coend
procedure failure--Hdetector(k)
A,' =A (aiive,s5,lr)
cobegin
for 1 := 0 to oc
when current-time = iT: fifo-broadcast(init, A1k, 8k, iT)
for i := 0 to cc
when current-time = ir + d:
Vj: if not delivered A,1 then Faulty?[sj] := true
coend
Fi?ure 5.6: Procedures for crash+link failures-i
81
t (m' could be m). Furthermore, all messages that were fifo--Hbroadcast before
m' (call this set ?\4) by sj have been fifo--Hdelivered by 5k by time t' + ?, whereas
m is not fifo--Hdelivered until time t' + ?. If it is the case that m is not received
by 5k by time t' + ?, then the link must be faulty, and we are done. On the other
hand, if m' is received, then m' must be received after ?? because the links are
FIFO. This, however, implies from the implementation of fifo--Hdelivery--Hprocess
that m' should have been fifo--Hdelivered, a contradiction.
procedure fifo-broadcast(tag, ?I, 5k,
send (tag, NI, 5k, t,last-sent?) to all
last-sent? :=last-sent? + 1
procedure fifo-deliver (tag, I?I, Sj, t)
Enq(Dqueue?, (tag, M, sj, t))
procedure fifo-delivery--Hprocess (k)
do forever
when received (tag, A', 5?, t, laatj):
if (lastj #expected?[j]) then skip
else
od
expected?[j]			expected?[ji + 1
fifo-deliver (tag, A', sj, t)
Fi ure 5.7: Procedures for crash+link failures-ii
Lemma 5.3.4 If s? fifo-broadeasts (init,A',sj, t) and does not halt during the
flfo-broadcast, then all servers that have not halted by time t + 2? fifo-deliver
(echo, A', --H,t) by that time.
Proof: Since sj does not halt during the flfo--Hbroadcast, Sj must have sent
(init, A', 5k, t) to all servers, and all servers 3j that fifo-delivered (init, A', 5k, t)
82
either sent (echo, ill, sj, t) by time t + 6 to all servers or halted. We show that
any server sk that does not halt by time t + 26 fifo-delivers at least one copy of
(echo, ?, --H, t) by that time. Consider the following n? --H 1 < f + 1 communication
paths: < 5j, 50, 5k >, ., < Sj, 5j--H1, 5k >, < 5j' ?j+1, 5k > o+,< 5i, 5ns--HI? 5k >
Since there can be at most f failures, at least one communication path (say
< sj, Sj, 5k >) must remain intact i.e. Sj has not halted by time t + 26 and
the link from aj to Sj and the link from sj to 5k are both non--Hfaulty. The lemma
now follows by applying Lemma 5.3.3 twice. E
Lemma 5.3.5 The procedures in Figures 5.5 and 5.6 satisfy BO-Bli.
Proof:
BO: Any message that is dequeued must have been broadcast earlier. The fact
that a message is dequened at most once can be easily seen from the imple-
mentation of delivery--Hprocess.
Bi: Suppose that 5j initiates the broadcast b' after it initiates the broadcast
b. By Figure 5.7, sj cannot fifo--Hbroadcast (init, &`, aj,--H) before it fifo--H
broadcasts (init, b, s?, --H). By Lemma 5.3.2, all servers Sj must fifo--Hdeliver
(init, b, 3?, --H) before it fifo--Hdelivers (init, b', 5?, --H). Hence, by Figure 5.6, Sj
does not fifo--Hbroadcast (echo, ??, Sj, --H) before it fifo--Hbroadcasts (echo, b, Sj,
--H). Therefore, by Lemma 5.3.2 all servers 5k must fifo--Hdeliver (echo, b, Sj? --H)
before it fifo--Hdelivers (echo, b', Sj, --H). The 1emma now follows from the FIFO
properties of Enq and Deq.
83
B2: Suppose that a server sk dequeues b. Therefore, some server ?? must have
fifo--Hbroadcasted (echo,b,Sj,--H) by time t + 6 and sk must have fifo--Hdelivered
this by time t + 26 and therefore from Figure 5.6 delivered b by that time.
The result now follows because sk dequeues 6 as soon as it is delivered.
B3: From the implementation of broadcast.
B4: Follows from Figures 5.4 and 5.6, since they are set to false only in procedure
initialize.
B5: Follows from Figures 5.4 and 5.6 and the second when statement in the
failure--Hdetector.
B6: Suppose Fau1tyi[s?] is true for the first time at t = iT + 26. This, therefore
implies from Figure 5.6 that ?? did not deliver Atk and therefore did not
fifo--Hdeliver (echo,A?,--H,ir) by time iT + 26. By Lemma 5.3.4, sk must have
halted by time ir.
B7: Let (i--H1)r <t < ir. Therefore there was no fifo--Hbroadcast of (init,A'k,ak, ir)
and so from Figure 5.6, a5 will set Fauitm[s?] = true by time ir + 26 =
(1 --H l)r + r + 26 < t + r + 26.
B8: Follows from Figure 5.6 because the code does not contain any stop statement.
B9: Let ir < t <(i + l)r for some 1>--Hi which implies that [?rtl =1+1
sj must have fifo-broadcasted (init,A3?,s&,i'r) for all 1' < 1,. From Lemma
5.3.4, s? will fifo-deliver (echo,Aj'1,Sk, i'r) by i1r + 26 and so deliver A,'1.
Therefore, s? cannot set FaultYk[8ii to true before time (1 + l)r + 26 =
[rtlr+26
84
BlO: From B9, Faulty?[sj] cannot become true before time t + 2?. If on the other
hand, Fautty?[sjl does become true after this time, then using Lemma 5.3.4
and B2, 5k must have dequeued b by this time.
Bit: Let tr < t < (l+1)rforsomel> --Hl which implies that [fi = 1+1 5? must
have flfo-broadcasted (init, Aj1I, 5k, l'r) for all 1' < 1'. From Lemma 5.3.4, 5k
will fifo-deliver (echo, Aj1I, 5k, t'T) by l?7 + 2? and so deliv& A,11. Therefore,
5k cannot set Fautty?[sj] to true before time (1+ 1)r + 2? = [rtlr + 2?.
E
Theorem 24 The tower bounds for crash+linkfadures shown in Tabte 4.1 are ati
tight.
Proof: Properties BO--HBil are met for n3 > f which is optimal. The procedure
broadcast does not block and so C = 0 which is optimal. Finally, since Lemma
5.2.17 holds for all r > 0, the failover time is also optimal because d = 2?. 0
5.3.3 Send--HOmission failures
The procedures of Figure 5.5 do not tolerate send?mission failures. Procedures
for send?mission failures are given in Figures 5.8 and 5.9, used in conjunction
with delivery--Hprocess of Figure 5.6. These procedures use a technique similar to
the translation technique described in [NTS8]. However, we do not use the round
model of computation, and our technique is adapted to work with a separate failure
detector. Informally, this technique ensures that if a server omits to send a message,
then it detects its own failure and halts. This is achieved by making each server
Sj broadcast every r a message saying that it is alive. If some server Sj does not
receive this message within ?, then Sj must have committed a send?omission failure.
Then 5? broadcasts this fact, and when Sj learns of this, it crashes itself. Thus,
if a correct server observes the send?omission failure, then Sj will stop, effectively
translating the send?mission failure into a crash failure.
For these procedures, we will define d = 2? and C = T + 2? and assume that
?s > f
procedure initialize(k)
state?			Rqueue?			Dqueue?			e
current--Hprimary:=O
Vi : Fautty?[sj] false
tast-sent?			Vj :expectedj			0
procedure broadcast(M, k)
time cttrrent--Htime
if [t??r?]T = time then t := time + r
else t := time
fifo-broadcast( init, INI, s?,time)
wait until current-time= T + d
procedure deliver (i?, k)
Let M = (tag, si, --H)
if tag ? f1og,my1ast1og? then
if I <current--Hprimary then return
else
current--Hprimary: J
Enq(Rqueue?, M)
Fi ure 5.8: Procedures for send?mission failures-i
As can be seen, the procedure broadcast always blocks at least for time 2?, even
S6
procedure failure.det?ector(k)
=? (alive,sj,lr)
= (?au1t,sj,lT)
cobegin
for 1			0 to oc
when current--time = IT:
fifo-broadcast(InIt, A'k, 3k, iT)
I fort			0 to oo
when current--Htime = lT + ?:
Vj : if not fifo--Hdelivered (init, A,1-, sj, lr) then
fifo-broadcast (echo, i$I?, 5k,tT)
I for 1 =0 to oc
when current--time = lT + 2?:
if fifo--Hdelivered (echo, Fk1, --H,iT) then stop
if ?j: not delivered A,?- then Fau1ty?[aj? := true
coend
Fi?ure 5.9: Procedures for send?omission failures-ii
if there are no failures. However, this can be optimized so that the blocking occurs
only for a round trip message delay (which can be less than 2?) in the absence of
failures. We do not show this optimization in this paper.
Lemmas 5.3.2 continues to hold for send?mission failures.
Lemma 5.3.6 Jf a correct server fifo-broadcasts m at time t, then all servers
that haven't halted by time t + ? fifo-deliver m.
Proof: In this failure model, there are no receive?mission failures. The lemma
now follows from Figure 5.7 because no non--Hhalted server can omit to receive mes-
sages.			E]
Lemma 5.3.7 The procedures in Figures 5.8 and 5.9 satisfy properties BO--HBli
Proof:
87
BO: Same as crash+link failures.
Bi: Suppose that sj initiates the broadcast b' after it initiates the broadcast b. By
Figure 5.7, Sj cannot flfo--Hbroadcast (init, b', s?, --H) before it fifo--Hbroadcasts
(init, b, Sj, ). By Lemma 5.3.2, a server Sj must fifo--Hdeliver (init, b, ??, --H)
before it fifo--Hdelivers (init, 6', s?, --H). Hence, by Figure 5.6, Sj does not fifo--H
broadcast (echo, 6', Sj, --H) before it fifo--Hbroadcasts (echo, 6, Sj, ). Therefore,
by Lemma 5.3.2, a server s? must fifo--Hdeliver (echo, 6, Sj, --H) before it fifo--H
delivers (echo, 6', s?, --H). Now, if sk enqueues ?? in the queue Rqueue?, it
must have also enqueued 6 earlier because current--Hprimary never decreases.
The lemma now follows from the FIFO properties of Enq and Deq.
B2: Same as the proof for crash+link failures.
B3: Follows from the implementation of broadcast in Figure 5.8.
B4: Follows from Figures 5.4 and 5.9 because it can set to false only in initialize.
B5: Follows from Figures 5.4 and 5.9 from the third when statement in the
failure--Hdetector.
B6: By Lemma 5.3.6 all servers will deliver (alive, Sj, ir) for all t if s? is cor-
rect and so can never set Fau1ty?[sj] = true. So assume that Sj is faulty
and a server sk does not deliver (alive, sj, U), where tr + 2? < t. This
now implies (again from Lemma 5.3.6) that no correct server (say q) would
have fifo-delivered (init, (alive, si, ir), 5j, ir) and therefore, would have fifo-
broadcasted (echo, (?ault, 5j, ir), q, tr). Server sj would have received this
message and therefore have halted by time ir + 2? <t.
137: Same as the proof for crash+link failures.
13S: Let us assume, for contradiction, that sk is the first correct server to halt at
time t = 1r+2? by executing a stop in Figure 5.9. sk must have fifo-delivered
(echo, Fkt, si, iT) from some server Sj at time ir+26. However, since sk hadn't
halted by time ir, if would have fifo-broadcasted (init, A'k, s?, ir) and from
Lemma S.3.6,s? would have fifo-delivered this by time ir + ? and so couldn't
have fifo-broadcasted (echo, Fk?, Sj, ir), a contradiction.
139: Assume that the property is false for a successful broadcast b and sk sets
Fauityj[si] = true first at time ir + 2?. This implies that ?k did not deliver
(alive, si, ir), which implies that neither sk nor any correct server would
have fifo-delivered (init, (alive, Sj, ir), Sj, ir) (by Lemma5.3.6). Now by as-
sumption, Sj must have fifo-broadcasted (init, b, s?, t) after fifo-broadcasting
(init,(alive, Sj, ir
), si, ir) and (by Lemma 5.3.2) neither sk nor any correct server can fifo-
deliver (init, b, sj,t). If [--H??r = t then let t' = t + r, else let t' = t. Since
no correct server (say q) fifo-delivered (init, b, Sj, t), q will not fifo-deliver
(init,(alive,??, [L?]r),sj, [L?]r) (by Lemma5.3.2). This, however, implies
that ?j would have halted by time r + 2? (Figure 5.9), a contradiction
since I) is a successful broadcast.
1310: Let b be initiated at time t6 ? t. From Figure 5.4 and 139, no broadcast by
51,1 > j can be initiated before time t? + 2? and so can only be delivered
after tb + 2?. Therefore, from the implementation of deliver in Figure 5.8,
1310 holds if we can show that 5k fifo-delivers (echo, b, --H, tb) by time tb + 2?.
89
If s? is correct, then this is trivially true (by Lemma 5.3.6) because all servers
will fifo-deliver (init, b, 5?, tb) by time tb + ?. So assume that Sj is faulty. If
--H tb then let t' = tb + r, else let t? = t6. Thus, sj waited until time
T + 26 and did not halt at this time. This, however, implies that a correct
server (say q), would have fifo-received (init,(a1ive,sj,[L?]7),sj,[L?]r),
which then implies that q would have fifo-delivered (init, b, 5?, t6) and fifo-
broadcasted (echo, b, q, t6) by time tb + 6. Therefore, from Lemma 5.3.6, all
non--Hhalted servers would have fifo-delivered (echo, b, q, tb) by time tb + 26.
Bil: Follows trivially from the implementation of deliver in Figure 5.8.
0
Theorem 25 The lower bounds for send-omission failures shown in Table 4.1 are
all tight.
Proof: Properties BO--HBil are met for n? > f which is optimal. From the
procedure broadcast, C = r + 26 for any T > 0, which is optimal. Finally, since
Lemma 5.2.17 holds for all T > 0, the failover time is also optimal because d = 26.
0
5.3.4 Receiv?Omission failures
The protocol for tolerating receive?mission failures is obtained by taking the pro-
tocol for send?mission failures and replacing broadcast and failur?detector
with the procedures in Figure 5.10. As with send?mission failures, these proce-
90
dures were obtained using a technique similar to that described in [NT88]. Infor-
mally, a server Sj broadcasts every T a message saying that it is alive. Any server
that receives this message forwards it to all other servers. A server sk keeps track
of the number of forwarded `Jive" messages in an array witness?[i?. If sk does
not receive enough forwarded copies of the "alive" message within 2?, then s? must
have committed at least one receive?omission failure. Server sk detects this fact
and crashes itself, effectively translating the receive--Homission failure into a crash
failure.
For these procedures, we define d = 26, C = 0 and assume that fls > 2f.
procedure broa?dcast(iNI, k)
time			current--Htime
fifo-broadcast(init, :?, s?,time)
procedure failure-detector(k)
A (a1ive,s?,1r)
FJ A (fault, Sj, ir)
cobegin
for 1 := 0 to cc'
when current-time = 1r: fifo-broadcast(init, A'k, sk,tT)
I for 1 := 0 to cc'
when current-time = ir + 6:
Vj : if not fifo--Hdelivered (init, A,1, a?, ir) then
fifo-broadcast (echo, FJ, sk, ir)
for 1 := 0 to cc'
when current-time = 1r + d:
w?tness?[k] := (s? fifo--Hdelivered (echo,A?k, si, 1r)1
Vj # k: witnes??[5? := fs? fifo--Hdelivered (echo,A,',s?,1r) or
fifo--Hdelivered (echo,1$'.,Sj,lr)t
if 95: Iwitness?[j1I < ns --H f then stop
if 95: not delivered A,1 then Fau1ty?[sj] := true
coend
Fi?ure 5.10: Procedi:			f?			r?1i
-Li:?? jr receive?ir???sion ??ures
91
Lemma 5.3.6 no longer holds because a faulty server might omit to receive by
receive?mission. ?Ve, therefore, strengthen it as follows.
Lemma 5.3.8 If a serverfifo-broadcasts m at time t and does not halt, then all
correct servers fifo-deliver m by time t + 6.
Proof Outline: From Figure 5.7 and the facts that there are only receive--H
omission failures and the channels are FIFO, a correct server can skip messages
only if it halts. However, B8 proved below shows that a correct server can never
halt.			E
Property B8 will be proved below by using only the following weak form of
Lemma 5.3.8.
Lemma 5.3.9 If a correct server flfo-broadcasts m at time t and does not halt,
then alt correct servers that have not hatted by time t + 6 fifo-deliver m by time
t+6.
Proof Outline: From Figure 5.7 and the fact that the channels are FIFO. c
Lemma 5.3.10 The procedures in Figure 5.10 satisfy properties BO--HBli.
Proof:
BO: Same as send?mission failures.
Bi: Same as the proof for send?mission failures.
B2: Same as the proof for crash+link failures.
92
B3: From the implementation of broadcast since the maximum time blocking can
occur is T + 26.
B4: Follows from Figures 5.4 and 5.10 because it can become false only in initial-
ize.
B5: Follows from Figures 5.4 and 5.10 and the third when statement in the
failure--Hdetector.
136: Assume that sk has not halted by time t. ?Ve prove that Fauttw[s?] is false at
time t. Thus, for all 1: tr < t, sk has fifo-broadcast (init, (alive, sk, 1r), --H,
--H) at time iT. From Lemma5.3.8, all correct servers fifo-deliver (init, (alive
sk, iT), --H, --H) by time ir + 6. Thus no more than f servers fifo-broadcast
(echo, ( fault, sk, ir), --H, --H). Since 2f < n3, f ? --H f and so no server
fifo-delivers sufficient (echo, (fault, sk, ir), --H, --H) messages to set Fauityj[s?]
= true before time t + 26.
B7: Same as the proof for crash+link failures.
138: Let us assume, for contradiction, that sk is the first correct server to halt at
time t = ir+26 by executing a stop in Figure 5.10. Therefore, witness?[5i ?
--H f for some
Case j = k: Since ?k did not halt at time ir, it must have fifo-broadcasted
(init, A?, --H, ir) at time ir. Since no correct server has halted by time ir+6,
by Lemma 5.3.9, all correct servers will fifo-deliver this message by time ir+6.
Since there are ?s --H f correct servers, all these servers will fifo-broadcast
(echo, Ak', --H, ir). Server sk can execute a stop statement at time ir + 26
93
only if it did not fifo-deliver ?s --H f copies of (echo, A'k, --H, Ir). However, since
--H f copies of (echo, A?k, --H, [r) were sent and s? is correct, from Figure 5.7,
sk can omit to fifo-deliver any one of these copies (say from server si) only if
it had omitted to fifo-deliver an earlier message from Sj. However, this cannot
happen since sk is correct and the channels are FIFO. Thus sk cannot execute
the stop, a contradiction.
Case j ? k. Similar to the previous case because each correct server woUd
have fifo-broadcasted (echo, Aj1, --H, ir) or (echo, FJ, --H,lr) by time tr + ?,
and s? would have fifo-delivered this. Since there are at least n3 --H f correct
servers, sk cannot halt.
B9: Let ir < t < (1+ l)r. We show that Faulty?[sj] = false at time ir+2? and
therefore can become true only at time (1+ l)r+2?.
Since t > ir, for all 1': 1' < 1, 5?has fifo-broadcast (init, (alive, sj,t'r),--H,
) at time l'r. From Lemma 5.3.8, all correct servers fifo-deliver (init, (alive
si ,i1r),--H, --H) by time t'r + ?. Thus no more than f servers fifo-broadcast
(echo, (fault, Sj, lr),--H,--H). Since2f< ?s, f < ns?fandsos?cannotfifo-
deliver sufficient (echo, (fault, 5j, i1r),--H,--H) messages to set Faultyj[s?1 =
true before time (1 + l)r + 2?.
BlO: Let b be initiated at time t6 < t and let ir < tb < (1 + l)r. Assume first
that tb > ir. We show that 5k cannot omit to dequeue b. Again, as for
send?mission failures, we need to prove that 5k would have fifo-delivered
(echo, b,--H,t6) by time tb + 2?.
All correct servers must have fifo-delivered (init, b,5?,t6) (by Lemma 5.3.8)
94
and flfo-broadcasted (echo,b, --H,tb) before time (l+l)r+?. Since sk did not
halt by time (1 + l)r + 2?, it must have fifo-delivered (echo, (alive, ?j, (1 +
--H, --H) or (echo, (fault, ??, (1 + i)r), --H, --H) from at least ?s --H f servers.
Since ?s > 2f, at least one of these servers (say ?j) is correct and, therefore s?
is one of the servers that would have fifo-broadcasted (echo, b, --H, tb) earlier.
This, however, implies from Lemma 5.3.2 that ?k cannot omit to fifo-deliver
(echo, b, --H, tb).
Assume now that t6 = ir. First assume that Sj had not fifo--Hbroadcasted
(init, (alive, s1, tT),--H,--H) before (init, b, s?, t6). All correct servers must
have fifo-delivered (init, 6, s?, tb) (by Lemma 5.3.S) and fifo-broadcasted
(echo, 6,--H, t6) by time iT + ?. If sk did not halt by time IT + 2?, it must
have fifo-ddivered (echo, (alive, Sj, 1?), --H, --H) or (echo, (fault, sj,iT),--H,--H)
from at least ?s --H f servers. Since ?s > 2f, at least one of these servers (say
sj) is correct and, therefore Sj is one of the servers that would have fifo-
broadcasted (echo, 6,--H, t6) earlier. This, however, implies from Lemma 5.3.2
that sk cannot omit to fifo-deliver (echo, 6,--H, t6).
Finally, assume that s5 had tifo--Hbroadcasted (init, (alive, sj, ir), --H, --H) be-
fore (init, 6, Sj, tb). All correct servers must have fifo--Hbroadcast (echo, 6,--H, tb
), and therefore, from Lemma 5.3.2, they must have fifo--Hbroadcast (echo, (
alive, Sj, i?T), --H, --H) earlier for all 1' ? 1. Since ?s > 2f, sk cannot get ?s --H f
copies of (echo,(,fault, sj, 11r), --H, --H) for any 1' ? 1 and thus Fautty?[?j]
can become true earliest at time (1+ 1)r+2?. If s? fifo-delivers (echo, 6,--H, t6)
by time t? + 2?, then we are one. On the other hand, if sk omits to fifo-deliver
(echo, 6,--H, tb) by time t6 + 26 then it will never deliver 6 (by B2). For this
95
case, we show that 5k would halt by time (1 + 1)T + 2?. If fl0t, 5k must
have fifo-delivered (echo, (alive, s?, (1 + 1)r), --H, --H) or (echo, (fault, s?, (1 +
--H, --H) from at least n? --H f servers which was fifo-broadcasted after
(echo, b, --H, t6) was fifo-broadcasted. However, since ?k omitted to fifo-deliver
(echo, b, --H, t6) and n? > 2f, it couldn't have fifo-delivered all these messages
(by Lemma 5.3.2).
Bil: As for send?mission failures.
E
Theorem 26 The lower bound for blocking time when n? > 2f, and the lower
bound for failover over time in Table 11 for receive--Homission failures are tight.
Proof: The procedures have n3 > 2f and C = 0, and so the corresponding
lower bound on blocking time is tight. Also, from Lemma 5.2.17, the lower bound
on failover time is tight because d = 2?.			0
5.3.5 General--HOmission failures
Our last failure model is general omission failures. The broadcast procedure is
given in Figure 5.11. The remaining procedures are the same as receive?mission
failures. For this case, we define d = 2?, C = 2? and assume that n3 > 2f.
Lemma 5.3.11 The proceduresfor general--Homission failures satisfy properties BO
Bil
Proof:
96
procedure broadcast(i?, k)
time			current--Htime
fifo-broadcast(injt, ?I, sk, time)
if by time + d fifo--Hdelivered			(echo,??, sj,time)
for at least n3 --H f different j then return
else stop
Fi?ure 5.11: Procedures for ?eneral?mission failures
BO: Same as send--Homission failures.
Bi: Similar to send?mission failures.
B2: Similar to receive?mission failures.
B3: From the implementation of broadcast.
B4: Follows from Figures 5.4 and 5.10.
B5: Follows from Figures 5.4 and 5.10.
B6: Suppose that sk sets Fau1ty?[sj? = true at time tT + 2? < t because it did
not deliver (alive, sj, ir). If ?k did not deliver (alive, si, tr), it must have
fifo-delivered at least n3 --H f (echo, (fault, s5,tr),lr,--H). Since ?s > 2f, sj
could not have fifo-delivered n3 --H f (echo, (alive, Sj, lr),--H,--H) and would
have, therefore, halted by time ir + 2?.
B7: As in send?mission failures.
B8: Same as Section 5.3.4 if 5k halts in Figure 5.10 because Lemma5.3.9 continues
to hold for general?mission failures. So assume that 5k halts in broadcast
in Figure 5.11. However, again the argument is the same because at least
--H f correct servers would have fifo-broadcasted (echo, iW, --H,time) and 5k
would have fifo-delivered these.
97
B9: Suppose that s?sets Fau1ty?[sj1 = true at time lr+2?. Then sk did not deliver
(alive,Sj,17) and must have fifo-delivered at least n3--Hf (echo,(fault,?j, ir)
--H, --H). If s? fifo-broadcasted (init,b, Sj, t) after time Ir, it must have previ-
ously fifo-broadcasted (init,(alive,si, tr), sj, tr). Since b was successful,??
must have fifo-delivered at least ns?f (echo,b, Sj, t) from servers fsj?. There-
fore, by Lemma 5.3.2, tsil must have fifo-delivered (init,(alive,Sj,ir), Sj,
Ir), and therefore could not have fifo--Hbroadcasted (echo,(fault,sj,tr),sj,--H).
Now since ?s > 2J, sk could not have fifo-delivered at least ?s --H f (echo,
fault,si,ir),--H,--H), a contradiction.
BlO: Let b be initiated at time tb < t and let ir < tb < (1 + l)r. Assume first
that t6 > ir. We show that sk cannot omit to dequeue b. Again, as for
send?mission failures, we need to prove that sh would have fifo-delivered
(echo,b, --H,tb) by time tb + 2?.
From Figure 5.11, at least ns--Hf servers must have fifo-delivered (init,b, sj, tb
) and fifo-broadcasted (echo,b, --H,tb) before time (1+1)r+?. Since sk did not
halt by time (1 + l)r + 2?, it must have fifo-delivered (echo,(alive,Sj,(1+
or (echo,(fault,si, (1 + I)r),--H,--H) from at least n3 --H f servers.
Since ?s > 2f, at least one of these servers (say si) is one of the servers that
would have fifo-broadcasted (echo,b, --H, tb) earlier. This, however, implies
from Lemma 5.3.2 that 5k cannot omit to fifo-deliver (echo,b, --H, t6).
Assume now that t6 = ir. First assume that Sj had not fifo--Hbroadcasted
(init,(alive,si, ir ),--H,--H) before (init,I),5j,t6). At least ns --H f servers
must have fifo-delivered (init,b, s?, tb) and fifo-broadcasted (echo,b, --H, t6)
by time 1r + 6. If 5k did not halt by time tr + 26, it must have fifo-
98
delivered (echo, (alive, si, iT'),--H, --H) or (echo, (fault, s5,'r),--H,--H)fromat
least n8 --H f servers. Since n? > 2f, at least one of these servers (say aj)
is one of the servers that would have fifo-broadcasted (echo, b, --H, t6) earlier,
This, however, implies from Lemma 5.3.2 that sk cannot omit to fifo-deliver
(echo, & --H, t6)
Finally, assume that sj had fifo--Hbroadcasted (init, (alive, si, ir), --H, --H) be-
fore (init, b, s?, t6). At least ?s --H f servers must have fifo--Hbroadcast (echo, b,
--H, t6), and therefore, from Lemma5.3.2, these servers must have fifo--Hbroadcast
(echo,(alive, Sj, Pr),--H, --H) earlier for all 1' ? 1. Since ?s > 2f, sk cannot
get fls --H f copies of (echo,(,fault,s1,Pr),--H,--H) for any ii < 1, and thus
Fautty?[sj] can become true earliest at time (1 + 1)r + 2?. If 3k fifo-delivers
(echo, b, --H, t6) by time tb + 26, then we are one. On the other hand, if ?k
omits to fifo-deliver (echo, b, --H, t6) by time t6 + 26 then it will never de-
liver b (by B2). For this case, we show that ?k would halt by time (I +
1)r+26. If not, ?k must have fifo-delivered (echo,(alive,s?,(i+1)r), --H,--H)
or (echo, (fault, sj, (I + 1)r), --H, --H) from at least ?s --H f servers which was
fifo-broadcasted after (echo, b, --H, t6) was fifo-broadcasted. However, since
sk omitted to fifo-deliver (echo, b, --H, tb) and ?s > 2f, it couldn't have fifo-
delivered all these messages (by Lemma 5.3.2).
Bli: As in send?mission failures.
0
99
Theorem 27 The lower bounds in Table 4.1 for general--Homission failures are all
tight.
Proof: Properties 130 1311 are met for ?s > 2f which is optimal. From the
procedure broadcast, C = 2?, which is optimal. Finally, since Lemma 5.2.17
holds for all r > 0, the failover time is also optimal because d = 2?.			0
104
procedure primary()
cobegin
II inform--Hciients("Dest = so")
II status Olir
do forever
when received request from client c
response fl(stateo,request)
state0 stateo 0 response
status :=broadcast((log, so, (response,c)),0)
if status = NOTOK then
faiiure--Hmode--Hprimary(O)
od
coend
procedure backup()
do forever
((tag,5o,(r,c)),--H) := Deq(Rqueuei)
/? logging response from primary *1
if tag = log then
state1 := state1 0 r
send r to client c
/* becoming the primary ?/
if Faulty1 [so] then fai!ur?mode--Hprimary( 1)
od
procedure faiiure--Hmode--Hprimary(k)
cobegin
II inform--Hciients("Dest = 5k1,)
do forever
when received request from client c
response := fl(state?,request)
state? := state? 0 response
send response to client c
od
coend
_______Fi ure 6.2: Procedures for send?mission failures n? = 2, = 1 -i
105
procedure initialize(k)
state? := Rqueue?			Dquezte?
Vi: Fau1ty?[sj] false
procedure broadcast(?N1, k)
j k
time current--Htime
fifo-send(init, ??, 5k, time)
if by time + 2?: fifo-delivered (echo, ?i, si, time)
then return(OK)
if = time then t := time + T
else t := time
when current--Htime= [?rtl T + 2?:
return(NOTOK)
procedure deliver (M, k)
Let M = (tag, --H, --H)
if tag = log then Enq(Rqueue?, (?Ni, k))
procedure delivery-process(k)
3 := k
cobegin
II fifo-delivery.process(k)
do forever
(tag, iki, sj, t) := Deq(Dqueue?)
if tag = init then
fifo-send (echo, M, sk,t)
deliver (M, k)
od
coend
Figure 6.3: Procedures for send?rnission faiUres(n? = 2, f = 1)-ji
106
other server. Lemma 5.3.2 continues to hold with fifo--Hbroadcast replaced by
fifo--Hsend. As required by Corollary 6.0.1, this protocol is also "pass-the-buck"
Again we have assumed that there are no omission failures between the clients and
the servers. This assumption can be removed in a manner similar to Chapter 5.
procedure faiiure-detector( k)
Ajt =? (alive,s5,ir)
=? (fault, si, iT)
j			k
cobegin
for 1			0 to oc
when current--time = ir:
fifo-send(init, Ak', 5k, ir)
for 1=0 to oc
when current-time = ir + ?:
if not fifo-delivered (init, A,', Sj, ir) then
fifo-send (echo,FJ,5k,ir)
for 1=0 to oc
when current--Htime = ir + 2?:
if fifo-delivered (echo, Fk', Sj, ir) then stop
if not delivered A,' then Fau1ty?[sj] := true
coend
Figure6.4:Proceduresforsend?missionfaiUres(n = 2, = 1 -iii
There are two servers--Hs0 and si. Initially, so executes the procedure primary
and ?i executes the procedure backup. Since the protocol is "pass the buck"
initially the backup sends responses to the clients.
As long as there are no failures, it can be shown that the procedure broadcast
(described later) continues to return the status of OK. However, if a status of
NOTOK is returned, then again it can be shown that the backup si has failed
either by crashing or by send?mission. In this case, since f = 1 and 51 is faulty, so
must be correct and will never crash. Therefore, so continues to be the primary by
107
calling failure--Hmode--Hprimary. Since 5o is correct, it can be shown that ?1 will
never become the primary. As a result, so does not send information about state
updates to ?1 in failure--Hmode--Hprimary. Furthermore, now it is the primary
(i.e. so) that sends responses to the clients.
On the other hand, if the backup 51 is correct (notice that in this case broad-
cast can never return a status of NOTOK) and detects that the primary 5o has
crashed (by using the failure--Hdetector in Figure 6.4), then 51 becomes the pri-
mary by calling failure--Hmode--Hprimary.
As mentioned above, the procedure broadcast is used to send state updates to
?1 as long as 5o is executing the procedure primary. The broadcast procedure
sends a message to si, and then waits for an acknowledgement. If an acknowl-
edgement does not arrive, then a failure must have occured. If ?o is faulty, then
the description given earlier requires that a status of NOTOK never be returned.
This is ensured because of the following reason. If ?o is faulty, then it must have
omitted to send the message to si. By Lemma 5.3.2, ?i will not fifo--Hdeliver the
next alive message that 5o sends, and therefore, will send the fault message in
Figure 6.4. As a result, it can now be shown that 5o will halt (either by crashing
or by fifo--Hdelivering the above fault message in Figure 6.4) before it can return a
status of IVOTOK in procedure broadcast.
On the other hand, if so is correct, then broadcast will return a status of
NOTOK as required.
6.1.1 Proofofcorrectness
We now formally show that the protocol in Figures 6.1,6.2, 6.3 and 6.4 satisfies
Pb1--HPb5.
108
Define:
o+ Prmy30 = so has not halted
0 Prmy?1 =? Faulty1 [so] = true A ?1 has not halted.
Theorem 30 The protocol in Figures 6.1,6.2, 6.3 and 6.4 satisfies Pb1
Proof: Suppose that Prmy?1 is true at some time t. Let Prmy?1 become true
for the first time at ir + 2? ? t because Faultyi[so] = true at time ir + 26 in
Figure 6.4. si must not have delivered A01, which implies that so is faulty (as
there are only send omission failures). Since s? is correct, it would have fifo-sent
(echo, F01, 51,--H) at time lr + 6, and ?o would have halted by timeir + 26. Thus
Prmy?0 cannot be true at time t > lr + 26. c
Theorem 31 The protocol in Figures 6.1,6.2, 6.3 and 6.4 satisfies Pb2
Proof: Follows from Figure 5.2.
Theorem 32 The protocol in Figures 6.1,6.2, 6.3 and 6.4 satisfies Pb3
c
Proof: Easy to see since server s? j = 0,1 executes inform--Hclients only after
it becomes the primary.			0
Lemma 6.1.1 Let k = I. If sj broadcasts b before b', then 5k cannot deliver b'
before b.
109
Proof: Suppose ?& delivered b'. Thus 5k fifo-delivered (init, b', Sj, ). Since s5
fifo-broadcasted (init, I), s5, --H) before (init, b', 5?, --H), by Lemma 5.3.2, sk would
have fifo-delivered (init, b, s?, --H) before (init, b', s?, --H), and hence delivered b be-
fore b'.			c
Lemma 6.1.2 iVo correct server ever halls.
Proof: For contradiction, assume that s5 is the first correct server to halt at
time t and let k =
This implies that t = tr + 26 because sj must have halted in Figure 6.8 af-
ter it fifo-delivered (echo, Fjt, sk? --H). Since sj is correct, it must have fifo-sent
(init,A?,sj, ) at time ir and sk must have fifo-delivered this by time ir + 6
or 5k must have baIted. However, this implies that 5k could not have fifo-sent
(echo, FJ, 5k, --H), a contradiction.			c
Lemma 6.1.3 If 5j, ? = 0,1, calls failure--Hmode--Hprimary, then sj is correct.
Proof: For contradiction, assume that sj is faulty and calls failure--Hmod?
primary at time t = ir + 26.
Case 1: j = 1. From Theorem 30, ?o must have halted and from Lemma 6.1.2
?i must be correct.
Case 2: j = 0. Assume for contradiction that so is faulty and ?i is correct.
From Figure 6.3, s? did not fifo-deliver (echo, M, si, --H) for some M which, since
?i is correct, implies that ?i did not fifo-deliver (init, i'J, so, --H) which so was
supposed to send at time t' (say). If [L;17 = t', then let t" = t' + r, else let
110
= t?. Therefore, from Figure 6.3, ?t?17 + 26 = t = lT + 26, or = 1. Since
51 did not fifo-deliver (Init, Al, so,--H), from Lemma 5.3.2, it will not fifo-deliver
(init, A01, so,--H), and so will fifo-send (echo, F0?,51,--H) This, however, implies that
5o would have fifo-delivered (echo, F01, si, --H) and halted by time IT + 26. Thus, so
cannot call failure--Hmode--Hprimary, a contradiction. E
Lemma 6.1.4 If Prmy30= false at time t, then Prmy?1= true by time t+T+26.
Proof: If so halts at time t, where (1 --H l)r < t < iT, then by Lemma 6.1.2,
so is faulty and si is correct and so sj will never halt. si will not deliver A01 and
become the primary by time iT + 26 ? t + r + 26.			0
Lemma 6.1.5 Let si become the primary. Then if so sent sequence of broadcasts
b0 o b1 ... bn then
1. si must have delivered b0 o b1 ... bn?1 by time $ 1
2. if bn = (log, 5o, (r?, --H)) and r? was sent to the client, then si must have
delivered bn by time $ 1.
Proof: If s? becomes the primary, then by Lemma 6.1.3, si must be correct,
and so must have halted before calling failure--Hmod?primary.
Since 5o must have broadcasted bi, i = 0... n --H 1, and did not call failure--H
mode--Hprimary, from Figures 6.2 and 6.3, so must have fifo--Hdelivered (echo, bj, si,
--H), which implies that 51 must have delivered bi. (1) now follows from Lemma
6.1.1.
111
(2) is trivially true since only ?i could have sent r? to the client before becom-
ing the primary.			c
Lemma 6.1.6 If si becomes the primary and so halt at time ? 0 < t 1, then
1. Init? E Final0
2. If Final0 = ,?? and R0 $ e, where R0 E ?`1o, then R0 was sent to the client,
3. If mit1 E Final0, then last(Finalo) was not sent to the client.
Proof: As before, so must have halted before it could have called failure--H
mode--Hprimary. Let Final0 = ro 0 ri ... 0 r?. If ?o broadcasted (log, 5o, (r?, --H))
then it must broadcasted (log, ?o, (r?, --H)), i = ... . n, and therefore from Lemma
6.1.5, (log,so,(ri,--H)), i = O...n--H 1, must have been delivered at si and (1) is
satisfied.
Now assume that so does not broadcast (log, ?o, (r?, --H)) and therefore halts.
Therefore, from Lemma 6.1.2, s? is faulty and so si is correct. s? must have broad-
casted (log, so, (ri, --H)), i = 0.. n--H1 and received (echo, (log, 5o, (ri, --H)), si). (1)
now follows from Lemma 6.1.1.
(2) can be seen to be true from Figure 6.2 because ?o must have fifo-delivered
(echo, (log, ?o, rj), si), i = ... n --H 1 and since ?i is correct, it must have sent r?,
i = ... . n --H 1, to the client.
(3) is trivially true because only si could have sent the response to the client. 0
Let p = flc(T + 2?) + ?.
112
Lemma 6.1.7 Jf 5o remains the primary through time t, then there can be at most
one request from any client i in the queue of 5o at time t.
Proof: Assume, for contradiction, that t2 < t is the first time that there are
two requests (say mi and rn2) from some client i in the queue of 5o Assume mi
was sent at time ti and so it was enqueued by t1 + 6. From Figure 5.2, 2p =
2(nc(r + 26) + 6) and so client i cannot send m? before time t1 + 2(nc(r + 26) + 6)
i.e. t > t2 > ti + 2(nc(r + 26) + 6). Since m2 is the first duplicate request, there
are at most flc --H 1 requests ahead of mi in 5o'5 queue Therefore, from Figures 6.2
and 6.3 , so will dequeue mi by time t1 + 6 + (nc --H 1)(r + 26) < t. Therefore,
t9 ? t1 + 6 + (nc --H 1)(r + 26) which is a contradiction. E
Lemma 6.1.8 If a request m is sent by the client at time tm to server sj, and sj
remains the primary through the interval [trn..trn + pj, then m is received by sj by
time tm + p.
Proof: Since the message delivery time is at most 6, m is enqueued at sj by
timetrn+6. Ifj = 0, then from Lemma6.1.7, there can be at most n?--H 1 requests
ahead of m in 5?'s queue. Therefore, from Figures 6.2 and 6.3, m is received by
time tm + (nc --H 1)(r + 26) + 6 < tm + p.
On the other hand, if 5 = 1, then from Figure 6.2, m is received by time
tm +6< tm + p.			E
Lemma 6.1.9 Let 5o call fai1ure--Hmode?primary at time ? 0 and stateo(? 0) =
ro ..... . r?. Let m? be the request corresponding to r?. Then m? must have been
received in the interval [max(0, ? 0 --H r --H 26). .? 0]
113
Proof: From Lemma 6.1.2, so is correct and did not deliver
(echo, (log, 5o, (r?, --H)), 51,--H). Suppose for contradiction that m? was received at
time t < ? 0 --H --H 26. Since so did not deliver (echo, (log, so, (r?, --H)), 51,--H), it
must have called failur?mode--Hprimary by time t + T + 26 < ? 0, a contradic-
tion.			E
Lemma 6.1.10 Let so halt and Final0 = r0 0 r1 ... r?. Let m? be the request
corresponding to r?. If r? was not sent to the client, then m? must have been
sent in the interval [max(O, ? 0 --H r --H 26 --H p)..? 0] and received in the interval
[max(0,? 0--H r --H26)10]
Proof: From Lemma 6,1.2, so is faulty and could not have called failure--H
mode--Hprimary. Suppose for contradiction that m? was sent in the interval
?..l 0 --H 7 --H 26 --H p). From Lemma 6.1.8, m? must have been received by time
t < I 0 --H r --H 26 If ?,,?]7 = t, then let t? = t + r, else let t' = t. Since
so has not halted at time t + ? + 26 < I 0, so could not have fifo-delivered
(echo, (fault, 5o, ?t;1 r), 51,--H). Since 51 is correct, this implies that ?1 must
have fifo-delivered (init, (alive, so, [?`] r), so, --H) and therefore also fifo-delivered
(init, (log, so, (r?, --H)), so,--H) (by Lemma 5.3.2), which implies that r? must have
been sent to the client, a contradiction.
For the second part, if m? is received at some time t < I 0 --H 7 --H 26, then this
is a contradiction as proved in the previous paragraph. Therefore m? can be only
received in the interval [max(0, I 0--H7 --H 26)10]. E
Lemma 6.1.11 For any run o'PB of the primary-backup system there exists a run
114
op ofthe (p,2,2T + 5?+ p) bofo server P such that Vj : op ? op?.
Proof: Case 1: Prmy30 is always true in 0PB and 5o never calls failure--Hmod?
primary. If ?o makes infinitely many broadcasts in 0p?, then from Figure 6.6,
all requests have responses. Therefore, assume that all the requests m that si
responds to are responded to in op as well. m is received at the same times in
0PB and op. However9 m is enqueued at the same time it is received in op (this
has to be true of all runs of P). \?`e can do this from Lemma 6.1.8. Furthermore,
in op the response r has to be sent at the same time the request m is received9
whereas in op?, r might be sent after ? time units by si. However, r has to be
received at the same times by the clients in op? and op. We can do this because
r is received by the client within 2? after m is received, and p > 2?. However, if so
makes only finitely many broadcasts, then again from Figure 6.6, the last request
might not have a response. Assume that the last request m? is received by 5o at
time t. Let the first (and only) outage interval of op be [t..tl and assume in op
that the request rn is received, but a response never sent.
Case 2: Prmy?0 is always true in op?, but 5o calls failure--Hmode--Hprimary
at time ? 0. We construct op until time ? 0, and then show how to extend it.
Let Final0 = ro 0?1... r??i 0 r?. Again, as in case 1, the same responses are
sent in both 0PB and op. However, as can be seen from Figures 6.2 and 6.3,
r??i and r? might not have been sent to the client. Let m??1 and m? be the
requests corresponding to r??1 and r? respectively, which were received at times
tn?1 and tn respectively. Let the two outage intervals of op be li = [tn?i..tn?ij
and 12 = [max(O,? 0 --H r --H 2?)..? 0]. If r??1 was not sent to the client, assume
that in op, inn--Hi was received, but a response never sent (outage belongs to li).
115
Similarly for r? since by Lemma 6.1.9, rn? must have been received in `2
ap can be easily extended beyond ? 0 because all requests received after ? 0
will have a response.
Case 3: so halts at time ? 0 in apB. We first construct the run ap until time
? 1, and then show how to extend it.
By Lemma 6.1.4, si becomes the primary at time t 1 < ? 0 + r + 2?. As in
case 1, all requests sent in the interval [0..? 0] that have responses by si in aPB
are responded to in ap as well. Let Final0 = roori... r??1 or?. By Lemma 6.1.6,
?n might not have been sent to the client. As in case 2, let m? be the request
corresponding to r? which was received at time tn by so. Let the only outage
interval of ap be li = [max(0,l0--Hr--H2?)..t 1]
From Lemma6.1.6, mit1 = R = roori r??1 or mit1 = `I = roori .. r??1o
r?. If mit1 = R, then r? was not sent to the client and from Lemma 6.1.10,
m? was sent in the interval [rnax(0, ? 0 --H T --H 2? --H p)..? 0]. We now show that
Initi = statep(t 1). From Lemma6.1.6, roori . . .r??1 must have been sent to the
client and from construction, these responses are sent in ap as well. Therefore, we
are done if we can show that P receives no further requests until time $ 1. Assume
in ap that m? is never enqueued (outage belongs to li) or hasn't yet been enqueued,
depending on the time m? is actually sent. Futhermore, all other requests that
were sent in the interval [0..t 1] that haven't been received in ap?[t 1] must have
been sent in the interval [? 0 --H p..? 1] (Lemma 6.1.8). Assume, therefore, that in
ap these requests are never enquened (outage belongs to Ji) or haven't yet been
enqueued in ap. Therefore P receives no further requests and mit1 = statep(t 1).
On the other hand, if mitt = M, then r? must have been sent to the client
116
and it can again be shown that Init1 = statep(t 1).
We now extend ap. Extend li as li = [max(O,? O--Hr--H2?)..t 1+?+p1 There-
fore, 11i1 < 2r+5?+p(fromLemma6.1.4). We first show that Vj : queue??,p(t 1) =
?. By the above construction of ap, all requests sent before ? 1, other than rfl?,
that are received in ap?[t 1] are received in ap as well. Also, m? is not enqueued
at time ? 1. Finally, all requests sent before t 1 that are not received in ap?[$ 1]
must must have been sent in [? 0 --H p..t 1] (by Lemma 6.1.8). Assume that these
are never enqueued in ap (again outages lies in li) or haven't yet been enqueued.
Thus, Vj : queuec??p(t 1) = ?. Furthermore, requests sent before ? 1 to so will
never be enqueued after ? I in ap
It is now easy to extend ap. Assume now that all requests that clients send in
[t 1..t 1 +? to ?o are never enqueued in ap. This outage again lies in li. However,
any respose that 51 sends in ap? after t 1 is sent in ap as well. We can do this
since mit1 = statep($ 1), queuec?,p(? 1) = = queuec?,?0(t 1) and no request
sent to so is enqueued after ? 0 in ap . Since, by Lemma 6.1.3, 51 is correct and
will never halt, we are done.			E
Theorem 33 The protocol in Figures 6.1,6.2, 6.3 and 6.4 satisfies Pb4.
Proof: Follows from Lemma 6.1.11.
E]
Theorem 34 The protocol in Figures 6.1,6.2, 6.3 and 6.4 satisfies Pb5.
Proof: A server continues to be the primary until it halts. The result now
follows from Lemma 6.1.2.			c
117
Theorem 35 The protocol in Figures 6.1,6.2, 6.3 and 6.4 is ?-bThcking.
Proof: Follows from the description of the protocol.
6.2 General--HOmission Failures
c
In this section, we give the ?--Hblocking protocol tolerating general--Homission failures.
The protocol is shown in Figures 6.5,6.6, 6.7 and 6.8, and for this protocol f = 1 and
= 2f+ 1 = 3. The procedures initialize, deliver and failure--Hmode--Hprimary
are the same as those for send?mission failures. As required by Corollary 6.0.2,
this protocol is ?pass-the-buck"
initialize(i)
cobegin
II if i = 0 then primary() else if i = 1 then backupl() else backup2()
II delivery--Hprocess(i)
II failure--Hdetector(i)
coend
Fi ure 6.5: Pr?tocol run by server 5j for generJ??mission failures(n? = 3, f = 1)
There are three servers--Hs0, si and ?2 As we will see below, in this protocol,
only so and ?2 can become primaries. Initially, 5o executes the procedure primary,
51 executes the procedure backupi and ?2 executes the procedure backup2. Since
the protocol is "pass the buck", initially the backup 5i sends responses to the
clients. Furthermore, since si can never become the primary, it does not need to
keep its state up-to-date with the state of the primary.
As long as there are no failures, the procedure broadcast continues to return
the status of OK. However, if a status of NOTOK is returned, then it can be
118
seen that either so or si must be faulty, which implies that s2 is correct. The
protocol now ensures that s2 becomes the new primary by calling failure--Hmode--H
primary. This is done as follows. When broadcast returns a status of NOTOA',
?o calls the procedure inform--Hnew--Hprimary, and sends a takeovermessage. If
it receives an acknowledgement to this message from s2, then it knows that ?i will
become the primary by calling failure--Hmode--Hprimary in Figure 6.6. If it does
not receive an acknowledgement, then it knows that it is faulty (since s2 is correct
as argued above), and halts itself. In this case, 52 will again become the primary
as it will detect that so has halted in Figure 6.8. Once 52 becomes the primary, it
continues to be the primary since it is correct.
6.2.1 Proofofcorrectness
now formally show that the protocol in Figures 6.5,6.6, 6.7 and 6.8 satisfies
Pb1--HPb5.
Define:
o+ Prmy?0 =? status = OK A ?o has not halted
0 Prmy31 =A false
A
0 Prmy82 = (Fault [s0] = true V delivered (tak:ecver,?o,--H)) As2 has not
halted.
Theorem 36 The protocol in Figures 6.5,6.6, 6.7 and 6.8 satisfies Pbl
119
procedure primary() /? server so
cobegin
II inform--Hciients("Dest = so")
status			OK
do forever
when received request from client c
response			fl(stateo,request)
stateo stateoo response
status :=broadcast((log, 5?, (response,c)),0)
if status = NOTOK then inform--Hnew--Hprimary()
od
coend
procedure backupl() /? server si
do forever
((tag,?o,(r,c)),--H) := Deq(Rqueuei)
/? responding to client ?/
if tag = log and fifo?delivered (init, (tag,5o,(r,c)),?o,--H) then
send r to client c
od
procedure backup2() /* server ?2
do forever
((tag,5o, (r,c)),--H) := Deq(Rqueue2)
/? logging response from primary ?/
iftag = log then state? := state?0r
od
1* becoming the primary ?/
if Faulty2[so] or delivered (takeover, so,--H) then
faiiur?mode--Hprimary(2)
Fi?ure6,6:Proceduresfor?en&al?missionfailures(n= 3 = 1 -i
120
procedure inform--Hnew--Hprimary()
time			current--Htime
fifo-broadcast( init (takeover, so, time), so, time)
if not fifo-dejivered (echo, (takeover, so, time), 52, time)
by time + 2? then stop
else loop forever
procedure broadcast(i'l, k)
time			current--Htime
fifo-broadcast(init, ?`l, 5k? time)
if by time + 26 not fifo-delivered (echo, Al, 5k+1, time) then
return(]VOTOK)
else return(OK)
procedure delivery-process(k)
cobegin
II fifo-delivery-process(k)
do forever
(tag, ?`l, --H, t) :=Deq(Dqueue?)
if tag = init then fifo-broadcast (echo, M, 5k, t)
if tag = echo and not delivered M before then
deliver (M, k)
od
coend
Figure 6.7: Procedures for generat?omission failures(n? = 3, f = 1)-ji
121
procedure failure-detector( k)
Vi,j : A,1 rn--H (alive, sj,1r)
Vi,):FJ=?(fault, Sj, 17)
cobegin
I for 1			0 to oc
when current-time =
fifo-broadcast(init, A'k, sk, 17)
I for 1			0 to cc
when current-time = 17 + ?:
Vj : if not fifo-delivered (init, A,?, sj, 17) then
fifo-broadcast (echo, FJ, sk, 1r)
fort =0 to cc
when current--Htime = ir + 2?:
witness?[k] := fsjlfifo-dehvered (echo, A1k, Sj, 1r)?
Vi $ k : witness?[j1
fsjlfifo-delivered (echo, A31, Sj,
or fifo-delivered (echo, FJ, Sj, 17)t
if 3j : Iwitncss?[j?I < 2 then stop
if ?j : not delivered A31 then Fau1ty?[sj1 := true
coend
Figure 6.8: Procedures for general?mission faiUres(n? = 3, f = 1)-iji
Proof: We show that ?(Prmys0 A Prmy?2) at any time. Suppose that Prmy?2
is true at some time t.
Case 1: Prmy?2 becomes true for the first time at ir + 26 ? t because
FaultY2 [so] = true at that time (Figure 6.8). s2 must have fifo--Hdelivered at least
two copies of (echo, F01, --H, --H), which implies that ?o could not have fifo--Hdelivered
two copies of (echo, A01, --H, --H) because n? = 3. Therefore, from Figure 6.8, so would
have halted by time 1r + 26. Thus Prmy?0 cannot be true at time t> 1r + 26.
Case 2: Prmy?2 becomes true for the first time at t' < t because it deliv-
ered (takeover, so, --H) at time t'. This however implies that so must have called
inform-new-primary (Figure 6.7) and so status --H YOTOK (Figure 6.6) by
122
time t' and so Prmy?0 cannot be true at time t > t!.
Theorem 37 The protocol in Figures 6.6, 6.7 and 6.8 satisfies Pb2.
Proof: Follows from Figure 5.2.
E
Since so can become the backup and still not halt, in order to satisfy Pb3, we
need some mechanism such that no requests are enqueued at so after it becomes
the backup. So we make the assumption that such a mechanism exists and discards
all requests that arrive at a backup. Pb3 now holds.
Theorem 38 The protocol in Figures 6.6, 6.7 and 6.8 satisfies Pb3.
Proof: Follows from the above assumption.
Lemma 6.2.1 If so broadcasts b before b', then ?2 cannot deliver b' before b.
E
Proof: Suppose ?2 delivered b' before b and ?2 fif0-dehv&ed (echo, b', 5j, --H) from
5j. This implies (from Figure 6.7) that 5j must have fifo-delivered (init, b', so,--H)
from 5o Since ?o flfo-broadcasted (init, b, so, --H) before (init, b?, so, --H), by Lemma
5.3.2 5j must have fifo-delivered (init, b, so, --H) before (init, b', so, --H), and there-
fore must have fifo-broadcasted (echo, b, 5j, --H) before (echo, b', 5j, ). Again by
Lemma 5.3.2, ?2 must have fifo-delivered (echo, b, 5j, --H) before (echo, b', Sj, --H) and
hence delivered b before b'.			E
Lemma 6.2.2 No correct server ever halts.
123
Proof: For contradiction, assume that 5j is the first correct server to halt at
time t.
Case 1: 5j halted in Figure 6.8. This implies that t = Jr + 26. First suppose
witnessi[i] ? 2. However, since 5j is correct and at least one of the other two servers
Is correct (say sj) and hasn't halted by time t = Jr + 6 (si is the first correct server
to halt), 5j will fifo-deliver (echo, A?, 5j, --H) and (echo, A12, s?, ) and so cannot halt.
Similarly if witnessj[k] ? 2, k ? i, then 5j will fifo-deliver (echo, ?A1k, F1l 5
and (echo, (A1k, F1l 5			) and so cannot halt.
Case 2: 5j halted in inform--Hnew--Hprimary in Figure 6.7 and so i = 0. so
called inform--Hnew--Hprimary because it did not fifo-deliver (echo, M,s1, --H) for
some A?I, and 50 51 is faulty (so is correct by assumption). Hence, ?2 is correct and
so so will fifo-deliver (echo, (takeover, so,--H), ?2, --H) and cannot therefore cannot
halt.			0
Lemma 6.2.3 If ?2 becomes the primary, then ?2 is correct.
Proof: For contradiction, assume that ?2 is faulty and becomes the primary at
time t.
Case 1: ir + 26 = t because Fautty2[so] = true at time Jr + 26. ?2 must
have fifo--Hdelivered at least two copies of (echo, Fol, --H, ?), which implies that so
could not have fifo--Hdelivered two copies of (echo, Alo, --H, --H) because ?s = 3 Thus
so 5o would have halted by time Jr + 26 and from Lemma 6.2.2, 5o is faulty, a
contradiction since f = 1.
Case 2: ?2 delivered (takeover, so,--H) at time t. Thus, so did not fifo-deliver
(echo, A/I, 51,--H) for some i? in Figure 6.7 and so either si or ?o is faulty, again a
124
contradiction since f = 1.
Lemma 6.2.4 If Prmy?0= false at time t, then Prmy?2= true by time t+r+4?.
Proof: If so halts at time t, where (1 l)r ? t < ir, then by Lemma 6.2.2,
so is faulty and 52 is correct and 50 ?2 will never halt, ?2 will not deliver A01 and
become the primary by time 17' + 26 < t + ? + 26.
On the other hand if status = NOTOK at time t, then so did not fifo-deliver
(echo, i'l, si,--H) for some Al and so either ?1 or 5o is faulty. Therefore 52 is correct
and will never halt by Lemma 6.2.2. If so does not halt by time t+26 in Figure 6.7
then ?2 delivered (takeover, ?o, --H) by t + 26. On the other hand, if ?o halts by
time t + 26, where (1 --H l)r < t + 26 < l?, then again ?2 will become the primary
by time lr + 26 < t + r + 46.			c
Lemma 6.2.5 Let ?2 become the primary at time ? 2. Then if so sent sequence of
broadcasts bo 0b1... bn then
I ?2 must have delivered bo 0 bi ... bn?i by time t 2.
2. if bn = (log,so,(rn,c)) and r? was sent to the c, then ?2 must have delivered
bn by time $ 2.
Proof: If s2 becomes the primary, then by Lemma 6.2.3, ?2 must be correct.
Case 1: ? 2 = ir + 26 because ?2 did not deliver A01. If so had broadcasted
bi, i = ... . n --H 1, by time 17', then from Figure 6.7, both so and si must have
fifo--Hbroadcasted (echo, bj, --H, --H) by time ir + 6, and since at least one of ?o and
125
Si is correct and ?2 is also correct, ?2 must have fifo--Hdelivered at least one copy of
(echo, bj, --H, --H) by time ir+2? and we are done by Lemma6.2.l. Similarly, if r? was
sent to the client, then both ?o and ?1 must have fifo-broadcasted (echo, bn, --H,
and ?2 must have delivered this.
On the other hand, if ?o broadcasted bj, i = ... . n --H I after time IT, then
so must have fifo--Hbroadcasted (init, A0?, so, lr) earlier, and since 51 fifo--Hdelivered
(init, bi, so, --H) it must have fifo--Hdelivered (init, A01, so, iT) (by Lemma 5.3.2).
Thus, both so and s1 must have fifo--Hbroadcasted (echo, A?, --H ,lT), which implies
that ?2 must have delivered A01, a contradiction. Similarly for bn.
Case 2: 52 delivered (takeover, so, --H) at time t 2 and ?2 fifo-delivered
(echo, (takeover, so, --H), sj, --H) from sj. Since so fifo-broadcasted (init, bi, so,--H),
i = 0. .. n, before (init, (takeover, ?o, --H), ?o, --H), by Lemma S.3.2,sj must have
fifo-broadcasted (echo, bi, si, --H) before (echo, (takeover, so,--H), si, --H) and ?2 must
have fifo-delivered them in the same order (again by Lemma 5.3.2). c
Lemma 6.2.6 Let ?2 become the primary at time t 2 and Prmy?0= false for the
first time at ? 0 < t 2. Then
1. mit2 E Final0
2. If Finalo = M0 and R0 # e, where Ro C M0 and IiNlol --H P0i = 2, then Ro
was sent to the client.
3. If Init? E Final0, then last(Finalo) was not sent to the client.
Proof: Let Final0 = ro orl... 0 r?. If so broadcasted (log, so, (r?, --H)) then
it must broadcasted (log, ?o, (r?, --H)), i = ... . n, and therefore from Lemma 6.2.5
126
(log,so,(ri,--H)), i = O...n--Hl., must have been delivered at 52 and (1) is satisfied.
If (log, 5o, (r?, --H)) was not delivered at 52, then r? was not sent to the client
(Lemma 6.2.5) and (3) is satisfied.
If ?o does not broadcast (log, ?o, (r?, --H)), then it must have halted. Therefore,
from Lemma 6.2.2, so is faulty and so ?i is correct (f = 1). ?o must have broad-
casted (log, so, (ri, --H)), i = ... . n --H 1 and since si is correct, r??1 must have been
sent to the client (from Figures 6.6 and 6.7). (1) now follows from Lemma 6.2.5.
(3) is trivially true because 5o does not broadcast (log, so, (r?, --H)) and so r? could
not have been sent to the client.
(2) can be seen to be true from Figure 6.6 because so must have fifo-delivered
(echo,(log,so,r?),s1,--H), i = 0...n--H1 and so si musthavesentr?,i =
to the client.			c
Let p = n?(26) + ?.
Lemma 6.2.7 If ?o remains the primary through time t, then there can be at most
one request from any client i in the queue of ?o at time t.
Proof: Assume, for contradiction, that t2 < t is the first time that there are two
requests (say mi and m?) from some client i in the queue of so. Assume mi was sent
at time t1 and so it was enqueued by ti + ?. From Figure 5.2, 2p = 2(nc2? + ?) and
so client i cannot send m2 before time t1 +2(nc2?+?) i.e. t > t2 > t1 +2(nc2?+6).
Since m2 is the first duplicate request, there are at most ?c --H 1 requests ahead of
?i in 5o'5 queue Therefore, from Figures 6.6 and 6.7 , 5o will dequeue mi by time
ti+?+(nc--H1)2? < t. Therefore, t2 < ti+?+(nc--Hl)2? which is acontradiction. C
127
Lemma 6.2.8 If a request Tn is sent by the client at time tm to server sj, and sj
remains the primary through the interval [tm..tm + pj, then m is received by 5? by
time tm + p.
Proof: Since the message delivery time is at most ?, m is enqueued at s? by
time t? + ?. If j = 0, then from Lemma 6.2.7, there can be at most n? --H 1 requests
ahead of m in s?'s queue. Therefore, from Figures 6.6 and 6.7, m is received by
time tm + (nc --H 1)2? + 6 < tm + p.
On the other hand, if I = 2, then from Figure 6.2, m is received by time
tm +6 ? trn + fi.			E
Lemma 6.2.9 Let Prmy?0= false for the first time at ? 0 and Final0 = ro 0
.... . r?. Let Tn? be the request corresponding to r?. If r? was not sent to the
client, then Tn? must have been sent in the interval [max(0, ? 0 --H 26 --H p)..l Ol and
rece?ved in the interval [max(0, ? 0 --H 26). .? 01.
Proof: Suppose for contradiction that Tn? was sent in the interval [0.. ? 0 --H
26 --H p). From Lemma 6.2.8, Tn? must have been received by time t < ? 0 --H
26 by so, and since Prmy?0= true at time t + 26 < ? 0, so must have received
(echo, (log, ?o, r?), si,--H) and status = OK. Since r? was not sent to the client,
si must have halted by time t + 6 < ? 0. Therefore, form Lemma 6.2.2, ai is
faulty and so 5o is correct and can never halt. Since ?o makes no broadcasts after
(log, 5o, (r?, --H)), the only way that Prmy?0= false at time ? 0 is that so halts at
time ? 0, a contradiction.
Now if Tn? was received at some time t < ? 0 --H 26, then this is a contradiction
as shown in the previous paragraph. Therefore, Tn? can only be received in the
128
interval [max(O, ? 0 --H 26)..$ o?.
Lemma 6.2.10 For any run aPB of the primary?backup system there exists a run
?p of the (p, 2, r + 7? + p) bofo server P such that Vj : ?p ?? ap?.
Proof: Case 1: Prmy30 is always true in aPB. If so makes infinitely many
broadcasts in ap?, then from Figure 6.6, all requests have responses. Therefore,
assume that all the requests m that si responds to are responded to in ap as well.
m is received at the same times in aPB and ap. However, m is enqueued at the
same time it is received in ap (this has to be true of all runs of P). We can do
this from Lemma 6.2.8. Furthermore, in ap the response r has to be sent at the
same time the request m is received, whereas in apB, r might be sent after 6 time
units by si. However, r has to be received at the same times by the clients in ap?
and ap. We can do this because r is received by the client within 26 after Tn is
received, and p > 26. However, if ?o makes only finitely many broadcasts, then
again from Figure 6.6, the last request might not have a response. Assume that the
last request Tn? is received by so at time t. Let the first (and only) outage interval
of ap be [t. .t] and assume in ap that the request m is received, but a response
never sent.
Case 2: Prmy?0 is false for the first time at time $ 0 in ap?. We first construct
the run ap till time t 2, and then show how to extend it.
By Lemma 6.2.4, s2 becomes the primary at time t 2 ? ? 0 + T + 46. As in
case 1, all requests sent in the interval [0..? 0? that have responses by ?1 in oPB
are responded to in ap as well. Let Final0 = ro on... r??1 0 r?. By Lemma
6.2.6, r??i and r? might not have been sent to the client. Let Tnn?1 and rn? be
129
the requests corresponding to r??1 and r? respectively and these requests were
received at times tn--Hi and tn respectively by se. Let the two outage intervals of
ap be Ii = [tn?i..tn?i] and `2 = [max(O, ? 0 --H 26)..t 2]. If r??i was not sent to
the client, assume that in ?p, ?n--HI was received, but a response never sent. This
outage belongs to I?.
From Lemma6.2.6, mit2 = R = roori . . . r??i or mit2 = Al = roori . . r??io
T?. If mit2 = R, then from Lemma 6.2.6, r? was not sent to the client and from
Lemma 6.2.9, m? was sent in the interval [rnax(0, ? 0 --H 26 --H p)..? 0]. ?`e now
show that mit2 = statep(? 2). From Lemma 6.2.6, ro 0 ri ... ?n?2 must have been
sent to the client and from construction, these responses are sent in ?p as well.
Furthermore, as argued before, m??i is always received in ?p. Therefore, we are
done if we can show that P receives no further requests until time t 2. Assume in
ap that m? is never enqueued (outage belongs to `2) or hasn't yet been enqueued,
depending on the time m? is actually sent. Futhermore, all other requests that
were sent in the interval [0..$ 2] that haven't been received in ap?[t 2] must have
been sent in the interval [? 0 --H p..t 2] (Lemma 6.2.8). Assume, therefore, that in
ap these requests are never enquened (outage belongs to `2) or haven't yet been
enqueued in ap. Therefore P receives no further requests and ?nit2 = statep($ 2).
On the other hand, if Init2 = i? and r? is sent to the client, we can similarly
show that the states are equal. On the other hand, if r? is not sent to the client,
assume that in ap, m? is received at time t where ? 0 --H 26 < t < ? 0 (from Lemma
6.2.9) but a response never sent (outage belongs to `2).
We now extend ap. Extend `2 as `2 = [?ax(0,? 0--H26)..t 2+6+p]. Therefore,
1121 < T + 76 + p (from Lemma 6.2.4). We first show that Vj : queue??,p
130
By the above construction of ap, all requests sent in [0. .? 2] other than m?, that
are received in ap?[t 2] are received in ap as well. Also m? is not enqueued at
time ? 2. Finally, all requests sent in [0..t 2] that are not received in ap?[? 2]
must must have been sent in [I 0 --H p..$ 2] (by Lemma 6.2.8). Assume that these
are never enqueued in ap (again outages lies in `2) or haven't yet been enqueued.
Thus, Vj : queuec?,p($ 2) = ?. Furthermore, requests sent before ? 2 to so will
never be enqueued after ? 2 in ap
It is now easy to extend ap. Assume now that all requests that clients send in
[$ 2..$ 2+? to so are never enqueued in ap. This outage again lies in 12. However,
any respose that ?2 sends in ap? after $ 2 is sent in ap as well. "Ve can do this
since Jnit? = statep(t 2), queuec?,p(t 2) = = queuec1,?0(t 2) and no request
sent to 5o is enqueued after ? 0 in ap . Since, by Lemma 6.2.3, ?2 is correct and
will never halt, we are done.			E]
Theorem 39 The protocol in Figures 6.5,6.6, 6.7 and 6.8 satisfies Pb4.
Proof: Follows from Lemma 6.2.10.
0
Theorem 40 The protocol in Figures 6.5,6.6, 6.7 and 6.8 satisfies Pb5.
Proof: If so is correct, then by Lemma 6.2.2, it never halts. Therefore, so can
become the backup only if it does not receive a (echo, i?, 51,--H), which implies
that ?i must have failed.
On the other hand, if ?2 becomes the primary and is correct, then again by
Lemma 6.2.2, it cannot halt and continues to be the primary. 0
131
Theorem 41 The protocol in Figures 6.5,6.6, 6.7 and 6.8 is ?-bThcHng.
Proof: Follows from the description of the protocol.
6.3 Receive--HOmission failures
c
Finally, in this section, we give the ?--Hblocking protocol tolerating receive--Homission
failures. The protocol is shown in Figures 6.9, 6.10, 6.11 and 6.12, where f = 1
and ?s = 2f = 2. The procedures initialize, deliver and delivery--Hprocess are
the same as those for send??mission failures. This protocol works for all values of
D and V and, as required by Corollary 6.0.3, is "pass-th&buck". Furthermore, as
required by Theorem 10, this protocol has a run in which a correct primary cedes
to a faulty backup. We assume for this protocol that r > 4?.
initialize?(?)
cobegin
if i = 0 then primary() else backup(i, 0)
II delivery--Hprocess(i)
coend
Fi ure 6.9: Protocol run b server 5j for receive--Homissipn failures(n? = 2, f = 1)
There are two servers--Hs0 and si. Initially, so executes the procedure primary
and ?1 executes the procedure backup.
In the procedure primary, the time interval r is divided into two disjoint
intervals (however, it turns out that we do not require the intervals to be completely
disjoint and can use r > 2?). In the first interval, the primary sends a alive
132
procedure primarY()
cobegin
II inform--Hclients("Dest = so")
I pstatus			OK
Isprimary[O, 1]			true
7* the alive message is being sent now *7
for			0 to oc do
when current--time = 1T: fifo-send(init,A01, 5o,
7* check if failure has occured ?7
when current--time = ir + 26:
if not fifo-delivered (echo, A?0, si,IT) then
detected-failure(O, 1)
now process messages from the client *7
time current-time
while (U + 26 < time < (1 + 1)r --H 26) and pstatus = OK
if received request from client c then
response := fl(stateo, request)
stateo := stateo 0 response
pstatus :=broadcast ((log, so, (green, response, c)), 0)
time := current--Htime
coend
procedure detected-failure(k, i)
Isprimary[0, 1] := false
if delivered (takeover,s1, iT + 6) then backup(k, iT + 26)
else failure--Hmod?primary(k, (i + l)T)
Fi ure 6.10: Procedures for receive?mission faiiures(n = 2, f = 1)-i
133
procedure backup(k, switch)
j			k
bstatus			OI?
cobegin
while bstatus = OK
((tag, --H, (color, r, c)), --H) :=Deq(Rqtieue?)
/? synclironizing with new primary ?/
if tag = mylastlog then
if r E state? then
if r = last(state?) then skip
else state? state? \ last(state?)
else state? := state? 0 r
/? logging response from primary */
if tag = log then
state? := state? 0 r
if color = green then send r to client c
/? detecting primary failures */
for 1 := 0 to oc do
when current--time = lr + switch + ?
if not delivered AjI?+?W?tCh then
bstatus			NOTOK
time			current--Htime
fifo--Hsend (init,(takeover,s?, time), 5k, time)
wait until current--Htime = time + ?
failure--Hmod?primary( k, time + ?)
coend
Fi ure 6.11: Procedures for receive?omission failures(n? = 2, f = 1)-ii
134
procedure failure-mode--Hprimary(k,first)
start			current4ime
cobegin
II inform--Hclients( "Dest =
II Isprimary[k,2] true
R (red, tast(state?), --H)
fifo--Hsend(init, (mylastlog, 5k? R), s?, current--Htime)
do forever
when received request from client c sent after start
response := fl(state?, request)
state? := state? 0 response
R := (red,response,c)
broadcast (init, (log, 5k, R), 5k, current--Htime)
send response to client c
od
I for 1 := 0 to oc do
when current-time =first + ir:
fifo-send (init ,Af?irst+Ir
coend
5k, first + ir)
procedure broadcast(M, k, type)
j :=
time := current--Htime
fifo-send(init,M, 5k? time)
if by time + 2? not fifo-delivered (echo,M, sj, time) then
return(NOTOK)
else return(OK)
Fi?ure6.12:Proceduresforreceive?omission failures n? = 2, = 1 -iii
135
message and waits for the corresponding acknowledgement to arrive. If the ac-
knowledgement arrives, the primary processes requests from the clients in the sec-
ond interval.
In a failure-free run of this protocol, since the backup responds to the client,
the primary forwards any response to the backup (with a green tag as shown in
the protocol) and the backup sends this response to the client. However, if there
is a failure, then the primary responds to the clients. In this case, the primary will
forward a response to the backup with a red tag. The backup does not forward a
response to the client if the response has a red tag.
Whenever so receives a request from the client, it computes a response r
changes state, and sends r to 51 with a green tag. Upon receiving this message,
?1 updates its state, acknowledges to so, and then sends r to the client. Because
it is the backup that responds to the client, the protocol is ?--Hblocking. Server ?o
processes a new request only after receiving the acknowledgement from 51 for the
previous request.
Suppose that so does not get s1'? acknowledgement for some response r, and
therefore broadcast returns a status of iYOTOK (the argument is similar if no
acknowledgement is received for an alivemessage). There are three possibilities:
(1) ?1 has crashed, (2) Si omitted to receive the response r and so did not send
the acknowledgement, (3) so omitted to receive the acknowledgement. 5o now
waits until it is supposed to send the next alive message. 5o sends this alive
message and waits for an acknowledgement. We now consider the above three cases
separately.
136
Case 1: 51 has crashed. As a result, 50 will not receive the acknowledgement to
the alivemessage and, therefore, will call procedure detected--Hfailure. Further-
more, since 51 has crashed, ?o will not deliver the takeover message either, and
so s? again becomes the primary by calling procedure failure--Hmode--Hprimary.
From now on, whenever 5o receives a request from a client, it computes the re-
sponse r, sends r to 51 with a red tag, and then sends the response r back to
the client. Also, 5o continues to send alive messages. Since 5o is correct, it can
continue like this forever.
Case 2: 51 is faulty and omitted to receive the response r. By the property of
fifo--Hsend and fifo--Hdeliver in Lemma 5.3.2, 5i will not receive the alivemessage
that 5o sends, and so does not send the corresponding acknowledgement. Server
?1 concludes that 5o has crashed, sends a takeover message to 5o and becomes
the primary by calling failure--Hmode--Hprimary. After that, it behaves like ?o in
case 1 above (including sending alive messages to so). Since ?o will not receive
the acknowledgement (none was sent), it again calls procedure detected--Hfailure.
However, since 5o is correct, it receives the takeovermessage (as opposed to case
1) and so it becomes the backup. Also, since 5o is correct, it will not omit to receive
the response messages that ?1 sends and so 5o keeps its state consistent with si.
Subsequently, if ?o stops receiving alive messages from ?i, then s? has crashed
and so becomes the primary once again by calling failur?rnode--Hprimary.
Case 3: ?o is faulty. Since si is correct, it receives the alive message from
5o, sends the corresponding acknowledgement and remains the backup (as opposed
to case 2). However, by the property of fifo--Hbroadcast and fifo--Hreceive in
Lemma 5.3.2, 5o will not receive this acknowledgement to the alivemessage, and
137
so again calls detected--Hfailure. It will not receive the takeover message either
(none was sent), and so it behaves as in case 1, and again becomes the primary
by calling failure--Hmode--Hprimary. Similar to case 2, si receives all response
messages that so sends and so its state is consistent with so Finally, si becomes
the primary if it stops receiving alive messages from ?o
Case 2 shows the run required by Theorem 10. In this run, the correct primary
?o cedes to a faulty backup si.
6.3.1 Proofof Correctness
?Ve now formally show that the protocol in Figures 6.9, 6.10, 6.11 and 6.12 satisfies
Pbl--HPb5.
Lemma 6.3.1 Let j = k. If sj fifo-sends m and 5k is correct, then 5k fifo-delivers
Proof: Easy to see because the links are FIFO are there are no send?omission
failures. E
Define:
Prmy?0 = ((pstatus = OK A Isprimary[O, 1]) V Isprimary[O, 2]) A 5o has not
halted.
Prmy51 A Isprimary[1,2] A si has not halted.
Lemma 6.3.2 Let j = k. If s? calls failure--Hmode--Hprimary, then either sj is
faulty (by receive--Homission) or 5k has halted.
138
Proof: Case 1: sj calls failure--Hmode--Hprimary in detected--Hfailure in
Figure 6.10. This has to be the first time 5o does not get b = (echo, (alive, so, iT),
51, ir) at time 1r+2? in Figure 6.10. If b was sent, then by Lemma 6.3.1 ?o is faulty
and we are done. So assume that b was not sent. If 51 halted, then again we are
done. If Si did not halt and did not send b, then again by Lemma6.3.1 ?1 is fauky.
As can been from Figure 6.11, ?1 will send (init,(t?eover,s?,time),s1,time) at
time ir + 6. Since 5o is not faulty (f = 1), it will receive this message by time
ir+26 (Lemma6.3.1), and from Figure 6.10, will not callfailure--Hmod?primary,
a contradiction.
Case 2: sj calls failure--Hmode--Hprimary in backup in Figure 6.11. First
assume that j = 0. so became the backup at time switch = ir + 26 for some
i > 0 after receiving the (init,(takeover,s1, time), ?1, time) message that ?1 sent
at iT + 6. Suppose 5o called failure--Hmod?primary at time t = ir + (ir + 26) +
26 as it did not deliver (alive,s1,lr + ir + 26) from si. From Figure 6.11, ?1
should have called failur?mode--Hprimary at time iT +26, and should have sent
(alive, ?1, ir + ir + 26) for all 1 > 0. Therefore, if so does not deliver this, then
either 5o is faulty, or 51 has halted.
Assume now that j = 1. ?1 became the backup at time switch = 0. Suppose
?1 called failure--Hmod?primary at time t = ir + 26 because it did not deliver
(alive, 51, ir) from 5o, which again implies that either 5o has halted, or ?1 is faulty.
0
The following corollary now follows from Lemma 6.3.2:
139
Corollary 6.3.1 If Isprimary[O, 2] and so has not hatted, then ?Isprimary[l, 2]
or si has halted.
Proof: For contradiction, assume that Jsprimary[O, 2], Jsprirnary[1, 2] and nei-
ther so or s? have halted. However, this then implies from Lemma 6.3.2 that both
?o and ?i are faulty, a contradiction since f = 1 E
Theorem 42 The protocol in Figures 6.9, 6.10, 6.11 and 6.12 satisfies Pbl.
Proof: From Corollary 6.3.1, if Isprirnary[O, 2] and So has not halted, then
either ?Isprimary[1, 2] or s? has halted. Therefore, the only way Pbl can be
violated if Isprimary[O, 1] A Jsprimary[l, 2] and neither servers have halted. Sup-
pose they become true for the first time at t. s? would have called failure--H
mode--Hprimary at time t' = lr + 2? for some < t after it failed to deliver
(alive, ?o, lr). Consequently, so will not fifo-deliver (echo, (alive, so, lr), si, ir),
and so set Isprimary[O, 1] to false, at time t' = lr + 2?, which implies that
Jspr?mary[O, 1] = false at time t > t' a contradiction. 0
Theorem 43 The protocol in Figures 6.9, 6.10, 6.11 and 6.12 satisfies Pb2.
Proof: Trivially true.
0
Since 5o can become the backup and then become the primary again, in order
to satisfy Pb3, we need some mechanism such that no requests are enquened at so
while it is the backup. So we make the assumption that such a mechanism exists
and discards all requests that arrive at a backup. Pb3 now holds.
140
Theorem 44 The protocol in Figures 6.9, 6.10, 6.11 and 6.12 satisfies Pb3.
Lemma 6.3.3 Let j = k. Jf ?t such that Jsprimary\j, 2] for the first time at t
and ??t' < t such that Jsprirnary[k, 2] at t', then Isprimary[k, 2] cannot become
true before time t + 6.
Proof: If 5k has halted before t, then we are done. So assume that this is not
the case.
Case 1: s? calls failure--Hmode--Hprimary in detected--Hfailure. This has to
be the first time so does not get b = (echo,(alive,50,17),51,17) at time t = 17+26.
If b was sent, then si cannot set Isprimary[1, 2] before time (1 + 1)7 + 26 > t + 6
as 7 > 46. If b was not sent, then since 51 has not halted by time ir + 6, it is
faulty by Lemma 6.3.1 and will send (init,(takeover,s1,time),s1,tirne) at time
1r + 6. Since so is not faulty (f = 1), it will receive this message by time 17 + 26
(Lemma 6.3.1), and from Figure 6.10, will not call failur?mode--Hprimary, a
contradiction.
Case 2: 5j calls failure--Hmod?primary in backup. First assume that
j = 0. so became the backup at time switch = ir + 26 after receiving the
(init,(takeover,s1,time),s1,tirne) message that si sent at iT + 6. Suppose so
called failure--Hmod?primary at time t = 17+ (i7 +26) + 26 as it did not deliver
(alive,si,lr+ir+26) from si. Since Isprirnary[1,2] is not true before t, si must
have halted by time iT + 26 < t, a contradiction since we have assumed that that
si has not halted before time t.
Now assume that j = 1. Ifs1 calls failure--Hmode--Hprimary at time t = ir+26,
then so cannot call failur?mode--Hprimary before time (i7 + 26) + 26 > t + 6 0
141
Let k = 5. Define:
q[O] =
q[1] = 5? if ?t such that Isprimary[j, 2] at t and ?at' ? t such that Isprimary[k.. 2]
at t'
= I otherwise.
q[2] = 5k if q[l] = sj and 3t such that Isprimary[k, 2] at time t
= I otherwise
From Lemma 6.3.3, q[1] is well defined.
Lemma 6.3.4 Let ? 0 be the smallest of the times that q[O] halts? or pstatuso =
NOTOK or Isprimary[0, 1] = false. Then q[l] = 5k fork c fO, 11 and Isprimary
[k, 2] is true by time ? 0 + 2r.
Proof:
Case 1: 5o halts at time ? 0, where (l--H1)r ? ? 0 < lr. Then 51 will not deliver
(alive, 80, lr) and become the primary by time lr + 2? < ? 0 + r + 26 < ? 0 + 27
since r > 46.
Assume now that so does not halt at time ? 0.
Case 2: pstatuso = NOTOK because so did not fifo-deliver (echo, Al, si,--H)
for some M by time ? 0, where (1 --H 1)7+46 < ? 0 < 17. From the property of fifo-
send and fifo-deliver, ?o will not fifo-deliver (echo, (alive, so, 17), ?i, 17). Now if ?i
did not fifo-send (init,(takeover,s1,l? + 6),si,--H), then either Isprimary[0,2]
becomes true by time 17+26< ? 0+7--H26< ? 0+27, or s? halts by time 17+26
and Isprimary[1, 2] becomes true by time (1 + 1)7 + 26 < ? 0 + 27 --H 26 < ? 0 + 27.
142
On the other hand, if 51 did fifo-send (init, (takeover, 51, Ir + 6), 51,--H), then
either Jsprimary[l, 21 becomes true by time lr+26 < ? O+r--H 26 < ? 0+2r, or ?1
halts by time lT+26 and Isprimary[O,2] becomes true by time r+(lT+26)+26 ?
? 0 + 2r.
Case 3: Isprimary[O, 11 = false at time ? 0 = iT + 26 because ?o did not
fifo-deliver (echo, (alive, ?0, IT), ?1, ir). The rest of the argument is now similar
to case 2.			E
Lemma 6.3.5 Suppose q[l] = 51 and ?1 become the pfl'mary at time t 1 = lr+26
If ?1 halts at time ? 1, then
1. so would have called backup at time ? 1 and
2. so cannot set bstatus0 = NOTOK before time ? 1 + 6.
Proof: Since 51 halts, it is faulty and so 5o is correct. so would not have fifo-
delivered (echo, (alive, 50, lr), 51, ir) and would have delivered (takeover, 51, lr+
6). Thus so would have called backup at time $ 1 = lr+26. Now let ir+lr+26 <
? 1 < (i+1)T+lT+26. so wouldhavedelivered(alive,s1,i'r+lr+26) for all i' < i
and so cannot set bstatus0 = NOTOK before time (i + l)r + ir + 36 > ? 1 + 6. E
Lemma 6.3.6 Let q[1] = s?. If sj halts at time ? j, then q[2] = 5k, k = 5, and
Isprimary[k, 2] = true by time ? j + r + 26.
Proof: Case 1: j = 0 and let (1 --H 1)r < ? j < lr. Then ?1 cannot halt
= 1) and will not fifo-dejiver (alive, ?0, ir) and so become the primary by time
ir + 26 < ? j + T + 26.
143
Case 2: j = 1 and si called failure--Hmodeprimary at time 17 + 2?. From
Lemma 6.3.5, ao would have called backup by time 17 + 26. Now let i7 + 17 + 26 ?
I j < (i + 1)7 + ir + 26. Then S? will not deliver (alive, si, (i + 1)7 + 17 + 26) and
become the primary by time (i + 1)7 + 17 + 46 < I 5 + 7 + 26. c
Let k = [?;41??and p = 4k6 + 2n?6 + &
Lemma 6.3.7 If Isprimary[O, 1? = true and pstatuso = OK and so has not
halted through time t, then there can be at most one request from any client i in
the queue of so at time t.
Proof: Assume, for contradiction, that t2 ? t is the first time that there are
two requests (say mi and m?) from some client i in the queue of 5o Assume mi
was sent at time ti and so it was enqueued by ti + 6. Furthermore, since m2 is
the first duplicate request, there are at most ?c --H 1 requests ahead of mi in so's
queue. From Figure 5.2, 2p = 2(nc26 + 6 + 4k6) and so client i cannot send m2
before time t1 + 2(nc26 + 6 + 4k6) i.e. t > t2 > ti + 2(n?26 + 6 + 4k6). From
Figures 6.10 and 6.12, 5o will dequeue mi by time t1 + 6 + (nc --H 1)26 + 4k6 < t.
(The last term comes about because in procedure primary only 7 --H 46 every 7 is
used for processing requests.) Therefore, t2 < ti + 6 + (nc --H 1)26 + 4k6, which is
a contradiction.			E
Lemma 6.3.8 If a request m is sent by the client at time tm to server 5o, and
Jsprimary[O, 1J = true and pstatuso = OK through the interval [tm..tm + p], then
ifs0 has not halted by time t? + p then m is received by so by time tm + p
144
Proof: Since the message delivery time is at most 6, m is enqueued at so by
time tm + 6. From Lemma 6.3.7, there can be at most nc --H 1 requests ahead of
m in ?o'? queue. Therefore, from Figures 6.10 and 6.12, ni is received by time
tm + 6(nc --H 1)26 + 4k6 ? tm + p, and the lemma holds. c
Lemma 6.3.9 Iflsprimary[j, 2] for the first time att, then there is a time ? 0< t
such that 5o halts at ? 0, or pstatuso = AOTOK at time ? 0 or Isprimary[0, 1] =
false at time ? 0.
Proof: Trivially true if j = 0 because 5o sets Isprimary[0, 1] = false be-
fore calling failur?mode--Hprimary. Now assume that j = 1. If q[l] = si
and si calls failure--Hmode--Hprimary a time t = lT + 26, then ?o could not
have fifo-delivered (echo, (alive,5o,ir), si, ir) and would have either halted or
set Isprimary[0, 1] = false by time t. Similarly, if q[2] = si, then q[1] = 5o and
5o would have set Isprimary[0, 1] = false by time t. 0
Lemma 6.3.10 Let q[1] = si and let ? 0 be the smallest of the times that so halts
or pstatuso = ATOTOK or Isprimary[0, 1] = false (this is defined by Lemma
6.3.9). Let Final0 = .0... ?? and let m? be the request corresponding to ??. If
?? was not sent to the client, then m? must have been received in the interval
[max(O, ? 0 --H 26)..? 0] and sent in the interval [max(0, ? 0 --H p --H 26)..? 0].
Proof: Suppose in? was sent before time ? 0 --H p --H 26. Then by Lemma 6.3.8,
it must be received at time t' < ? 0 --H 26. Let lr + 26 < < (1 + 1)r --H 26. From
Figures 6.10 and 6.11, si can set bstatus1 = iVOTOK earliest by time (1 + 1)r + 6
because it would have delivered (alive, ?o, l'r) for all 1, < 1.
145
If ?1 fifo-sent (echo, (log, so, rn), ?1, --H) by time t' + ? < (1 + l)r + ?, then 51
would have sent r? to the client, a contradiction.
Ifs1 did not fifo-send (echo, (log, ?o, r?), 51,--H), then ?o would have set pstatuso
= NOTOK by time t' + 2? < ? 0, again a contradiction.
Similarly,if we assume that m? was received at time t' ? ? 0 --H 2?, then this is
a contradiction as proved in the previous paragraph. E]
Lemma 6.3.11 Let q[1? = so and let ? 0 be the smaller of the times that pstatuso =
NOTOK or Jsprimary[0, 1] = false (this is defined by Lemma 6.3.9). Let Final0 =
.0... ?? and let m? be the request correspondin? to ??. If ?? was not sent to the
client, then m? must have been received in the interval [max(0, ? 0 --H r)..? 0] and
sent in the interval [rnax(0,? 0 --H T --H p)..? 01.
Proof:
Suppose m? was sent before time ? 0 --H r --H p. Then it must have been received
at time t' < ? 0 --H r. Let lT + 2? < t' < (1 + 1)r --H
Case 1: ?o sets pstatuso = NOTOK at time ? 0 as it did not fifo-deliver
(echo, (log, so, (green, ??? --H)), s1, --H). If m? was received at time t' < ? 0 --H
then so would have set pstatuso = NOTOK by time t' + 2? ? ? 0 --H r + 26 < ? 0
because T > 46, a contradiction. Similarly, if we assume that m? was received at
time t' < ? 0 --H T, then this is a contradiction as proved earlier.
Case 2: so sets Isp.imary[0, 1] = false at time ? 0 because it did not fifo-
deliver (echo, (alive, so, I1r), ?1, lr). We show that 1' = 1 + 1. Since ?? was not
sent to the client and so fifo-delivered (echo, (log, so, (green, ??? --H)), 51,--H), 51
must have halted by time t' + 6 < (1 + 1)r. Therefore, ?o cannot fifo-deliver
146
(echo, (alive, so, (1 + l)r), 51,--H) and so 1' = 1 + 1. Therefore, ? 0 = (1 + 1)7 + 2?
and since ir + 26 < ??, ? 0 --H r ? t? and we are done. Similarly, if m? was sent
before ? 0 --H r --H p, then it must have been received at time t' ? ? 0 --H r which is
a contradiction because ? 0 --H r
Lemma 6.3.12 Let q[lj = 5k call failure--Hmode--Hprimary at time $ k and halt
at time ? k. Let Final? \ Init? = ..... r? and let m? be the request corresponding
to r?. If r? was not sent to the client, then m? must have been received at time
? k and sent in the interval [max($ k, ? k --H p)..? k].
Proof: Easy to see because 5k would have halted just after receiving Tn?. C
Lemma 6.3.13 Suppose q[1' = si and si becomes the primary at time $ 1. Let
? 0 < $ 1 be the smallest time such that s? halts, or pstatuso = A?OTOiC or
Jsprimary[0, 1] = false (this is defined by Lemma 6.3.9). Then
1. mit1 ? Final0
2. If Final0 = iklo and R0 ? e, where R0 E AI0, then R0 was sent to the client.
3. If Initi E Final0, then last(Finalo) was not sent to the client.
Proof: Let $ 1 = lr + 26 and let Final0 = .... . r?. Furthermore from the def-
inition of q[1], so could not have called failure--Hmode?primary before time $1.
Since ? 0 < lr+26, so would have sent sequence of broadcasts (log, 5o, (green, r?, --H
)), i = ... . n--H1 by time lr--H26 and fifo-delivered (echo, (log, so, (green, r?, --H)), s?,
--H). Therefore, si would have delivered (log, ?o, (green, r?, --H)) by time lr --H 6 and
147
(1) now follows from Lemma 6.1.1 and the fact that bstatus1 = NOTOK only at
time iT + 6. (2) is true as 51 would have sent r?, i = .... n --H 1 to the client by
time lT --H 6. (3) is trivially true because only 5i can send r? to the client.
Lemma 6.3.14 If q[2] = si, then q[l] = So must have called failure--Hmode--H
primary in detected--Hfailure.
Proof: For contradiction, assume that 50 called failure--Hmode--Hprimary in
backup at time t = iT + (lr + 26) + 26. Since q[1] # si, 51 did not call failure--H
mode--Hprimary at time ir + 26 and so must have halted by time iT + 26 < t, a
contradiction since si calls failure--Hmode--Hprimary after t. E
Lemma 6.3.15 Let q[2] = 51 and si becomes the primary at time $ 1. Let ?o call
failure--Hmode--Hprimary at time $ 0 and ? 0 > ? 0 be the greatest time s? was the
primary. Then
1. mit1 E Final0.
2. If Final0/mit0 = 1W0 and R0 # e, where R0 E M0, then R0 was sent to the
client.
3. if Init? E Final0, then last(Finalo) was not sent to the client.
?. If Final0 = Inito then Init1 = Final0
Proof: From Lemma 6.3.14, 50 must have called failure--Hmode--Hprimary in
detected--Hfailure and so ? 0 = lr + 26 for some 1> 0 and first = (1 + 1)T. s? is
faulty and must have halted at time ? 0. Let (i + l)r < ? 0 < (i + 1 + 1)T. Since si
148
is correct, it would have delivered (alive,so,j'r) for all 5! < i + 1, and so cannot
set bstatus? = NOTOA before time (i + 1 + 1)r + ? ? 0 + ?. Consequently,
t 1> I0+2?.
Let Init0 = .0... ??. ?o must have sent sequence of broadcasts (log, ?o, (green.
?j, --H)), i = ... . n by time ir --H 2? and since ?1 is correct, it would have delivered
these in the right order by time ir + ?. Therefore statej($ 0) = .... . r?. Fur-
thermore, the delivery of (if ever delivered) b = (mylastlog, so, (--H,rn,--H)) will not
change statei. This satisfies (4). Now let Final0 = .... . r? 0 ?n+1 ...
Since 5o would have broadcasted (log, so, (red, r?, --H)), n + 1 < i < m by time
? 0 and since 51 is correct and does not set bstatusi = NOTOK before time ?
(1) follows. Also (2) follows from Figure 6.12.
(3) can also be seen to be true because if ?1 did not deliver (log, So, (red, ?m,
then ?o could not have broadcasted (log, so, (red, ?m, )), and consequently from
Figure 6.12, could not have sent ?m to the client. 0
Lemma 6.3.16 Suppose q[2] = so and q[1] = ?1 Let ?1 call failure--Hmode--H
primary at time ? 1 and ? 1 is the greatest time that ?1 is the primary. Let
Isprimary[O, 2] = true for the first time at ? 0. Then zf sf(s1) then
1. mit0 E Final1.
2. If Final1/mit1 = iNil and R1 # e, where R1 E M1, then R1 was sent to the
client.
3. If Inito E Final1, then last(Finali) was not sent to the client.
?. If Init? = Final1 then Inito = Init?
149
Proof: Let $ 1 = ir + 26. From Lemma 6.3.5, 50 would have called backup by
time $ 1 and cannot set bstatus0 = NOTOK before time ? 1 + 6.
From Lemma 6.3.13, Init? E stateo($ 1). By Lemma 6.3.2 si is faulty and so
5o is correct. By assumption, si makes a successful broadcast of
b = (rnylastiog,si,(--H,last(Initi),--H)) and ?o will deliver this by time ? 1 + 6,
making the two states equal. This satisfies (4). Now let Final1 = Initi oro...
Therefore, from Figure 6.12, si would have successfully broadcast (log, si, (red, ?j,
--H)), i = ... . n --H 1, by time ? 1 and s? would have delivered these after b (since 5o
is correct) by time ? 1 + 6. This satisfies (1). (2) and (3) can be seen to be true
from Figure 6.12.			E]
Lemma 6.3.17 Suppose q[i] = sj, where i > 0 and sj calls fai1ure--Hmod?
primary at time $ 5. Furthermore, if i = 2, then q[1] did not initiate a successful
broadcast of the form (mylastlog, q[1], --H). Let ? 0 < $ 5 be the smaller of the
times that s? halts, or pstatuso = NOTOK or Isprimary[0, 1] = false (defined
by Lemma 6.3.9). Then
1. mit1 E Final0.
2. If Final0 = M0 and if R0 # e, where R0 c M0 and 1M01 --H RoI = 2, then R0
was sent to the client.
3. If Init? E Final0, then last(Finalo) was not sent to the client.
Proof: Let Final0 = r.... ?n?2 0 ?n?i 0 ??. (2) is true because .0... ?n?2 must
have been sent to the client by si. We now prove (1) and (3).
150
Case 1: i = 1. If j = 1, the result then follows from Lemma 6.3.13. On the
other hand, if j = 0, the result trivially follows.
Case 2: i = 2. If q[1] = 5j, then si can initiate only one broadcast of the form
(mylastlog,s?,(--H,r?,--H)), where k c (n--H1,n? (from Lemma6.3.13). Ifq[2] (i.e.
so) ever delivers this, then (1) is satisfied (from Lemma6.3.5 and Figure 6.11). The
precondition of (3) can be true iff ?1 broadcasted (mylastlog, 51, (--H, r??1, --H)) (as
so has to deliver this). This implies that 51 did not deliver (log, so,
and so r? could not have been sent to the client.
If q[l1 = so, then 5o would have called failure--Hmode-primary at time t' =
lT + 26 in detected--Hfailure (by Lemma 6.3.14) and so first = (1 + l)r. Also,
stateo(t') = Final0 = .0... r?. By Lemma 6.3.2 so is faulty and must have halted
at time t' because it did not initiate a successful broadcast at time i'. 5o must have
sent sequence of broadcasts (log, ?o, (green, rj, --H)), i = .... n, by time lr --H 26
and since ?1 is correct, it would have delivered these in the right order by time
Ir + 6. Therefore statei(t') = .... . ??. Furthermore, the delivery of (if ever deliv-
ered) (mylastlog, so, (--H, rn, --H)) will not change state? (from Figure 6.11) and (1)
is satisfied. (3) is trivially true because Initj = Final0. c
Lemma 6.3.18 Suppose q[i] = sj, where i > 0 and sj calls failure--Hmod?
primary at time ? j. Furthermore, q[ij initiates a successful broadcast of the
form (mylastlog, q[i], --H) and q[i] is the nth server, n C (1,21 to initiate such a
successful broadcast. Then
1. there is a run ap of a (p,n + 1,4r + 36 + p) bofo server P such that Vk
ap ?
151
2. tjEff(ap,n+l)andlff(0'p,n+l)I<4T+26.
3. statep(ap[t j]) = Initj
4. Vk : queuec?,p(o'p[t 11) =
5. Response r is sent to client ck in ap iff r is sent to client ck in a[t il.
Fufthermor? if r is sent at time t in ap, then r is sent at time t' in o'[t j],
where t ? t' < t + 6.
Proof: Case 1: n = 1. Let t be the smaller of the times when so halts, or
pstatuso = NOTOK or Isprimary[0, l? = false (defined by Lemma 6.3.9).
Assume that all the requests m that 5o responds to in the interval [0. .t] are
responded to in ap as well. m is received at the same times in o'PB and ap.
However, rn is enqueued at the same time it is received in ap (this has to be true
of all runs of P). We can do this from Lemma 6.3.8. Furthermore, in ap the
response r has to be sent at the same time the request m is received, whereas in
o'PB, r might be sent after 6 time units by si (this satisfies (5)). However, r has
to be received at the same times by the clients in o'PB and ap. We can do this
because r is received by the client within 26 after m is received, and p > 26.
Let stateo(t) = ro 0?1... r??i 0 r?. Let rn??1 and nt? be the requests cor-
responding to Tn?1 and r? respectively, which were received at times tn?1 and
t? respectively. Let the two outage intervals of ap be li = [tn?i..tn?i] and
`2 = [max(0,t --H T)..tl. From Lemma 6.3.17, Init? = R = ro or1... r??1 0 r??1
or mit5 = = ro 0 r1 . . . r??i 0 r? and ro 0 r1 . . . ?n?2 must have been sent to
the client. Therefore, from construction, these responses are sent in ap as well.
If r??i was not sent to the client, assume in ap that mn--Hi was received, but a
152
response never sent (outage belongs to li). If Init5 = 1?' then r? was not sent to
the client (from Lemma 6.3.17)and from Lemmas 6.3.10 and 6.3.11, Tn? must have
been sent in the interval [max(O, t T --H p)..t] (r > 2?). Assume in ap that Tn?
is never enqueued (outage belongs to 12) or hasn't yet been enqueued, depending
on the time m? is actually sent. We show that P receives no further requests
until time t j. Extend 12 as 12 = [max(O,t --H r)..t j]. Therefore, 1121 < 4T + 2?
(using Lemmas 6.3.4 and 6.3.6 one after the other). This satisfies (2). By con-
struction, all requests (other than m?) that were sent before ? j and were received
in ap?[t j] have already been received in ap. Furthermore m? is never received in
ap. All requests that were sent before ? j and are not received in ap?[t j] must
have been sent in the interval [t --H p. .$ jj (Lemma 6.3.8). Assume, therefore, that
in ap these requests are never enqueued (outage belongs to 12) or haven't yet been
enqueued. Therefore, we have shown that P receives no further requests until t j
and so mit5 = statep(t j) satisfying (1) and (3).
On the other hand, if Init5 = Al and if r? was not sent to the client, then
assume in ap that m? was received, but a response never sent. By Lemmas 6.3.10
and 6.3.11, this outage will again belong to 12. The rest of the argument then
remains as the previous paragraph, satisfying (1), (2) and (3).
We finally show that Vk : qUeuec??p(t j) = ?. By the above construction of
ap, all requests sent before $ j, other than m?, that are received in ap?[$ j] are
received in ap as well. Also, m? is not enquened at time $ j. Finally, all requests
sent before $ j that are not received in ap?[$ jj must must have been sent in
--H p..$ j] (by Lemma 6.3.8). Assume that these are never enquened in ap (again
outages lies in 12) or haven't yet been enqueued. Thus, Vk: qUeUec?,p($ j) =
153
Case 2: n = 2. From case 1, the lemma is true when q[l] = 5k calls failure--H
mode--Hprimary at time t k and so there exists a run a?1, such that V? :
o[t k]. Let 5k halt at time ? k. ?Ve now extend a?1. Extend the interval 12 by
6 + p. All requests sent in the interval [$ k --H p.. t k] that haven't been enqueued in
are never enqueued in ap. Futhermore, all requests sent in [? k. . t k + 6] that
are not sent to 5k are never enqueued in ?p. All these outages lie in 12. ?Ve now
extend a'? to ?p until t j using an argument similar to case 1.
Assume that all the requests m that s? responds to in the interval [t k. . $
are responded to in ap as well. rn is received at the same times in ap? and ap.
Furthermore, in ap the response r is sent at the same time it is sent in ap? (this
satisfies (5)) and r is received at the same times by the clients in both apB and
ap.
Define 13 = [? k..t il. Thus by Lemma 6.3.6,1131 < r + 26, which satisfies (2).
Using Lemmas 6.3.15 and 6.3.16, Jnitj E Fina1?. Let r? = last(Finat?/Init?)
and m? be the request corresponding to r?. By an argument similar to case 1, we
can show that Initj = statep(t j) with the outage corresponding to m? (if there
is one) lying in the interval 1a, which satisfies (1) and (3). Furthermore, we can
show that Vk : queuQ?,p(t j) = ?, satisfying (4).
Lemma 6.3.19 For any run aPB of the primary-backup system there exists a run
ap of the (p, 3,4r + 36 + p) bofo server P such that Vk: ap Nk apB.
Proof:
Case 1: q[i] =ffi for all i > 0. Therefore, from Lemma 6.3.4, so always remains
the primary. If so makes infinitely many broadcasts in ap?, then from Figure
154
6.10, all requests have responses. Therefore, assume that all the requests in that
?i responds to are responded to in ap as well. in is received at the same times in
?PB and ap. However, in is enqueued at the same time it is received in ap (this
has to be true of all runs of P). We can do this from Lemma 6.3.8. Furthermore,
in ap the response r has to be sent at the same time the request in is received.
whereas in ap?, r might be sent after ? time units by si. However, r has to be
received at the same times by the clients in ap? and ap. We can do this because
r is received by the client within 2? after in is received, and p > 2?. However, if s?
makes only finitely many broadcasts, then again from Figure 6.10, the last request
might not have a response. Assume that the last request in? is received by so at
time t. Let the first (and only) outage interval of ap be [t. .t] and assume in ap
that the request in is received, but a response never sent.
Case 2: q[i] = sj for some i > 0. Furthermore, let q[i] be such that q[i? never
halts in apB. There has to exist such an i <2 from Lemma 6.3.6 and the fact that
f = 1. Therefore, from Figure 6.12, q[i] initiates a successful broadcast of the form
(mylastlog, q[ij, --H) at time ? j and from Lemma 6.3.18, there exists a run a?' of
the (p, 3, 4T + 3? + p) bofo server P such that Vk: a?' N?k ap?[$ j]. It is now easy
to extend a'? to ap using an argument similar to case 2 of Lemma 6.3.18. E
Theorem 45 The protocol in Figures 6.9, 6.10, 6.11 and 6.12 satisfies Pb4.
Proof: Follows from Lemma 6.3.19.
E
Theorem 46 The protocol in Figures 6.9, 6.10, 6.11 and 6.12 satisfies PbS.
155
Proof: In fai1ur?mode--Hprimary, a server remains the primary until it halts.
Futhermore, so stops being the primary in procedure primary only if it omits to
receive a message from si. This, however, implies that either 5o or ?i must have
failed.			0
Theorem 47 The protocol in Figures 6.9, 6.10, 6.11 and 6.12 is 6-bThcking.
0
Proof: Follows from the description of the protocol.
Chapter 7
Implementation Results
As we saw in the preceding chapters, 0--Hblocking primary-backup protocols can
theoretically achieve the smallest possible response time. Unfortunately, these pro-
tocols cannot be constructed for some kinds of failures, for example send?omission
and general-omission failures. We, however, show in this chapter that 0--Hblocking
protocols are still very useful as they can be constructed for the kinds of failures
that are expected to occur in many practical primary--Hbackup systems. We also
implement two of our 0--Hblocking protocols, and compare the performance of these
protocols with 26--Hblocking protocols.
The chapter is organized as follows. Section 7.1 discusses the failure assump-
tions of our implementation. Section 7.2 describes our implementation, and Sec-
tions 7.3 and 7.4 discuss our results. We conclude the chapter in Section 7.5.
156
157
7.1 Failure Assumptions
NVe implement 0-blocking protocols under two kinds of failures: crash failures and
receive--Homission failures.' Crash failures are a reasonable assumption in a system
where network partitions and message losses are highly unlikely. Receive?mission
failures are a reasonable assumption in a system where partitions are unlikely,
but it is possible that some servers may become overloaded and miss messages, or
message buffers at these servers may become full and drop messages.
Even though the 0--Hblocking protocols that we implement can theoretically ac-
hieve the smallest possible response time, it is not clear whether the above re-
stricted set of failures is realistic in practical systems. However, we believe that
many primary--Hbackup systems will only have to tolerate a sufficiently restricted
set of failures to allow our 0--Hblocking protocols to be constructed. Given no extra
constraints, a practical primary--Hbackup system should have all servers on a single
local area network. This is because the time required between the failure of a pri-
mary and the takeover by a backup is determined by the bandwidth between the
primary and the backups. Using a single local area network makes partitions that
separate the servers unlikely. Other failures can still occur due to process crashes
and message losses. However, the kinds of message losses that are expected to oc-
cur on this type of network are restricted and correspond to our receivenmission
failure model. According to [ADKM92], as technology improves and newer, faster
networks (such as FDDI) are used, there will be mainly the following two causes
for message losses on a local area network:
`As we saw earlier, 0--Hblocking protocols can be constructed for crash+Iink failures as well.
However, to get meaningful results for this model, we would need multiple independent links
between servers, which were not available.
158
1. failure to intercept messages from the network at high transfer rates due to
interrupt misses;
2. buffer overfiows at the receiver.
As can be seen, these set of failures corresponds to our receive?mission failure
model, and therefore one can construct a 0--Hblocking primary-backup protocol for
such a system.
7.2 Description of the implementation
The two 0--Hblocking protocols that we implement are described in Chapter 5. ?Ve,
however, extend these protocols so that crashed servers can be re-intergrated. Us-
ing these primary--Hbackup protocols, we implement a service that maintains a fauft-
tolerant counter. Clients request a counter value from the service, and on receiving
such a request, the service responds by sending the current counter value to the
client and incrementing the counter. Informally, we require the counter service to
satisfy the following three properties:
1. Two different requests cannot return the same counter value.
2. No counter value can be skipped.
3. Let request R1 be sent by client ci and request R2 be sent by client C2, where
ci may be the same as C2. If the response to R1 was received by ci before R2
was sent by C2, then R2 cannot return a smaller counter value than the one
returned by R1.
This is a simple service, yet it is a practical one; for example, such a service
is a central component of a multicast transport protocol designed by Apple and
159
Xerox [AFNI92]. Furthermore, even this simple counter service is not easy to
implement because the counter has to be replicated across servers, and the various
copies of the counter have to be kept consistent with each other. Our primary-
backup protocol, therefore, was not simplified by using a simple service semantics.
Also, since we only analyze the tradeoffs inherent in primary--Hbackup protocols (and
not the tradeoffs due to a particular service), we believe our results are applicable
to other service semantics as well.
In order to compare the response time of our 0--Hblocking protocols with the re-
sponse time of 2?--Hblocking protocols, we also implement a simple 2?--Hblocking pro-
tocol that tolerates server crashes and transient network partitions. This protocol
has the same primary and backup procedures as in Chapter 5. The implemen-
tation of the procedures broadcast, delivery--Hprocess and failure--Hdetector is
informally described below, and the implementation of the procedures initialize
and deliver is the same as in Figure 5.5 for crash failures.
broadcast: The procedure first sends the message to be broadcast to all the back-
ups, and then waits for these messages to be acknowledged. Either all of the
backups acknowledge or a timeout occurs. If a timeout occurs, then another copy
of the message is sent to the backups that did not acknowledge the first message.
Such retries happen for some fixed number of times. If some backup sj still does
not acknowledge, then it is assumed that Sj has crashed as we are assuming that
network failures are transient.
delivery--Hprocess: Similar to the implementation for crash failures in Figure 5.5,
this delivers any new message that is received. In addition, an acknowledgement is
sent to the sender of the message. Note that in failure--Hfr?? fUH?, thi? `iliplernenta-
160
tion is more efficient in the number of messages sent than the implementation for
receive--Homission failures however, it is less efficient than the implementation for
crash failures.
failure--Hdetector: Similar to the implementation for crash failures in Figure 5.5,
except that an alive message may need to be sent multiple times (as in the
implementation of broadcast) in order to tolerate network failures.
In our implementation, we assume that there is a known upper bound on the
message delivery time and that the servers have approximately synchronized clocks
[Cn89,KO87,LMS85J. The implementations were written in C and run on RS/6000
550s with AIX Version 3.1. The machines were connected over a 16 Mbit/s token
ring and messages were sent using UDP sockets.
7.3 Results
In this section, we analyze the tradeoffs in failure--Hfree runs between 0--Hblocking
and 2?--Hblocking protocols with respect to the response time. In the next section,
we discuss how these results can be extended to runs in which failures do occur.
We first show our results for the protocols that use point-to--Hpoint communi-
cation, and then show the corresponding results for hardware broadcasts. Due to
space limitations, we only present a subset of our experiments. Our conclusions in
the chapter, however, are consistent with the experiments that we have omitted.
7.3.1 Communication using point--Hto--Hpoint messages
In the following sections, we plot the average response time of the protocols versus
the degree of replication, while keeping other parameters constant. The maximum
161
degree of replication that we consider is 5 and the maximum number of clients that
we consider is 2. This is because we had only a limited number of machines on
which we could run our tests. However, for each figure we discuss how we expect
the results to extend to higher degrees of replication and to more clients. Also,
in these figures, we only show the response time for odd values of n for receive--H
omission failures, where n is the degree of replication. This is because, as shown in
Chapter 4, the 0--Hblocking receive?omission protocol can only tolerate up to [?fl2] --H 1
server failures. Finally, in order to control the frequency of requests that are sent
by the clients, we add a parameter to the client protocol. We call this the compute
time, and it corresponds to the amount of time a client waits after receiving the
response to a request before it sends out the next request.
Figure 7.1 shows ?he average response time (in the absence of failures) for the
three protocols when there was a single client with a compute time of 10 ms. As a
comparison, we have also shown the response time in the absence of any replication
?.e. n = 1. As can be seen from the figure, the two 0--Hblocking protocols can achieve
a small response time for all degrees of replication. However, surprisingly this was
no longer true in Figure 7.2 where we have added one more client to the system.
The above behavior occurs because of the following reason. As shown in Chap-
ter 5, the receive?mission protocol generates a large number of messages. On
a shared medium like the token ring, these messages can interfere with future
requests sent by the clients, thereby increasing the average response time. In Fig-
ure 7.2, we see that the interference was large enough to make the response time
of the 0--Hblocking receive?mission protocol greater than the response time of the
26--Hblocking protocol. In fact, Figure 7.3 shows a run in which we have increased
162
16
14
Response 12
Time			10
(in ms)			8
6
18
crash ?
receive-omission I
26-blocking El
2
1			2			3			4			5			6
Number of servers
Figure 7.1: Avg. response time vs. degree of replication. Number of clients = 1
and compute time = 10 ms
the compute time to 30 ms. In this case, since fewer requests are being sent per
unit time, interference is less likely, and the average response time becomes smaller
again.
To verify that the behavior of the receive?mission protocol in Figure 7.2 is
due to this interference, we changed the service slightly so that the clients could
send query requests in which the clients just queried about the current value of
the counter. When the primary receives such a query request, it does not have to
inform the backups about any update (there is none) and can immediately respond
to the client. Thus, if the clients send a large number of query requests, then the
average number of messages sent per request are reduced, reducing the interference.
The result is shown in Figure 7.4. As in Figure 7.2, there are two clients, each with
a compute time of 10 ms. However, in Figure 7.4, half of the requests sent by
the clients are query requests. As can be seen, the average response time is now
163
16
14
Response 12
Time			10
(in ms)			8
6
18
crash ?
receive-omission I
2?-block' g E
2			I			I
1			2			3			4			5			6
Number of servers
Figure 7.2: Avg. response time vs. degree of replication. Number of clients = 2
and compute time = 10 ms
smaller than in Figure 7.2.
We now attempt to extrapolate the results in this section to higher degrees of
replication and to more clients. Clearly more experiments are needed in order to
verify these conclusions. We expect the graphs of the protocol for crash failures
and the 26--Hblocking protocol to remain essentially the same, with response time
increasing more or less linearly with the number of servers becasue both of these
protocols only send a linear number of messages per request. Furthermore, the
average response time of the crash failure protocol should always remain less than
the response time of the 26--Hblocking protocol, because the crash failure protocol
never sends more messages than the 26--Hblocking protocol.
On the other hand, if we add more servers in Figure 7.1, then we can expect
the response time of the receive?mission protocol to eventually become greater
than that of the 26--Hblocking protocol because more servers imply more messages
164
18
16			crash ?
14			receive-omission I
Response 12			2?--HbIocking E
2
1			2			3			4			5			6
Number of servers
Figure 7.3: Avg. response time vs. degree of replication. Number of clients = 2
and compute time = 30 ms
per request, which then implies more interference. Furthermore, the larger the
number of clients, the earlier the two graphs will intersect. However, if we lower
the frequency of the requests (as in Figure 7.3), then we can expect the interference
to become smaller again, which will move the intersection point to higher degrees
of replication.
7.3.2 Communication using hardware broadcasts
In the previous section, all communication between servers was through point-
to-point messages. However, most local area networks allow for more efficient
communication by using hardware broadcasts. We therefore reimplemented all our
of protocols to use broadcast communication wherever possible.
The new experiments corresponding to the parameters in Figures 7.1 and 7.2
are shown in Figures 7.5 and 7.6 respectively. In both figures, the graphs of the
two 0--Hblocking protocols are almost indistinguishable. As can be seen again, a
165
16
14
Response 12
Time			10
(in ms)			8
6
crash ?
receive-omission
26--Hblocking 0
2			I
1			2			3			4			5
Number of servers
6
Figure 7.4: Avg. response time vs. degree of replication with query requests.
Number of clients = 2 and compute time = 10 ms
very small response time can be achieved by using 0--Hblocking protocols. Also, as
expected, the response time of most of the protocols is smaller than the corre-
sponding response time for point--Hto--Hpoint communication because fewer messages
were being generated (one message instead of n messages in most cases). A smaller
number of messages means that there is less interference. The only exception was
the crash failure protocol with n = 2. This was because only one message was
being sent (to the lone backup) per request, both in the point--Hto--Hpoint and the
broadcast case. However, when this message was sent using a hardware broad-
cast, the primary received its own message, which slowed down the subsequent
response. However, what surprised us more was that the 0--Hblocking protocol for
receive?mission failures gave a flat curve--Hthe response time was independent of
the number of servers. We expected such a behavior for the crash failure protocol
because in this case, the primary sends just one broadcast message per request,
166
18
16
14
Response 12
Time			10
(in ms)			8
6
4.
2
crash ?
receive-omission I
2?--Hblocking 0
1			2			3			4			5			6
Number of servers
Figure 7.5: Avg. response time vs. degree of replication. Number of clients = 1
and compute time = 10 ms
independent of the number of servers. In the receive?mission protocol, however,
each backup relays the primary's message, and therefore the total number of relays
depend on the number of backups. Thus, more relays should have caused more
interference, resulting in higher response times.
The next experiment (shown in Figure 7.7) showed us the reason for the re-
sponse time being independent of the degree of replication. We reduced the com-
pute time to 2 ms, thereby increasing the possibility of interference. We now saw
the positively sloped curve for receiv&omission failures that we expected earlier.
In the earlier figures, the number of messages generated by the backups was not
sufficient to interfere with future requests, because a small number of requests were
being sent per unit time. On the other hand, when we increased the frequency of
the requests, the interference became large enough to increase the response times.
Note, however, that the response time of the 2?--Hblocking protocol was never
167
16
14
Response 12
Time			10
(in ms)			8
6
4:
18
crash ?
receive-omission
2?-bIocking 0
2			I			I
1			2			3			4			5
Number of servers
6
Figure 7.6: Avg. response time vs. degree of replication. Number of clients = 2
and compute time = 10 ms
independent of the number of servers. As described in the implementation of
the broadcast procedure for the 2?--Hblocking protocol, the primary waits for the
backups to acknowledge before responding to the client. Thus, the number of
messages that the primary waits for increases with the number of servers, causing
the response time to increase as well.
We now extrapolate our results to higher degrees of replication and to more
clients. Keeping other parameters constant, we should expect the response time
of the crash failure protocol to be always independent of the number of servers.
This should be the case, because as discussed before, the primary always sends
one message per request. However, the exact value of this response time will
clearly depend on the number of clients and the frequency of requests. On the
other hand, for reasons described in the previous paragraph, the response time of
the 26--Hblocking protocol should increase more or less linearly with the number of
168
16
14
Response 12
Time			10
(in ms)			8
6
crash ?
receive-omission I
2?--HbIocking E
9			I			I			I I
1			2			3			4			5
Number of servers
6
Figure 7.7: Avg. response time vs. degree of replication. Number of clients = 2
and compute time = 2 ms
servers.
The situation is a little more complicated for receive?mission failures, because
our experiments obtained both flat and positively sloped graphs for this protocol.
Given the results in Figure 7.7, one can expect the response times in Figures 7.5
and 7.6 to eventually start increasing, possibly as a step function, as the number
of servers increase. A similar behavior should be seen if we increase the number of
clients. However, unlike the point--Hto--Hpoint case, the response time of the receive--H
omission protocol should always be less than the response time of the 2?--Hblocking
protocol. This is because the former never sends more messages than the latter.
7.4 Extending the results to runs with failures
We did not consider failures in the previous section because the results in the case
of failures will depend on the underlying system and the particular service being
169
implemented. For example, the amount of time it takes to reintegrate a new server
usually depends on the service semantics and the amount of information that has
to be sent to the new server.2 Similarly, the failover time depends on the maximum
message delivery time. However, we can still draw some useful conclusions about
runs with failures from our results.
If failures are rare and the 0--Hblocking and 26--Hblocking protocols have similar
failover and reintegration times, then it is easy to see that the results in the previous
sections should still be qualitatively true i.e. the 0--Hblocking protocols can achieve
a smaller response time than 26--Hblocking protocols in many cases. In addition,
these results can also be used in deciding the right degree of replication to be used
when implementing these protocols. If servers use point--Hto--Hpoint communication,
then we saw that the average response time of all protocols (0--Hblocking and 26--H
blocking) rises with the number of servers. Therefore, if failures are infrequent
(which is usually the case), and reintegration does not cost too much (which was
also true in our counter service), then in designing a primary--Hbackup service that
uses point--Hto--Hpoint communication, one should keep the degree of replication as
low as possible, and reintegrate servers only when the older servers have crashed.
This should result in a smaller average response time since the degree of replication
is always kept low.
On the other hand, if communication can be done using hardware broadcasts,
then we saw that the response time of 0--Hblocking protocols can be independent
of the number of servers. In this case, it may be possible to keep the degree of
replication for these protocols high (since this does not increase the response time),
2!n our case, reintegration took almost no time because only the current counter value and
some information regarding the last request sent by each client needed to be sent to the new
server.
170
and save on some reintegration cost as well, further reducing the response time. We
can achieve this by not doing any reintegration until a large number of servers have
crashed, and then reintegrating more than one server at a time, thereby saving on
the total reintegration time. This clearly assumes that servers can be reintegrated
in parallel without too much additional expense.
7.5 Discussion
In this chapter, we analyzed a subclass of primary--Hbackup protocols, i.e. 0-
blocking protocols, and showed that these protocols can be constructed for the
kinds of failures that are expected to occur in many primary--Hbackup systems. As
expected, our experiments showed that these protocols can achieve a very small
response time as compared to the 2?--Hblocking protocols, and therefore, these pro-
tocols should be preferred when designing many primary-backup systems.
However, our results also showed that the 0-blocking property is not always
enough to achieve a small response time. We designed the 0--Hblocking protocol
tolerating receiveomission failures using the translation technique [NT88], as this
technique resulted in a protocol that used the optimal number of servers and had
the optimal failover time. This technique, however, can create protocols that send
a large number of messages, and as we showed in Section 7.3, this large number
of messages sometimes can lead to overly-high contention on the network, causing
the response time of the receiveomission protocol to become large. Thus the
communication cost of a protocol is also an important parameter in determining
the response time, and we are currently looking at ways to reduce this cost.
Chapter 8
Survey of Existing
Primary--HBackup Protocols
This chapter presents a brief survey of the primary--Hbackup protocols that exist
in the literature. This survey is by no means complete because we have only
included protocols that we consider to be well known. If the reader is interested,
the bibliography gives references to some other protocols that we have omitted.
Furthermore, in this survey it is not possible for us to talk about every possible
characteristic of the primary--Hbackup protocols. We will, therefore, concentrate
only on those characteristics that relate to this thesis. These are the system model
(asynchronous or synchronous), the types of failures that are tolerated by the
protocols, and the values of the three metrics--Hdegree of replication, blocking time
and the failover time.
The rest of the chapter presents the various protocols in chronological order.
Section 8.1 presents a primary--Hbackup protocol due to Alsberg and Day [AD76],
Section 8.2 presents Tandem [Bar8i], Section 8.3 presents Viewstamp Replica-
171
172
tion [OL88], Section 8.4 presents ECHO [MHS891, Section 8.5 presents HA-NFS
[BEM9l] and Section 8.6 presents HARP [LGC+91]. Most of the descriptions
of the protocols have been taken directly from the above references. They have,
however, been edited for conciseness. Furthermore, while giving the values for
blocking time and failover time, we will assume that local computation at a server
takes negligible time (since this is what we had assumed in our system model), and
concentrate only on the delays due to message delivery.
8.1 The Alsberg--HDay Protocol
We believe this protocol to be the earliest primary--Hbackup protocol appearing in
the literature. The protocol is synchronous, and tolerates server crashes.1 The
protocol that we outline in this section only tolerates a single server crash, but it
can be extended to tolerate multiple failures.
In this protocol, a client sends a request to the service and then blocks waiting
for either a response from the service or a timeout.
o+ If the request arrives at the primary, then the primary performs the requested
update, sends a state update message to the backup, and blocks. The backup,
upon receiving the state update message, updates its state, sends the response
to the client, and finally sends an acknowledgement to the primary saying
that it performed the update. On receiving the acknowledgement, a primary
can unblock and process the next pending request.
o+ If the request arrives at the backup, then the backup forwards the request to
`The authors also claim that the protocol tolerates network partitions. However, during
partitions the primary and the erstwhile backup can diverge, giving inconsistent responses. In
the analysis that follows, we assume that partitions do not occur.
173
the primary. The primary, upon receiving the forwarded request, performs
the update, sends the response to the client, and finally sends a state up-
date message to the backup (which then updates its state and discards the
request).
Failures are detected by lost acknowledgement messages. In addition, failures
are also detected by sending periodic "Are you alive" messages. In case the primary
fails, the backup takes over as the new primary.
The above protocol requires two servers to tolerate a single server crash and
has a blocking time of ?, where 6 is the maximum message delivery delay. The
failover time depends on the frequency that "Are you alive" message are sent. If
we assume that the period between "Are you alive" messages is T, then the failover
time for this protocol is r + 26.
8.2 The Tandem Protocol
This protocol is synchronous and is designed to tolerate a single crash+link failure.
Any Tandem system consists of multiple nodes connected by a network. Each of
these nodes consists of multiple processor and 1/0 controller modules intercon-
nected by redundant buses. Each processor in the node can support concurrent
processes (system or application), and the goal of the system is to make these
processes fault--Htolerant.
Processes are made fault--Htolerant by using process-pairs. Process pairs are
implemented by replicating each process on two different processors in the node,
with one process being the primary and the other being the backup. Requests are
sent to the primary of such a pair. The primary then sends a state update message
174
to the backup over one of the redundant busses. Once an acknowledgement is re-
ceived from the backup, the response is sent to the client. If an acknowledgement is
not received for some time (one second in the protocol), then the underlying mes-
sage mechanism resends the state update message over the second bus. Sequence
numbers are used in order to prevent duplicates.
The backup process becomes the primary when it detects that the processor
on which the primary resided has crashed, as follows. Every processor in the
node periodically sends an "I am alive" message to all other processors, over all
the redundant buses. If such a message is not received from a processor, then that
processor is declared crashed and any backup whose primary was on that processor
becomes the primary.
The above protocol uses two processors to tolerate a single failure. As can
be seen from the description, the blocking time is 2? (one seconds in the proto-
col). Also, the failover time of the protocol is 4 seconds. However, just from the
description in the paper, we are unable to relate this time to r and ?.
8.3 The Viewstamped Replication Protocol
This protocol is asynchronous and tolerates failures of both the servers and the
network. Servers can crash, and the network can lose, delay and duplicate mes-
sages, or deliver messages out of order. Furthermore, link failures can cause the
network to partition into subnetworks.
The computation model consists of objects (called modules) and objects can
make calls on each other. The caller object is the client and the callee is the
server. The goal is to make the objects fault--Htolerant by using replication. This
175
is done by designating one copy of the object as the primary and the others as
backups. The primary is responsible for the processing of transactions [13HG871
that use the object; it notifies the backups of what it has done.
The various replicas of an object monitor each other by sending periodic "I am
alive" messages. If a replica notices that it is not communicating with some other
replica, or if it notices that it is communicating with a replica that it could not
communicate with previously, it initiates a view change. This view change might
result in a reorganization with a new primary being selected if necessary. However,
if a view change occurs, then it is ensured that the effect of committed transactions
survives this change. Informally, this is guaranteed by ensuring that the new view
contains at least a majority of the replicas. One of these replicas must have been
active in the last view, and therefore would have an up-to-date copy of the object's
state. This up-to-date replica is identified by using a new kind of timestamp, called
viewstamp, and is then used to bring the other replicas in the newly formed view
up-to-date.
However, the view change algorithm given in the paper is not guaranteed to
terminate. Message delays due to the asynchrony of the system might cause the al-
gorithm to run forever without selecting a new primary. Thus, the primary-backup
protocol is not live, and as we showed in Chapter 2, this is in fact unavoidable in
an asynchronous system.
The above protocol tolerates any number of server failures. However, it requires
at least a majority of the servers to be up in order to make progress. Since the pro-
tocol is asynchronous, the blocking time and the failover time are both unbounded.
However, for asynchronous protocols, it is possible to change the definition of block-
176
ing so that a blocking protocol is one that requires a message exchange before a
response can be sent, whereas a non--Hblocking protocol does not require such an
exchange.2 We do not address this issue in this thesis.
8.4 ECHO
This protocol is asynchronous in the sense that there is no bound on the speeds
of servers or messages. However, the protocol assumes that each server has a local
clock, and as long as the server is up, this clock has a bounded drift rate of p. This
clock will be used by the protocol to measure time intervals. The ECHO protocol
tolerates sever crashes and network failures. The network may duplicate or lose
messages, but it does not alter or spontaneously generate messages.
The protocol was. designed to provide a replicated, fault--Htolerant file system.
Copies of a file are replicated across severs, with one server being the the primary
and the remaining servers being backups. A server 5 can become the primary only
if a sub-majority of the other servers (who will act as backups) accept s as the
primary. Once 5 becomes the primary, it remains so for some bounded amount of
time, following which it has to ask the backups again whether it can continue to
be the primary. This bounded time interval can be measured using the clock at
s. Since a server can become the primary only if it can convince a majority of the
servers to be its backups, the protocol ensures that there is at most one primary
at any time.
Read and write requests to a file are sent to the primary. If the primary receives
a read request, it handles it locally because there can be no concurrent writes in
2The viewstamped replication protocol is, therefore, blocking under this definition.
177
progress (all writes have to go through the primary). If a write request arrives,
then the primary sends the write request to the backups, and starts an internal
timer. If the timer expires before all the acknowledgements are received, then an
exception is raised. On the other hand, if all the acknowledgement are received,
then the write completes. This ensures that completed writes survive failures in
a manner similar to the protocol in the previous section (however, the techniques
used to identify an up-to-date server are quite different). The protocol also allows
for caching at the clients. However, before a client can cache a copy, the client has
to obtain a token for the file from the primary. Information about these tokens is
also replicated at the servers.
The ECHO protocol tolerates any number of server failures, but requires at least
a majority of the servers to be up in order to make progress. The protocol, however,
is not live. In particular, it is possible that due to asynchrony, the protocol may
never elect a new primary after a failure, and so the failover time is unbounded.
The severs in the ECHO protocol maintain clocks that run at the rate of real time,
and, therefore, the blocking time of the protocol is bounded. As described in the
previous paragraph, whenever the primary receives a write request, it sends that
request to the backups and starts an internal timer. If we assume that the length
of this timer is ? (the notation used in the paper), then the blocking time of the
protocol is ?(1 + p).
8.5 HA-NFS
This is a synchronous protocol and the goal of this protocol is to provide a highly
available network file server (HA-NFS) under crash+link failures. The protocol
178
tolerates a single crash failure by using two servers. One server is the primary, the
other is the backup. The servers are connected to a dual-ported disk (in reality,
there could be multiple disks). Only one server (the current primary) has access
to the disk at any time. Disk failures are tolerated by mirroring the disk, and link
failures are tolerated by replicating the network between the clients and the servers.
The dual-ported disk is used as an additional communications link between the two
servers.
During normal operation, client requests are sent to the primary, which writes
the updates to the disk and then replies to the client. The primary does not
inform the backup of the update, because the disk is dual-ported and the backup
can access the disk when it takes over as the primary. The only communication
between the two servers during normal operation is to exchange periodic "Are you
alive" messages that must be acknowledged.
In case the backup does not receive an acknowledgement after repeated "Are
you alive" messages, then either the primary has crashed or the link between
the primary and the backup has failed. In order to prevent the backup from
taking over while the primary is still up, the backup tries to communicate with the
primary using the dual-ported disk hardware. If the backup finds that it cannot
communicate with the primary even over this redundant link, then it becomes the
new primary and takes over control of the dual-ported disk.
The above protocol requires two servers to tolerate a single failure, and has a
blocking time of zero. The failover time depends on the interval between successive
"Are you alive" messages, as well as the number of retries of this message. If
we assume that these values are T and k respectively, then the failover time is
179
r + (k + 1)2?. To simplify things, we have assumed that it takes the same amount
of time to communicate over the link or the disk hardware.
8.6 HARP
This protocol is asynchronous, but the servers have local clocks which are approx-
imately synchronized with the clocks of the other servers. The goal of the protocol
is to provide a reliable file server, and it tolerates power failures, servers failures
and network failures. The network may lose or duplicate messages, or deliver them
late or out of order; it addition it may partition so that some servers are unable to
send messages to some other servers. The protocol described in the paper tolerates
a single failure, but can be extended to tolerate multiple failures.
In this protocol, the copies of the files are replicated over two servers, with one
server being the primary and the other being the backup. There is a third server,
called the witness which takes part in the protocol in case of failures. Updates are
sent to the primary, who first informs the backup and waits for it to acknowledge.
Similar to ECHO, if an acknowledgement is not received within some timeout
interval, then an exception is raised. On the other hand, if an acknowledgement
is received from the backup, then the primary replies to the client. Harp achieves
good performance by recording the effects of an operation in a log that resides
in volatile memory; operations in the log are applied to the file system in the
background. However, Harp tolerates power failures by equiping the servers with
an uninterruptible power supply (UPS). This allows the severs to write the logs to
disk in case the power fails.
The primary and the backup exchange ?`Are you alive" messages with each
180
other in order to detect failures. In case one of the servers detects that it cannot
communicate with the other server, it tries the form a new view along with the
witness, and attempts to elect a new primary if necessary. A new view can be
formed only if it has two servers in it. Since there are only three servers, this
prevents the primary and the backup from concurrently forming two independent
views. The view change protocol given in the paper requires two phases of message
exchange in order to form a new view; one to send a request to the witness asking
it to join the new view, and the second to inform the witness of the formation of
the new view. However, since the protocol is asynchronous, it is possible that due
to message delays, a new primary is never elected.
The protocol requires 2f + 1 servers in order to tolerate f failures, and has un-
bounded failover time. Even though the protocol assumes that the servers maintain
synchronized clocks, it does not mention whether these clocks are also synchronized
with real time. However, if we assume this fact, and assume that these clocks have
some bounded drift rate p, then as in ECHO, Harp also has a blocking time of
?(1 + p) for some timeout interval ?.?
8.7 Summary
Table 8.1 summarizes the values of the three metrics for the protocols surveyed
in this chapter. Since most of the protocol descriptions in this chapter tolerate only
a single failure, the table shows the values of the metrics assuming that the number
of failures to be tolerated equals one. Some of these, however, can be generalized
as described earlier.
We now briefly compare the results in Table 8.1 with those in Table 4.1. How-
3On the other hand, if we do not assume this, then the protocol has a blocking time of 00.
181
Table 8.1: Summary of survey results
Protocol			Failure			Degree of Blocking			Failover
name			type			replication			time			time
Alsberg &
Day			crash			2			6			T + 26
crash +
Tandem			link			2			26
Viewstamped			crash +
replication			partitions			3			Co			Co
crash +
Echo			partitions			3			?(l + p)			Co
crash +
HA-NFS			link			2			0			T + (k+ 1)26
crash +
Harp			partitions			3			?(l + p)			Co
Not available.
ever, since we have c6nsidered only synchronous primary--Hbackup protocols in this
thesis, we restrict our attention to Alsberg & Day, Tandem and HA-NFS.
In the Alsberg & Day protocol, a backup can recive requests from the clients,
and therefore, the protocol does not satisfy Pb3. However, for crash failures, our
lower bound results do not depend on Pb3, and so the Alsberg & Day protocol
is optimal for degree of replication and not optimal for blocking time. The sub-
optimal blocking time is the result of allowing the backup to send a response.
The paper is not clear why the authors chose to allow this. One can hypothesize
that they were concerned with transient link failures. In particular, suppose the
protocol were changed so that the primary sent the response to the client after
queueing the state update message to be sent to the backup. Now if the primary
crashes before the state update message is sent to the backup, then a client has
received a response to a request that was never received by the backup. This would
violate Pb4. By having the backup send the response, as done in the protocol, if a
partition does not occur, then both the primary and the backup will update their
182
state with respect to any request. The failover time of the protocol depends on
the frequency that "Are you alive" message are sent. If we assume that the period
between "Are you alive" messages is T, then the failover time for this protocol
is T + 2?. The protocol does not, however, use synchronized clocks. Our upper
bounds on failover times do assume synchronized clocks. Thus, our upper bounds
on failover time are incomparable. We do not know whether this protocol achieves
optimal failover time.
The Tandem protocol uses two servers to tolerate a single crash failure, and
two links to tolerate a single link failure. Since there are two links between the
two servers, and only one of these links can fail, it turns our that our crash failure
bounds apply to this protocol. This is for the following reason. Our system model
assumed that there is exactly one link between any two processes. In the Tandem
protocol, it is assumed that no more than one of the two links can be faulty. If
this is the case, then any message can be simultaneously sent over both the links,
thus guaranteeing that at most one copy of the message can be lost due to link
failures. All our bounds that hold in the absence of link failures can be applied
to protocols, like Tandem's, that utilize multiple links. The Tandem protocol,
therefore, has optimal degree of replication. The blocking time for this protocol is
2?, and this is not optimal. However, using our optimal protocol would increase
message traffic, which Tandem might not want to do. Finally, because this protocol
does not assume synchronized clocks, the optimality of its failover time remains an
open question.
As with the Tandem protocol, our lower bounds for crash failures apply to HA-
NFS as well because only one of the communication channels between the servers
can fail. The HA-NFS protocol requires two servers to tolerate a single failure, and
has a blocking time of zero. Thus the protocol has optimal degree of replication
and optimal blocking time. The failover time depends on the interval between
successive "Are you alive" messages, the number of times it is sent before detecting
183
a failure, and the time needed to c?mmunicate using the disk as a channel. The
optimality of the failover time remains an open question because this protocol does
not assume synchronized clocks (and our bounds do).
Chapter 9
Conclusions
This thesis presents a systematic analysis of the primary--Hbackup approach, both
from the theoretical viewpoint of specification, lower bounds and upper bounds,
as well as from the practical viewpoint of performance tradeoffs among protocols.
In Chapter 2, we begin by showing that it is impossible to construct primary--H
backup protocols that are both safe and live in an asynchronous environment.
Chapter 3 then presents a precise specification for primary--Hbackup protocols in a
synchronous environment, and using this specification, Chapter 4 presents lower
bounds for three key cost metrics of primary--Hbackup protocols--Hdegree of replica-
tion, blocking time and failover time. These lower bounds characterize the mini-
mum number of servers needed to implement any protocol (ns), the amount of time
it can take to respond to a client request, and the amount of time during which
there may not be a primary. The lower bounds depend on three parameters--H
the message delivery delay (?), the number of failures to be tolerated (f), and the
class of failures to be tolerated. In our analysis, we consider the following classes of
failures crash, crash+link, send?mission, receive?mission and general?mission
184
185
failures.
In some cases, our lower bound results were very surprising. For example, in
earlier results that we are aware of, the degree of replication only affected the cor-
rectness of the protocol, and never the performance. However, for receive?mission
failures, it is the case that the degree of replication affects both the correctness
as well as the performance of the protocol. In particular, if we increase the de-
gree of replication, then we can construct a more efficient protocol by reducing
the blocking time. Using these lower bounds, we have also been able to under-
stand some previously unexplained properties of primary-backup protocols. For
example, some protocols use more than f + 1 servers in order to tolerate f fail-
ures. Our lower bounds on the degree of replication, however, show that this is in
fact necessary for some kinds of failures. Similarly, in some other protocols, the
response time is large because the primary has to wait for acknowledgements from
the backups before it can respond to a client request . Again, we have been able
to show that for some kinds of failures, this is also necessary.
Chapters 5 and 6 show primary-backup protocols, and prove that all except
two of our lower bounds are tight. These two lower bounds correspond to the de-
gree of replication and the blocking time respectively for receive?mission failures.
Showing these lower bounds tight, therefore, remains an open problem. Here also,
there were some surprising results. For example, in case of receive?mission fail-
ures, the optimal protocol in one case had a surprising (and undesirable) scenario:
a non-faulty primary is forced to relinquish control to a backup that it knows to
be faulty! The existence of such a scenario is, however, not peculiar to our pro-
tocol. We show that relinquishing control to a faulty backup is indeed necessary
lS6
Table 9.t: Summary of lower bound and upper bound results
Failure			Degree of			Blocking			Failover
model			replication			time			time
crash			fls > f			0
crash+link			fls > f + 1			0			2f?
receive			6 if ?s = 2f and f = 1
omission			?s > [?J * ?			26 if ?s < 2f and f> 1 *			2f6
0 if n8 > 2f
send			6			if f =			1
omission			?s > f			26			if f>			1			2f6
6			iff=			1			2f6
omission			?s > 2f			26			iff> 1
D < F assumed.
* Bound not known to be tight.
to achieve optimal protocols in this failure model. Another surprising property
was the following. In some protocols, in order to achieve the optimal response
time, the sever that received the request (i. e. the primary) was not the server that
sent the response to the client. We show that this is necessary as well. Table 9.1
summarizes our lower bound and upper bound results.
As we had mentioned earlier, the contributions of this thesis are not entirely
of a theoretical nature. In Chapter 7, we implement and analyze an interesting
subclass of primary--Hbackup protocols--Hprotocols that can theoretically achieve the
smallest possible response time. Even though these protocols, called 0--Hblocking
protocols, can only be constructed for a restricted class of failures (as shown in
Chapter 4), we argue that these are exactly the kinds of failures that are expected
to occur in many practical systems, and as a result, one can implement 0--Hblocking
primary--Hbackup protocols for these systems.
In order to obtain our performance results, we implement these protocols on a
cluster of RS/6000 550s connected over a 16Mbit/s token ring. As expected, our
187
experiments show that these protocols can achieve a very small response. In fact,
some of the response times achieved by these protocols were less than 1 ms larger
than the response time of an unreplicated service.
However, surprisingly, there were cases under which the 0--Hblocking protocol
for receive--Homission failures did not achieve a small response time. In particu-
lar, when servers used point-to--Hpoint communication, then under certain circum-
stances (which depended both on the number of servers and clients, and on the
frequency of client requests), this protocol lead to overly-high contention on the
network, causing its response time to become large. Thus, the 0-blocking property
of a protocol may not sufficient under all conditions to achieve a small response
time, and the message cost of a protocol might become the dominating factor un-
der high loads. We are currently looking at ways to reduce the message cost of our
protocols.
Finally, in Chapter 8, we present a brief survey of existing primary--Hbackup
protocols. Applying our results to the synchronous protocols, we show that the
Alsberg & Day protocol, Tandem and HA-NFS have optimal degrees of replication.
HA-NFS also has optimal blocking time, whereas Alsberg & Day and Tandem
do not have optimal blocking times. Our results on failover time, however, are
incomparable because we assume synchronized clocks and the protocols we survey
do not assume this.
The results of this thesis appear in the following papers: [BMST9l,BMST92a,
BMST9Th,BM92,BMST93j
9.1 Future Directions
In this thesis, we analyze only synchronous primary-backup protocols because
we had shown that it is impossible to construct safe and live protocols in an asyn-
chronous environment with failures. Our impossibility result relied on the fact that
consensus can not be solved in an asynchronous environment. However, weakenings
188
of the asynchronous model, in which consensus can be solved, have been studied
in the literature [DLS88,CT9li, and therefore, it might be possible to construct
primary--Hbackup protocols in such a weakened environment. Finding lower bounds
and upper bounds on primary-backup protocols in such cases remains an open
problem.
As we had mentioned earlier, another technique for building replicated services
is the state--Hmachine approach. Even though both the state-machine approach and
the primary-backup approach have been used in practice, the tradeoffs between
these two approaches have not been systematically studied. For example, it is
not known which of the two approaches is better suited to any particular applica-
tion. However, given the results of this thesis about the primary--Hbackup approach,
and similar existing results for the state--Hmachine approach, we hope that it will
be possible to compare and contrast the two approaches, and better identify the
applicability of either approach to a given application.
Bibliography
[AD76]
[ADKNI92]
P.A. Alsberg and J.D. Day. A principle for resilient sharing of dis-
tributed resources. In Proceedings of the Second International Con-
ference on Software Engineering, pages 627--H644, October 1976.
Yair Amir, Danny Dolev, Shiomo Kramer, and Dalia Malki. Transis:
A communication sub-system for high availability. In The Twenty-
Second International Symposium on Fault-Tolerant Computing Sys-
tems, pages 76--H84. IEEE Computer Society Technical Committee on
Fault-Tolerant Computing, July 1992.
[AFM92] Susan M. Armstrong, Alan 0. Freier, and Keith Marzullo. Mufticast
transport protocol. Internet RFC 1301, February 1992.
[AS86] B. Alpern and F.B. Schneider. Defining liveness. Information Pro-
cessing Letters, 23:177--H180, November 1986.
[BAA+92] 0. Babaoglu, L. Alvisi, 5. Amoroso, R. Davoli, and L. A. Giachini.
Paralex: An environment for parallel programming in distributed sys-
tems. In Proceedings of the 6th ACM International Conference on
Supercomputing, pages 178--H187, July 1992.
[BarSil
J.F. Barlett. A N0n5t0pTM Kernel. In Proceedings of the Eighth
ACM Symposium on Operating System Principles, SIGOPS Operating
System Review, volume 15, pages 22--H29, December 1981.
[BBG+89] A. Borg, W. Blan, ?V. Graetsch, F. Herrmann, and W. Oberle. Fault
tolerance under UNIX. AC?? TOCS, 7(1):1--H24, February 1989.
[BEM91] Anupam Bhide, E.N. Elnozahy, and Stephen P. Morgan. A highly
available network file server. In Winter 1991 USENIX Conference,
pages 199--H205. USENIX Association, January 1991.
[BGT9O] Navin Budhiraja, Ajei Gopal, and Sam Toueg. Early stopping dis-
tributed bidding and applications. In Proceedinds of the Fourth Inter-
189
190
[BHG87i
[BJ87]
[B NI92]
[BMST91]
[BMST92a]
[BMST92b]
[BMST93j
[CASD85]
national Workshop on Distributed Algorithms, pages 304--H320, Septem-
ber 1990. Springer Verlag, Lecture Notes in Computer Science 486.
Goodman. Con-
Addison-?'esley,
Philip A. Bernstein, Vassos Hadzilacos, and Nathan
currency Control and Recovery in Database Systems.
1987.
Kenneth P. Birman and Thomas A. Joseph. Reliable communication
in the presence of failures. ACAl Transactions on Computer Systems,
5(1):47--H76, February 1987.
Navin Budhiraja and Keith Marzullo. Highly-available services using
the primary-backup approach. In Proceedings of the Second Work-
shop on the ?anagement of Replicated Data (?MRD-H), pages 47--H
50, Monterey, CA, November 1992.
Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg.
Lower bounds for primary backup implementations of bofo services. In
Proceedings of the OiVR Second Annual Workshop on Ultradependable
`lulticomputers and Electronic Systems, Washington D.C., pages 81--H
86, November 1991.
Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg.
Optimal primary-backup protocols. In Proceedings of the Sixth Inter-
national Workshop on Distributed Algorithms, pages 362--H378, Haifa,
Israel, November 1992.
Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg.
Primary-backup protocols: Lower bounds and optimal implementa-
tions. In Proceedings of the Third IFIP Working Conference on De-
pendable Computing for Critical Applications, pages 187--H198, Mon-
dello, Italy, September 1992.
Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg.
The primary--Hbackup approach. In Sape Mullender, editor, Distributed
Systems. ACM Press, 1993.
Flaviu Cristian, Houtan Aghili, H. Ray Strong, and Danny Dolev.
Atomic broadcast: From simple message diffusion to Byzantine agree-
ment. In Proceedings of the Fifteenth International Sympos?um on
Fault- Tolerant Computing, pages 200--H206, Ann Arbor, Michigan, June
1985. A revised version appears as IBM Technical Report RJ5244.
Flaviu Cristian, Bob Dancey, and Jon Dehn. Fault-tolerance in the
advanced automation system. In 20 Int. Symp on Fault- Tolerant Com-
puting Systems (FTCS 20), Invited Paper, pages 6--H17, June 1990.
[CDD90i
191
[CDS90]
[C xI84]
[Cri89]
[CT91]
[D DS87]
[DLS88]
Flaviu Cristian, Danny Dolev, and Ray Strong. New latency bounds
for atomic broadcast. Technical report, IBM Almaden Research Cen-
ter, 1990.
J. Chang and N. Maxemchuk. Reliable broadcast protocols. AC'?i
Transactions on Computer Systems, 2(3):251--H273, August 1984.
Flaviu Cristian. Probabilistic clock synchronization. Distributed
Computing, 3(3):146--H158, 1989.
Tushar D. Chandra and Sam Toueg. Unreliable failure detectors for
asynchronous systems. In Tenth AGNI Symposium on Principles of
Distributed Computing, pages 325--H340, Montreal, Canada, August
1991. ACM SIGOPS-SICACT.
Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. On the minimal
synchronism needed for distributed consensus. Journal of the ACM.
34(1):77--H97, January 1987.
Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in
the presence of partial synchrony. Journal of the ACM, 35(2):288--H323,
April 1988.
[DRS90] D. Dolev, R. Reischuk, and H.R. Strong. Early stopping in byzantine
agreement. JACM, 37:720--H741, 1990.
[Fis83]
[FLP85]
[GSTC90]
[GT91]
Michael J. Fischer. The consensus problem in unreliable distributed
systems (a brief survey). Technical Report DCS/RR-273, Department
of Computer Science, Yale University, June 1983.
Michael J. Fischer, Nancy A. Lynch, and Michael 5. Patterson. Im-
possibility of distributed consensus with one faulty process. Journal
of the ACM, 32(2):374--H382, April 1985.
Ajei Gopal, Ray Strong, Sam Toueg, and Flaviu Cristian. Early-
delivery atomic broadcast. In Proceedings of the Tenth AC? Sympo-
sium on Principles of Distributed Computing, pages 297--H310, Qeubec
City, Quebec, August 1990. ACM SIGOPS-SIGACT.
Ajei Gopal and Sam Toueg. Inconsistency and contamination. In
Tenth ACi'JSymposium on Principles of Distributed Computing, pages
257--H272, Montreal, Canada, August 1991. ACM SIGOPS-SIGACT.
Vassos Hadzilacos. Jssues of Fault Tolerance in Concurrent Computa-
tions. Ph.D. dissertation, Harvard University, June 1984. Department
of Computer Science Technical Report 11-84.
[Had84]
192
[Her9 1]
[NO87]
[Lam7S]
[Lam85i
[LF82]
[LGG+91i
[LL86]
[LM584]
Maurice P. Herlihy. Wait free synchronization. AC;J Transactions on
Programming Languages and Systems, l1(1):124--H149, January 1991.
Hermann Kopetz and Wilhelm Oclisenreiter. Clock synchronization
in distributed real-time systems. fEEE Transactions on Computers,
C-36( 8) :933--H940, August 1987.
Leslie Lamport. Time, Clocks, and the Ordering of Events in a Dis-
tributed System. Communications of the ACAl, 21(7):558--H565, July
1978.
Leslie Lamport. Basic Concepts. In Distributed Systems--H1?ethods and
Toots for Specification. Lecture Notes in Computer Science, Volume
193, SpringemVerlag, Berlin, 1985.
Leslie Lamport and Michael Fischer. Byzantine generals and transac-
tion commit protocols. Op. 62, SRI International, April 1982.
Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson,
and Michael Williams. Replication in the Harp file system. In Pro-
ceedings of the 13th Symposium on Operating System Principles, pages
226--H238, October 1991.
Barbara Liskov and Rivka Ladin. Highly-available distributed services
and fault-tolerant distributed garbage collection. In Proceedings of the
Fifth ACM Symposium on Principles of Distributed Computing, pages
29--H39, Calgary, Alberta, ?August 1986. ACM SIGOPS-SIGACT.
Leslie Lamport and Peter M. Melliar-Smith. Byzantine clock synchro-
nization. In Proceedings of the Third ACM Symposium on Principles
of Distributed Computing, pages 68--H74, Vancouver, British Columbia,
August 1984. ACM SIGOPS-SIGACT.
[LMS85j Leslie Lamport and P. NI. Melliar-Smith. Synchronizing clocks in the
presence of faults. Journal of the AC1?, 32(1):52--H78, January 1985.
[MHS89] Andy Hisgen, and Garret Swart. An algorithm for
Technical Report 46, Digital Systems Research Cen-
Timothy Mann,
data replication.
ter, 1989.
Keith Nlarzullo and Frank Schmuck. Supplying high availability with
a network file system. In Proceedings of the 8th International Confer-
ence on Distributed Computing Systems, pages 447--H453, May 1988.
[MS88]
193
[NT87]
[NTS8]
[OL88j
[PSL8O]
[PT861
[SB89j
[Sch80]
[Sch90j
[Sie92i
Gil Neiger and Sam Toueg. Substituting for real time and common
knowledge in asynchronous distributed systems. In Sixth AC?J Sym-
posium on Principles of Distributed Computing, pages 281--H293, Van-
couver, Canada, August 1987. ACM SIGOPS-SIGACT
Gil Neiger and Sam Toueg. Automatically increasing the fault-
tolerance of distributed systems. In Proceedings of the Seventh ACAL
Symposium on Principles of Distributed Computing, pages 248--H262,
Toronto, Ontario, August 1988. ACM SIGOPS-SIGACT.
B. Oki and Barbara Liskov. Viewstamped replication: A new primary
copy method to support highly available distributed systems. In Sev-
enth ACAl Symposium on Principles of Distributed Computing, pages
8--H17, Toronto, Ontario, August 1988. ACM SIGOPS-SIGACT.
M. Pease, R. Shostak, and Leslie Lamport. Reaching agreement in the
presence of faults. Journal of the ACiNI, 27(2):228--H234, April 1980.
Kenneth J. Perry and Sam Toueg. Distributed agreement in the pres-
ence of processor and communication faults. IEEE Transactions on
Software Engineering, 12(3):477--H482, March 1986.
N. A. Speirs and P. A. Barrett. Using passive replicates in delta-4
to provide dependable distributed computing. In The Nineteenth In-
ternational Symposium on Fault-Tolerant Computing Systems, pages
184--H190, June 1989.
Fred B. Schneider. Ensuring consistency on a distributed database
system by use of distributed semaphores. In Proceedings of the Inter-
national Symposium on Distributed Databases, pages 183--H189, Paris,
France, March 1980.
Fred B. Schneider. Implementing fault tolerant services using the
state machine approach: A tutorial. Computing Surveys, 22(4):299--H
319, December 1990.
Alexander Siegel. Performance in Flexible Distributed File Systems.
Ph.D. dissertation, Cornell University, Department of Computer Sci-
ence, 1992. Department of Computer Science TR 92-1266.
Richard D. Schlichting and Fred B. Schneider. Fail-stop processors:
an approach to designing fault-tolerant computing systems. ACVI
Transactions on Computer Systems, 1(3):222--H238, August 1983.
[5583]
194
J.H \Vensley, L Lamport, J. Goldberg, M.W Green, K.N. Levitt,
P.Ni. Melliar-Smith, R?.E. Shostak, and C.B. Weinstock. SIFT: Design
and analysis of a fault-tolerant coinputer for aircraft control. Proceed-
ings of fEEE, 66(lO):124O--H1255, October 1978.
[WLG+781
