BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR92-1265
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Primary-Backup Protocols: Lower Bounds and Optimal Implementations
AUTHOR:: Budhiraja, Navin
AUTHOR:: Marzullo, Keith 
AUTHOR:: Schneider, Fred B. 
AUTHOR:: Toueg, Sam 
DATE:: January 1992
PAGES:: 18
ABSTRACT::
We present a formal specification of primary-backup. We then prove lower 
bounds on the degree of replication, failover time and worst-case response 
time to client requests assuming different failure models. Finally, we 
outline primary-backup protocols and indicate which of our lower bounds are 
tight.

Keywords: Fault-tolerance, reliability, availability, primary-backup, 
lower bounds, optimal protocols.
END:: CORNELLCS//TR92-1265
BODY::
Primary-Backup Protocols: Lower Bounds and
Optimal Implementations
Navin Bu?dhiraja*
Keith Marzullo*
Fred B. Schneider**
Sam Toueg***
TR 92-1265
January1992
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
*Supported by Defense Advanced Research Projects Agency (DoD) under NASA
Ames grant number NAG 2-593 and by grants from IBM and Siemens. The views,
opinions and findings contained in this report are those of the authors and should not
be construed as an official Department of Defense position, policy or decision.
**Supported in part by the Office of Naval Research under contract N0001 4-91 -J-
1219, the National Science Foundation under Grant No. CCR-8701 103, DARPNNSF
Grant No. CCR-9014363 and by a grant from IBM Endicott Programming Laboratory.
***Supported in part by NSF grants CCR-8901780 and CCR-9102231 and by a
grant from IBM Endicott Programming Laboratory.
Primary--HBackup Protocols:
Lower Bounds and Optimal Implementations
Navin Budhiraja*
Keith Marzullo*
Fred B. Schneidert
Sam Touegt
Department of Computer Science
Cornell University
Ithaca NY 14853, USA
Abstract
We present a precise specification of the primary--Hbackup approach. Then, for a
variety of different failure models we prove lower bounds on the degree of replication,
failover time, and worst-case blocking time for client requests. Finally, we outline
primary--Hbackup protocols and indicate which of our lower bounds are tight.
Keywords: Fault-tolerance, reliability, availability, primary--Hbackup,lower bounds, optimal
protocols.
1 Introduction
One way to implement a fault-tolerant service is by using multiple servers that fail inde-
pendently. The state of the service is replicated and distributed among these servers, and
updates are coordinated so that even when a subset of servers fail, the service remains
available.
Such fault-tolerant services are generally structured in one of two ways. One approach is
to replicate the service state at all servers and to present all client requests, in the same order,
to all non-faulty servers. This service architecture is commonly called active replication or the
*Supported by Defense Advanced Research Projects Agency (DoD) under NASA Ames grant number
NAG 2--H593 and by grants from IBM, Siemens, and Xerox. Budhiraja is also supported by an IBM Graduate
Fellowship, The views, opinions, and findings contained in this report are those of the authors and should
not be construed as an official Department of Defense position, policy, or decision,
tSupported in part by the 0ffice of Naval Research under contract N00014-91-J-1219, the National Science
Foundation under Grant No. CCR-8701103, DARPA/NSF Grant No. CCR-9014363, and by a grant from
IBM Endicott Programming Laboratory.
?Supported in part by NSF grants CCR-890178o and CCR-9102231 and by a grant from IBM Endicott
Programming Laboratory.
state machine approach [22] and has been widely studied from both theoretical and practical
viewpoints (e.g., [9, 11,19]).
The other approach to building replicated services is to designate one server as the
primary and all the others as backups. Clients make requests by sending messages only
to the primary. If the primary fails, then a failover occurs and one of the backups takes
over. This service architecture is commonly called the primary-backup or the primary-copy
approach [1] and has been widely used in commercial fault-tolerant systems. However, the
approach has not been analyzed nearly as extensively as the state machine approach. Little
is known of its costs and tradeoffs, the degree of replication required, or the worst-case
response time for various failure models. In this paper, we derive some of these tradeoffs.
For example, in some primary--Hbackup protocols [15] the number of servers used is more than
twice the number of failures to be tolerated. We are now able to explain this phenomenon
by showing that the number of servers needed depends on the failure model.
With both active replication and the primary-backup approach, the goal is to provide a
client with the illusion of a service that is implemented by a single server, despite failures.
The key difference between active replication and the primary-backup is how each handles
failures. With active replication, the effects of failures are completely masked by voting,
and the service implemented is indistinguishable from a single non-faulty server. With the
primary-backup approach, a request to the service can be lost if it is sent to a faulty primary.'
Thus, clients can observe the effects of failures. However, the periods during which requests
can be lost are bounded by the length of time that can elapse between failure of the primary
and takeover by a backup. Such behavior is an instance of what we call bofo (bounded outage
finitely often).
To formulate the notion of a bofo server, define a server outage to occur at time t if some
client makes a server request at that time but never receives a response to that request.2 In
a (k, A)--Hbofo server, all server outages can be grouped into at most k intervals of time, with
each interval having length at most A. Accordingly, even though some requests made to a
bofo service (that is, a service that implements the abstraction of a bofo server) will be lost,
this number is limited. Note that if clients of a service are restricted to send requests only
to one server, then it is not possible to implement a specification that is stronger than bofo.
This is because if the client sends a request to a (single) server and that server subsequently
crashes, then the request can be lost and will not be processed.
This paper gives lower bounds for various costs associated with implementing a bofo
service by using the primary-backup approach. These lower bounds depend on message
delivery delay and the class of failures to be tolerated. These bounds characterize the degree
of replication, the time during which the service can be without a primary, and the amount
of time it can take to respond to client requests (blocking time). In some cases, our results
are surprising. For example, more than f + 1 servers are necessary to tolerate f failures of
certain types (crash and link failures, receive-omission failures, or general-omission failures).
Also, we have proved that if a majority of the servers can be faulty, then any primary--H
backup protocol for receive--Homission failures will have a run in which a non-faulty primary
is is forced to let a faulty server become the primary in its place. Finally, we outline some
1Of course, the client can subsequently resend a copy of that request to the new primary.
2For simplicity, we assume in this paper that every request elicits a response.
primary--Hbackup protocols. This allows us to determine which of our lower bounds are tight.
The paper is organized as follows. Section 2 gives a precise specification of the primary-
backup approach. Section 3 describes the system model we consider. Section 4 discusses
lower bounds, and in Section 5 we outline our protocols and state which of our bounds are
tight. We conclude in Section 6.
2 Specification of the Primary--HBackup Approach
Since we wish to derive lower bounds, we must first give a precise specification of primary-
backup that is general enough to satisfy any protocol one would characterize as being
primary-backup. The following four properties do this.
The first property states that no more than one server can be the primary at any time.
Pb1: There exists a local predicate Prmy? on the state of each server 5. At any time, there
is at most one server s whose state satisfies Prmy?.3
For brevity, whenever we say that "5 is the primary (at time t)" we mean that the state of s
satisfies Prmy? (at time t). We define the failover time of a service to be the longest period
of time during which Prmy8 is not true for any 5.
Property Pb2 distinguishes the primary-backup approach from active replication, where
each client broadcasts its request to all the servers.
Pb2: Each client i maintains a server identity Des4 such that to make a request, client i
sends a message (only) to Des4.
We assume that requests sent to a server 5 are enqueued in a message queue at 5.
Pb3: If a client request arrives at a server that is not the current primary, then that request
is not enqueued (and therefore is not processed).
Properties Pbl--HPb3 specify a protocol for client interactions with a service, but not the
obligations of the service. For example, the properties do not rule out a primary that ignores
all requests. A fourth property eliminates such trivial implementations by stipulating that
the service implements a single bofo server for some values of k and A:
Pb4: There exist fixed values k and A such that the service behaves like a single (k, A)--Hbofo
server.
We believe that the above four properties characterize a primary-backup approach and
have checked that many primary-backup protocols in the literature (e.g. [1, 3, 4, 7]) do
satisfy this characterization.
Note that Pb4 is not implementable if the number of failures (that is, the number of
servers and communication components that fail) can not be bounded a priorn. This is
because an unbounded number of servers would be required to implement the service. In
a practical system, one can implement service outages of bounded lengths by bounding the
rate of failures and allowing reintegration of recovered servers and communication links. We
do not address failure rates or reintegration in this paper.
3The protocol of [15] allows concurrent primaries, but only for bounded periods. If one replaces Pb1 by
this weaker property, then except for the bounds on failover times, the bounds shown in Section 4 continue
to hold.
Figure 1: A Simple Primary--HBackup Protocol.
2.1 A Simple Primary--HBackup Protocol
As an example of a service based on the primary--Hbackup approach, consider the following
protocol, which tolerates crash failures of a single server. Assume that all communication
is over point-to-point non-faulty links and that each link has an upper bound & on message
delivery time.4 Refer to Figure 1. There is a primary server P1 and a backup server P2
connected by a communications link. A client C initially sends all requests to Pi (indicated
by the arrow labeled 1 in the figure). Whenever Pi receives a request, it
o+ processes the request and updates its state accordingly,
o+ sends information about the update to P2 (message 2 in the figure),
o+ without waiting for an acknowledgement from P2, sends a response to the client (mes-
sage 3 in the figure).
The order in which these messages are sent is important because it guarantees that, given
our assumption about failures, if the client receives a response, then either P2 will eventually
receive message 2 or P2 will crash.
Server P2 updates its state upon receiving messages from Pi In addition, Pi sends dummy
messages to P2 every T seconds. If P2 does not receive such a message for T + 6 seconds,
then P2 becomes the primary. Once P2 has become the primary, it informs the clients (who
update their copies of Dest) and begins processing subsequent requests from the clients.
We now show how this protocol satisfies our characterization of a primary--Hbackup pro-
tocol. Property Pbl requires that there never be two primaries. This is satisfied by the
following definitions of Pruty:
Prmy?1 %?ef (Pi has not
Prmy?2 def (P2 has not
crashed)
received a message from Pi for T + 6)
Predicate P?my?1 A Prmy?2 is always false in a system executing our protocol, and hence
Pbl is satisfied. The failover time for this protocol is the longest interval during which
?Prmy?1 A ?Prmy?2 can hold, and it is T + 26 seconds. Property Pb2 follows trivially from
4To simplify exposition, we assume that the maximum message delay between the clients and the servers
is the same as the delay between the servers. However, our results can be easily extended to the case when
the delays are different.
the description of the protocol. Property Pb3 is true because requests are not sent to P2
until after Pi has failed. Finally, Pb4 requires that the protocol implements a single bofo
server for some values of k and A. Since Pi sends message 2 before message 3, it will never
be the case that Pi sends a response to the client and P2 does not get information about
that response from Pi In this protocol, there is at most one switch of the primary. So there
is at most one outage period i.e. k = 1. To compute A, it suffices to compute the longest
interval during which a client request may not elicit a response. Assume that Pi crashes at
time tc. Thus any client request sent to Pi at tc --H or later may be lost since Pi crashes
at tc. Furthermore, P2 may not learn about Pi'? crash until tc + T + 2?, and clients may
not learn that P2 is the primary for another ?. So, the total period during which a request
may not elicit a response is tc --H ? through tc + T + 36: the protocol implements a single
(1, T + 46)--Hbofb server.
3 The Model
We consider a system consisting of n servers and a set of clients. We assume that server clocks
are perfectly synchronized with real time.5 Clients and servers communicate by exchanging
messages through a completely connected point-to-point network.6 Each message sent is
enqueued in a queue maintained by the receiving process, and a process accesses its message
queue by executing a receive statement. We assume that links between processes are FIFO
(?.e. if Pi sends message m followed by m' to process Pi, then Pi will never receive m after
m') and there is a known constant 6 such that if processes Pi and Pi are connected by a
(non-faulty) link, then a message sent from Pi to Pi at time t will be enqueued in Pi'? queue
at of before t + 6.
We are interested in identifying the costs inherent in primary--Hbackup protocols, and so we
assume that it takes no time for a server to compute a response. Our theorems characterize
lower bounds; they are not invalidated by servers that require a substantial amount of time
to compute a response.
We model an execution of a system by a run, which is a sequence of timestamped events
involving clients, servers, and message queues. These events include sending messages, en-
queuing messages, receiving messages, and internal events that model computation at pro-
cesses. Two runs ai and a2 of the system are defined to be ind?stinguzshable to a process P if
the same sequence of events (with the same timestamps) occur at P in both ai and a2. We
assume that if two runs ai and a2 are indistinguishable to P and P has the same initial state
in both runs, then at any time t the state of P at tin ai is the same as the state of P at tin
a2. It is not hard to extend the definition of indistinguishability to handle nondeterministic
servers.
We assume that the clients can send any request at any time. If we impose restrictions
on the behavior of the clients, then we can derive protocols that violate the lower bounds in
this paper.
5Our protocols can be extended to the case where clocks are only approximately synchronized [14j.
6Another approach would be assume that servers are interconnected with redundant broadcast busses [2,
8]. We have not pursued this approach.
Define ? to be the potential causality relation [12] on server events e1 and e2. Thus ?
is the transitive closure of the following relation ?`?>: e1 e2 iff both e1 and e2 occur at
the same server 5 and e1 occurs before e2, or e1 is a send event and e2 is the corresponding
receive event. Informally, we say a request m is an update request if it changes the state of the
service in such a way that the responses to subsequent requests depend on m. More formally,
let m be a request with associated response r (and e(m) and e(r) be the events in the run
associated with the receipt of m and the sending of r respectively). Then m is an update
request if all request/response pairs m1Ir', where m' was sent after r, have e(m) ? e(r'). We
assume that update requests exist since otherwise the actions performed by the primary do
not have to be communicated to the backups.
We assume that failures occur independently from each other. We consider the following
hierarchy of failure models:
Crash failures: A server may fail by halting prematurely. Until it halts, it behaves correctly
Crash+Link failures: A server may crash or a link may lose messages (but links do not delay,
duplicate or corrupt messages).
Receive-Omission failures: A server may fail not only by crashing, but also by omitting to
receive some of the messages directed to it over a non-faulty link.
Send-Omission failures: A server may fail not only by crashing, but also by omitting to send
some of the messages over a non-faulty link [10].
General-Omission failures: A server may exhibit send-omission and receive-omission fail-
ures [20]).
Note that crash+link failures and the various types of omission failures are quite different.
Although all of these failure models concern loss of messages, each class of failures is dealt
with by a different masking technique. In particular, crashtlink failures can be masked
by adding redundant communication paths, while omission failures can only be masked by
adding redundant servers so that faulty processes can detect their own failure and halt. We
return to these masking techniques in Section 5.
Failures are counted by the number of failing components (either servers or links). We say
that a protocol tolerates f failures if it works correctly despite the failure of up to f faulty
components (note that each faulty component may fail many times during an execution).
4 Lower Bounds
For each failure model, we now give lower bounds for implementing a single (k, A)--Hbofo
server using the primary--Hbackup approach.
7The lower bounds we derive for crash failures also hold for fail-stop failures [21] except for the bound on
failover time. The lower bound on failover time depends on the maximum duration between when a server
p? fails and when fai1ed? becomes true.
4.1 Bounds on Replication
The first bound is obvious. However, to introduce our notation and the proof technique
that will be used later in the section, we give a formal proof of the theorem.
Theorem 1 Any primary--Hbackup protocol tolerating f crash failures requires n > f + 1
Proof: We prove the result by contradiction. Suppose there is a protocol P for n < f + 1.
Thus, P satisfies Pb4. Consider a run in which all n servers are crashed initially and clients
submit R > k [?/d] requests, where d is the minimum time between the sending of any
two requests (d > 0). By Pb4, at least one of these requests must elicit a response. This is
because the number of requests that cannot have responses must fall into at most k intervals
of length at most A, and each interval of A can contain at most [A/d] requests. However,
such a response is impossible since, by assumption, all servers have crashed. c
The following lemma is used in the rest of the theorems in this section.
Lemma 4.1 Consider any protocol that satisfies Pb4. Suppose two disjoint and nonempty
sets of servers A and B can be found that meet the following three properties:
1. There exists a run ?a containing R> 2k [A/d] requests where d is the minimum time
between the sending of any two client requests (d > 0). Furthermore, in this run the
servers in A do not crash and all other servers crash at time 0.
2. There exists a run j6 containing R requests. Furthermore, in this run the servers in
B do not crash and all other servers crash at time 0.
3.
There exists a run ?ab containing R requests. Furthermore, the servers in A and B do
not crash, ?ab is indistinguishable from aa to all servers in A, and aab is indistinguish-
able from ab to all servers in B.
At least one of the above runs violates Pb2.
Suppose for contradiction that the lemma is false and runs aa, ab and aab all
Proof:
satisfy Pb2.
For aa, by Pb4 at least R --H k [A/d] of the requests must have been received by servers in
A. Similarly, for ab, at least R --H k [A/d] of the requests must have been received by servers
in B. Finally, since aab is indistinguishable from aa to servers in A, they must execute the
same number of receive events in both runs. The same holds for the servers in B. By Pb2,
each request is sent to at most one server and so at least 2(R --H k [A/d]) requests must have
been sent in ?ab. Since only R requests were sent, we must have R > 2(R --H k[A/d]), or
R < 2k[A/d], which contradicts the assumption that R> 2k[A/d].
Theorems 2 and 3 depend on two parameters of primary--Hbackup protocols. Let F be the
maximum time that can elapse between any two successive client requests (possibly from
different clients), and let D be a duration such that if some server 5 becomes the primary at
time to and remains the primary through time t > to + D when a client c1 sends a request,
then Des4 = 8 at time t. Hence, D is the minimum delay until all clients know the identity
of a new primary. For simplicity of notation, we write D < F to mean that D is bounded
and F is either unbounded or bounded and greater than D. Note that when D < F the
service must be able to detect the failure of a primary and disseminate the new primary's
identity to the clients without using any messages from clients.
With both send-omission failures and crash+link failures, messages may fail to reach
their destinations. The following theorem shows that crash+link failures are more expensive
to tolerate, as they require more replication.
Theorem 2 Suppose there is at most one link between any two servers. Then any primary--H
backup protocol tolerating f crash+link failures and having D < F requires n > f + 2.
Proof: For contradiction, assume the existence of a protocol P with n ? f + 2. We
will show that P has three runs aa, ob and ?ab that satisfy the conditions of Lemma 4.1.
From the lemma, at least one of these runs violates Pb2, which implies that P cannot be a
primary--Hbackup protocol.
Let A be a set containing the one server 5a and let B be the set of remaining servers.
Since IA = 1 and RI = n --H 1 ? f, A and B can become disconnected by link failures.
We first construct the run ?ab in which no server crashes, postulating that the links
between the servers in A and B are faulty and do not deliver any messages. As required
by Lemma 4.1, clients send a total of R > 2k[A/d] requests. Let 0 ? d < F --H D be the
minimum interval between any two such requests. We postulate that a request will be sent
at time t iff no request has been sent during the interval [t --H d. .t) and one of the following
rules hold.
1. A server s is the primary during the interval [t --H D . .tJ. This request arrives immediately
and is enqueued (at s, by Pb3 and the definition of D).
2. There is no primary at time t. This request arrives immediately and by Pb3 will never
be enqueued at any server.
3.
A server 5 is the primary at time t but another server 5' is the primary immediately
after time t. If this request is sent to s, then it arrives after t, and if it is sent to any
other server, then it arrives immediately. In both cases, it arrives at a server that is
not the primary, and so will not be enqueued (again by Pb3).
Note that, by construction, the maximum interval between any two client requests is
D + d. This interval occurs when a server s becomes the primary just before d after a client
message is sent, and s remains the primary for at least D. Hence, the client will be able to
send R requests within time R(D + d). This completes the construction of ?ab
We now construct aa and a6, recalling that in aa all of the servers except ?a crash at time
0, and in ab server 5a crashes at time 0. The clients send the same requests and at the same
times in aa and in a6 as in aab. Furthermore, by construction these requests will arrive at
the servers according to the same rules used in constructing aa6. Of course, a client request
may not be delivered to the same servers in aa or a6 as in aab, since different servers are
operational in these runs.
Since 5a does not receive any messages from servers in B in either Crab or Cra, these two
runs are indistinguishable to 5a as long as it receives the same client requests at the same
times in both runs. We show that this is the case by contradiction: let t be the earliest time
that ?a can distinguish between these two runs.
Thus, at time t either 5a received a request m in Crab but not in Cra or it received a request
m in Cra but not in Crab. We will assume the former; the proof for the latter is similar. The
request m must have been enqueued at some time t' < t at 5a in Crab. Since m was received
by ?a, m must have been sent by rule 1. By rule 1, 5a must have been the primary through
--H D..t'] in Crab and therefore, by indistinguishability, in Cra as well. By the definition of D,
m would have been enqueued at ?a at time t'in Cra as well.
Since 5a cannot distinguish between the runs before t, ?a cannot receive m before tin Cra,
and Sa must execute a receive in both Cra and Crab at time t. So, it must be the case that s?
receives another request m' $ m at time tin Cra. Assume that m' was enqueued at time t".
By an indistinguishability argument similar to above, m must be enqueued at time t" at Sa
in Crab as well. Therefore, if s received m' in Cra at time t, it must receive m in Crab as well, a
contradiction.
A similar argument can be used to show that the servers in B receive the same requests
in ab and Crab, and so these two runs are indistinguishable to the servers in B. Thus, by
Lemma 4.1 P cannot be a primary--Hbackup protocol.
The assumption in this theorem that D ? F is significant. As we discuss in Section 5, when
D > F protocols that tolerate f crash+link failures can be constructed that use only f t 1
servers.
The next theorem states that additional replication is required in order to tolerate receive-
omission failures. The proof is similar to that of Theorem 2, and so it is omitted.
Theorem 3 Any primary--Hbackup protocol tolerating f receive-omission failures and having
D < F requires n>
The next lower bound holds independent of the relation between D and F
Theorem 4 Any primary--Hbackup protocol tolerating f general-omission failures requires
n> 2f.
Proof: Assume for contradiction that there is a protocol for n < 2f. Partition the servers
into two disjoint sets A and B of size at most f each. We will construct two runs Cr1 and Cr2.
In each run, one set of servers will be faulty and the other set will be non-faulty.
Cri: The servers in A are faulty and fail to communicate with all servers in B, but behave
correctly otherwise. Clients send update requests until the first response is sent (this must
happen, by Pb4). Assume that the first response r to an update request m is sent at time
t. Say that this response is sent by server 5.
Cr2: The same as Cr1 up to time t, but if s is in B, then in Cr2 it is the servers in I? that
are faulty and fail to communicate with all servers in A rather than the servers in A that are
faulty. In either case, r is sent by a faulty server. Furthermore, no server can distinguish Cr1
from Cr2 through time t and therefore, the first response r is sent at time tin Cr2 as well.
Let all of the faulty servers in a2 crash immediately after r is sent and have clients con-
tinue to send requests until another response r'is sent. This response must have been sent
by a non-faulty server which implies that m(e(m) ? e(r')). However this violates the fact
that m is an update request.
4.2 Bounds on Blocking
Informally, a blocking primary-backup protocol is one in which the primary must, after
receiving a request m, either receive a message from another server or simply wait an interval
before it can respond to m. Consider a failure-free run of a primary-backup protocol that
is handling a request. Let the time that the request is received be tin and the time that
the response is sent be tr. We say that this protocol is C--Hblocking if it is guaranteed that
tr --H tin < C holds. For example, any primary-backup protocol in which the primary sends
information about a request to the backups and waits for acknowledgement before sending
the response to the client will be at least 26--Hblocking.
As shown in Section 5, 0--Hblocking primary-backup protocols can be built for crash and
crash+link failure models. For servers that take no time to compute the response to a
request, the simple protocol tolerating crash failures presented in Section 2 is 0--Hblocking.
We call such protocols nonbiocking because the primary can send a reply to the client as soon
as the reply has been computed. Nonblocking protocols tolerating receive-omission failures
also exist as long as n > 2f, but there is can be no nonblocking primary--Hbackup protocol
tolerating send-omission or general--Homission failures.
Theorem 5 Any prnmary--Hbackup protocol tolerating f receive-omission failures with f > 1,
n < 2f and D ? F is C-blocking for some C > 26.
Proof: For contradiction, suppose there is a primary--Hbackup protocol for n < 2f and
f > 1 that is C--Hblocking where C < 26. Partition the servers into two sets A and B where
Al = f and jBj = n --H f < f. We construct three runs. In all three runs, assume that all
server messages take 6 to arrive.
a1: There are no failures and all client requests take 6 to arrive. Moreover, clients send
update requests until some update request m evokes a response r. Let m be received at time
tin by server p ? A and r be sent at time tr by a server q c A (q could be the same as
Notice that since the protocol is C--Hblocking where C < 26, tr --H tin ? 26. Also since, by
construction, all requests take 6 to arrive, all client requests sent after time tin t 6 will be
received after time tr.
a2: Identical to oi until p receives m at time tin. At this point in a2, all servers in A are
assumed to crash and clients are assumed to send no request during the interval [tin + 6. .trj.
Finally, after time tr clients are assumed to repeatedly send requests at intervals of at least
d where 0 ? d < F --H D as follows. A request is sent at time t if no request has been sent in
--H d. .t) and one of the following rules hold.
1. A server 5 ? B is the primary during the interval [t --H D..t]. This request arrives
immediately and is enquened (at 5, by Pb3 and the definition of D).
2. There is no primary in B at time t. This request arrives immediately by Pb3 will never
be enqueued at any server.
3.
A server 5 E B is the primary at time t but another server 5' E B is the primary
immediately after time t. If the request is sent to 8, then it arrives after t, and if it
is sent to any other server it arrives immediately. In both cases, it arrives at a server
that is no the primary, and so will not be enqueued (again, by Pb3).
Notice that eventually, there will be a response (say r') in a2 because the protocol satisfies
Pb4, and by construction it must be from a request sent by rule 1.
a3: The same as a2, except that the servers in A do not crash at time trn. Instead, the
servers in B commit receive failures on all messages sent after tm by servers in A. Clients
send requests at the same times as in a2 which arrive using the same rules as a2.
Now, consider these three runs. By construction, the runs are identical up to time t,n.
Since all server messages take 6 to arrive, clients cannot distinguish a1 and a3 through t? + 6,
and so clients send the same requests to the same servers in both a1 and a3. Similarly, since
all server messages take 6 to arrive, the servers in B cannot distinguish between a1 and a3
through tm + 6. Therefore, since tr --H tm < 26, p (the server that received request m at time
tm in a1) and q (the server that sent response r at time tr in a1) cannot distinguish between
a1 and a3 through time tr, and so q sends response r in a3 as well. Then, using an argument
similar to the one in Theorem 2, servers in B cannot distinguish a2 and a3, and so response
r' also occurs in a3. However, ?(e(m) ? e(r')) which violates the assumption that m is an
update request.
In run a3 of the above proof, a correct primary (pin set A) becomes the backup, while a
faulty server from set B becomes the primary in p's place. It is always possible to construct
such a run. This is a disconcerting property: there does not exist a primary--Hbackup protocol
that tolerates receive-omission failures with n < 2f in which a primary cedes only when it
fails. Moreover, this lower bound is tight in [6], we give a receive-omission primary--Hbackup
protocol with n = 2f + 1 in which a primary cedes only when it fails.
And, if f = 1, then the following theorem holds: its proof is similar to the proof of
Theorem 5 (and is therefore omitted), except that p =
Theorem 6 Any prnmary--Hbackup protocol tolerating receive-omission failures with f = 1
and n < 2f and having D <F is C-blocking for some C> 6.
Primary--Hbackup protocols tolerating send-omission or general--Homission failures exhibit
the same blocking properties as those tolerating receive-omission failures, except that the
restriction D < I, is no longer necessary. Here we prove just the results for send--Homission
failures. The results for general--Homission failures then follow.
Theorem 7 Any primary--Hbackup protocol tolerating f send-omission failures with f > 1 is
C--Hblocking for some C > 26
Proof: For contradiction, suppose there is a primary--Hbackup protocol that is C--Hblocking
where C ? 26. We consider the following two runs in which all server messages take 6 to
arrive.
a1: There are no failures and all client requests take 6 to arrive. Moreover, clients send
update requests until some update request m evokes a response r. Let m be received at
time tm by server p and r be sent at time tr by a server q (again q could be p). Notice that
since the protocol is C--Hblocking where C ? 26, tr --H tm ? 26. Also, since by construction all
requests take 6 to arrive, all client requests sent after time tm t 6 will be received after time
tr.
a2: Identical to al through tm. After trn, P and q fail and omit to send all messages to all
servers except each other. Since, by construction, all messages take 6 to arrive, servers and
clients cannot distinguish between a1 and a2 through t? t 6 and, as a result, p and q cannot
distinguish the two runs through tm t 26. Therefore, since tr --H tm < 26, q sends the response
r at time tr in a2 as well. Now let p and q crash at time tr and the clients send requests
after time tr. By Pb4, there eventually must be some request m' that results in a response
r'. However, ?(e(m) ? e(r')), which violates the assumption that m us an update request. E
Again, if f = 1, then the following theorem can be proved using a proof similar to
Theorem 7, except that p =
Theorem 8 Any primary--Hbackup protocol tolerating send-omission failures with f = 1 is
C--Hblocking for some C > 6.
4.3 Bounds on Failover Times
Recall that the failover time is the longest interval during which Prmy8 is not true for any
server 5. In this section, we give lower bounds for failover times.
In order to discuss these bounds, we postulate a fifth property of primary--Hbackup pro-
tocols.
Pb5: A correct server that is the primary remains so until there is a failure of some server
or link.
This is a reasonable expectation and it is valid for all protocols that we have found in the
literature.
Theorem 9 Any primary--Hbackup protocol tolerating f crash failures must have a failover
time of at least f6.
Proof: The proof is by induction on f.
Base case f = 0: trivially true since a failover time cannot be smaller than zero.
Induction case f > 0: Suppose the theorem holds for at most f --H 1 failures, but (for
a proof by contradiction) there is a protocol P for which the theorem is false when there are
f failures. From the induction hypothesis, there is a run a with at most f --H 1 failures and
an interval [t0..t1] at least (f --H 1)6 during which there is no primary. Let pi be the server
that becomes the primary at t1. Consider the two runs a1 and a2 that extend a as follows:
a1: Assume Pi crashes at time ti. By assumption, there exists a new primary (say P2) at
time t2 ? t1 + ?. Since Pi crashes at time ti, P2 does not receive any messages from Pi that
were sent after time t1.
a2: Assume that Pi is correct, there are no other crashes at or after t1 and all messages
sent by Pi after time t1 take 6 to arrive.
Since P2 cannot distinguish cr1 from cr2 through time t2, P2 becomes the primary at time
t2 in cr2. By Pb5, however, Pi remains the primary at time t2 in cr2. This violates Pb1, and
so P is not a primary--Hbackup protocol.
Failover times for all other failure models have a larger lower bound:
Theorem 10 Any prnmary--Hbackup protocol toThrating f crash+link failures has a failover
time of at least 2f6.
Proof: The proof is by induction on
Base case f = 0: trivially true.
Induction case f > 0: Suppose the theorem holds for at most f --H 1 failures, but (for
a proof by contradiction) there is a protocol P for which the theorem is false when there are
f failures.
From the induction hypothesis, there is a run a with at most f --H1 failures and an interval
[t0..t1] at least 2(f --H 1)6 during which there is no primary. Let Pi be the server that becomes
the primary at ti. Consider the three runs a1, cr2 and, cr3 that extend cr as follows:
a1: Assume that Pi crashes at time ti and all messages sent after t1 take 6 to arrive.
Furthermore, the crash of Pi is the only failure at or after ti. By assumption, there exists a
new primary (say P2) at time t2 ? t1 + 26. Since Pi crashes at time ti, P2 does not receive
any messages from Pi that were sent after time t1. Furthermore, since all messages take 6 to
arrive, any message that was sent after t1 + 6 can be received by P2 only after time t2.
a2: Assume that Pi is correct, there are no other failures at or after ti, and all messages
sent after time t1 take 6 to arrive. Since there are no failures at or after time ti, by Pb5 Pi
continues to be the primary through time t2.
a3: The same as cr2 except that the link between Pi and P2 is faulty and does not deliver
any message sent by Pi to P2 after time t1.
By construction, P2 cannot distinguish a1 from cr3 through time t2, and 50 P2 becomes
the primary at time t2 in cr3. Similarly, Pi cannot distinguish cr2 from a3 through time t2
and 50 Pi remains the primary until time t2 in cr3. This violates Pb1, and so P is not a
primary-backupprotocol.			E]
We omit the proofs of the following two theorems because they are similar to Theorem 9.
Theorem 11 Any primary--Hbackup Protocol tolerating f receive?omissionfailures has a failover
time of at least 2f6.
Theorem 12 Any primary--Hbackup protocol tolerating f send-omission failures has a failover
time of at least 2f6.
5 Outline of the Protocols
In order to establish that the bounds given above are tight, we have developed primary--H
backup protocols for the different failure models [6]. In this section, we outline these protocols
and discuss which of our lower bounds are tight.
Our protocol for crash failures is similar to the protocol given in Section 2. Whenever the
primary receives a request from the client, it processes that request and sends information
about state updates to the backups before sending a response to the client. All servers
periodically send messages to each other in order to detect server failures. This protocol
uses (f + 1) servers and so the lower bound in Theorem 1 is tight. Furthermore, it is
nonblocking and so incurs no additional delay. It has the failover time f? + T for arbitrarily
small and positive T, and so the lower bound in Theorem 9 is tight.
In order for the protocol to tolerate crash+link failures, we add an additional server. By
Theorem 2, this server is necessary. The additional server ensures that there is always at
least one non-faulty path between any two correct servers, where a path contains zero or
more intermediate servers. The protocol for crash failures outlined above is now modified so
that a primary ensures any message sent to a backup is sent across at least one non-faulty
path. This protocol uses (f + 2) servers and so Theorem 2 is tight. Furthermore, it is
nonblocking and so incurs no additional delay. It has the failover time 2f? + T for arbitrarily
small and positive T, and so Theorem 10 is tight.
Most of our protocols for the various kinds of omission failures apply translation tech-
niques [17] to the protocol for crash failures outlined above. These techniques ensure that
a faulty server detects its own failure and halts, thereby translating a more severe failure
to a crash failure. The translations of [17] assume a round-based protocol. Since our crash
failure protocol is not round-based, we must modify the translations so that a server can
send and receive messages at any time rather than just at the beginning or the end of a
round, All of these resulting omission--Hfailure protocols have failover time 2f6 + T, and thus
Theorems 11 and 12 are tight. The protocol for send-omission failures uses f + 1 servers and
is 2? + ?--Hblocking. Furthermore, we also have a send-omission protocol for f = 1 that is
?--Hblocking. Thus, Theorems 7, 8 and 12 are tight. The protocol for general-omission failures
also uses 2f + 1 servers and is 26 + ?--Hblocking, and so Theorem 4 is tight, and Theorems 7
and 12 are tight for general-omission failures as well.
We have not been able to determine whether Theorems 3 and 5 are tight. Our protocol
for receive-omission failures uses 2f + 1 servers whereas the lower bound in Theorem 3 only
requires n > [?j. We have constructed protocols for n = 2, f = 1 and n = 4, f = 2 but
are unable to generalize these protocols. We can also show that any protocol for n < 2f has
the following odd property: there is at least one run of the protocol in which a non-faulty
primary is forced to relinquish control to a backup that is faulty. However, the protocol for
n = 2, f = 1 is 6--Hblocking and so Theorem 6 is tight.
Table 1 summarizes all of our results.
failure			degree of			amount of			failover
model			replication			blocking			time
crash			n> f			0
crash+link			n> f t 1			0			2f?
receive			__ *t			? ifn<2fandf=1
omission			n>			2? ffn<2fandf>1*t			2f6
general			? if f=1
omissionnot n > 2f			2? iff> 1			2f?
* Bound known to be tight.
D<l'
6 Discussion
Table 1: Lower Bounds.
We give a precise characterization for primary--Hbackup protocols in a system with synchro-
nized clocks and bounded message delays. We then present lower bounds on the degree of
replication, the blocking time, and the failover time under various kinds of server and link
failures. We finally outline a set of primary--Hbackup protocols that show which of our lower
bounds are tight.
We now briefly compare our results to existing primary--Hbackup protocols. The protocol
presented in [3] tolerates one crash+link failure by using only two servers. This appears to
contradict Theorem 2 which states that at least three servers are required to tolerate one
failure. However, the protocol in [3] assumes that there are two links between the two servers,
effectively masking a single link failure. Hence, only crash failures need to be tolerated, and
this can be accomplished using only two servers (Theorem 1).
A more ambitious primary--Hbackup protocol is presented in [15]. This protocol works for
the following failure model (quoted from [15]):
The network may lose or duplicate messages, or deliver them late or out of order;
in addition it may partition so that some nodes are temporarily unable to send
messages to some other nodes. As is usual in distributed systems, we assume
the nodes are fail-stop processors and the network delivers only uncorrupted
messages.
This failure model is incomparable with those in the hierarchy we presented. However, the
protocol does tolerate general-omission failures and has optimal degree of replication for
general-omission failures as it uses 2f + 1 servers.
In Theorem 2, we assumed that D < F. This assumption is crucial: we are able to
construct a two-server primary--Hbackup protocol tolerating one crash+link failure for which
D > F. Recall that link failures are masked by adding redundant paths between the servers.
Our two-server craslitlink protocol essentially uses the path from the primary to the backup
through the client as the redundant path. Thus, there appears to be a tradeoff between the
degree of replication and the time it takes for a client to learn that there is a new primary.
The lower bounds on failover times given in Section 4.3 assume Pb5. This is necessary as
we have constructed protocols that have failover times smaller than the lower bounds given
in Section 4.3 and these protocols do not satisfy Pb5. This smaller failover time is achieved
at a cost of an increased variance in service response time.
Finally, in this paper we have attempted to give a characterization of primary--Hbackup
that is broad enough to include most synchronous protocols that are considered to be in-
stances of the approach. There are protocols, however, that are incomparable to the class of
protocols we analyze [5,16,18] since they were developed for an asynchronous setting. Such
protocols cannot be cast in terms of implementing a (k, A)--Hbofo server for finite values of k
and A. We are currently studying possible characterizations for a primary--Hbackup protocol
in an asynchronous system and hope to extend our results to this setting.
Acknowledgements
We would like to thank Lorenzo Alvisi, Mike Reiter and the anonymous conference referees
for their helpful comments on earlier drafts of this paper.
References
[1] P.A. Alsberg and J.D. Day. A principle for resilient sharing of distributed resources.
In Proceedings of the Second International Conference on Software Engineering, pages
627--H644, October 1976.
[2] O?zalp Babaoglu and Rogeflo Drummond. Streets of Byzantium: Network architectures
for fast reliable broadcasts. IEEE Transactions on Software Engineering, 11(6):546--H554,
June 1985.
[3]
J.F. Barlett. A nonstop kernel. In Proceedings of the Eighth ACM Symposium on
Operating System Principles, SIGOPS Operating System Review, volume 15, pages 22--H
29, December 1981.
[4] Anupam Bhide, E.N. Elnozahy, and Stephen P. Morgan. A highly available network file
server. In USENIX, pages 199--H205,1991.
[5]
[6]
Kenneth P. Birman and Thomas A. Joseph. Exploiting virtual synchrony in distributed
systems. In Eleventh ACM Symposium on Operating System Principles, pages 123--H138,
November 1987.
Navin Budhiraja, Keith Marzullo, Fred Schneider, and Sam Toueg. Optimal primary--H
backup protocols. In Proceedings of the Sixth International Workshop on Distributed
Algorithms, Haifa, Israel, November 1992. To Appear.
[7] IBM International Technical Support Centers. IBM/VS extended recovery facility
(XRF) technical reference. Technical Report GG24-3153-0, IBM, 1987
[8] Flaviu Cristian. Synchronous atomic broadcast for redundant broadcast channels. Jour-
nal of Real-Time Systems, 2:195--H212, September 1990.
[9] Flaviu Cristian, Houtan Aghili, H. Ray Strong, and Danny Dolev. Atomic broadcast:
From simple message diffusion to Byzantine agreement. In Proceedings of the Fifteenth
International Symposium on Fault-Tolerant Computing, pages 200--H206, Ann Arbor,
Michigan, June 1985. A revised version appears as IBM Technical Report RJ5244.
[10]
Vassos Hadzilacos. Issues of Fault Tolerance in Concurrent Computations. PhD thesis,
Harvard University, June 1984. Department of Computer Science Technical Report
11-84.
[11] Thomas Joseph and Kenneth Birman. Reliable Broadcast Protocols, pages 294--H318.
ACM Press, New York, 1989.
[12] Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System.
Communications of the ACM, 21(7):558--H565, July 1978.
[13] Leslie Lamport and Michael Fischer. Byzantine generals and transaction commit pro-
tocols. Op. 62, SRI International, April 1982.
[14] Leslie Lamport and P. M. Melliar-Smith. Synchronizing clocks in the presence of faults.
Journal of the ACM? 32(1):52--H78, January 1985.
[15] Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, and Michael
Williams. Replication in the Harp file system. In Proceedings of the 13th Symposium
on Operating System Principles, pages 226--H238, 1991.
[16] Timothy Mann, Andy Hisgen, and Garret Swart. An algorithm for data replication.
Technical Report 46, Digital Systems Research Center, 1989.
[17] Gil Neiger and Sam Toueg. Automatically increasing the fault-tolerance of distributed
systems. In Proceedings of the Seventh ACM Symposium on Principles of Distributed
Computing, pages 248--H262, Toronto, Ontario, August 1988. ACM SIGOPS-SIGACT.
[18] B. Oki and Barbara Liskov. Viewstamped replication: A new primary copy method to
support highly available distributed systems. In Seventh ACM Symposium on Principles
of Distributed Computing, pages 8--H17, Toronto, Ontario, August 1988. ACM SIGOPS-
SIGACT.
[19] M. Pease, R. Shostak, and Leslie Lamport. Reaching agreement in the presence of
faults. Journal of the ACM, 27(2):228--H234, April 1980.
[20] Kenneth J. Perry and Sam Toueg. Distributed agreement in the presence of processor
and communication faults. IEEE Transactions on Software Engineering, 12(3):477--H482,
March 1986.
[21]
Richard D. Schlichting and Fred B. Schneider. Fail-stop processors: an approach to
designing fault-tolerant computing systems. ACM Thansactions on Computer Systems,
1(3):222--H238, August 1983.
[22] Fred B. Schneider. Implementing fault tolerant services using the state machine ap-
proach: A tutorial. Computing Surveys, 22(4):299--H319, December 1990.
