BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR94-1417
ENTRY:: 1994-04-19
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Refinement for Fault-Tolerance: An Aircraft Hand-off Protocol
AUTHOR:: Marzullo, Keith
AUTHOR:: Schneider, Fred B. 
AUTHOR:: Dehn, Jon
DATE:: April 1994
PAGES:: 18
ABSTRACT::
Part of the Advanced Automation System (AAS) for air-traffic control is a 
protocol to permit flight hand-off from one air-traffic controller to another. 
The protocol must be fault-tolerant and, therefore, is subtle--an ideal 
candidate for the application of formal methods. This paper describes a formal 
method for deriving fault-tolerant protocols that is based on refinement and 
proof outlines. The AAS hand-off protocol was actually derived using this 
method; that derivation is given.
END:: CORNELLCS//TR94-1417
BODY::
Refinement for Fault-Tolerance:
An Aircraft Hand-off Protocol
Keith Marzullo*
Fred B. Schneider**
Jon Dehn
TR 94-1417
April 1994
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
* This author is supported in part by the Defense Advanced Research Projects
Agency under NASA Ames grant number NAG 2-593, Contract N00140-87-C-8904
and by AFOSR grant number F496209310242. The views, opinions, and findings
contained in this report are those of the author and should not be construed as an
official Department of Defense position, policy, or decision.
** This author is supported in part by the Office of Naval Research under contract
N00014-91-J-1219, AFOSR under proposal 93NM312, the National Science
Foundation under Grant CCR-8701 103, and DARPNNSF Grant CCR-9014363. Any
opinions, findings, and conclusions or recommendations expressed in this
publication are those of the author and do not reflect the views of these agencies.
Refinement for Fault-Tolerance:
An Aircraft Hand-off Protocol
Keith Marzullo*
University of California at San Diego
Dept. of Computer Science & Engineering
La Jolla, CA 92093-0114
Fred B. Schneidert
Cornell University
Dept. of Computer Science
Ithaca, NY 14853-7501
30 January 1993
Abstract
Jon Dehn
IBM Corp.
9231 Corporate Blvd.
Rockville, MD 20850
Part of the Advanced Automation System (AAS) for air-traffic con-
trol is a protocol to permit flight hand-off from one air-traffic con-
troller to another. The protocol must be fault-tolerant and, therefore,
is subtle--Han ideal candidate for the application of formal methods.
This paper describes a formal method for deriving fault-tolerant proto-
cols that is based on refinement and proof outlines. The AAS hand-off
protocol was actually derived using this method; that derivation is
given.
1 Introdution
The next-generation air traffic control system for the United States is air-
rently being built under contract to the U.S. government by the IBM Federal
Systems Company (recently acquired by Loral Corp.). Advanced Automation
`This author is supported in part by the Defense Advanced Research Projects Agency
under NASA Ames grant number NAG 2-593, Contract N00140-87-C-8904 and by AFOSR
grant number F496209310242. The views, opinions, and findings contained in this report
are those of the author and should not be construed as an official Department of Defense
position, policy, or decision.
tThis author is supported in part by the office of Naval Research under contract N00014-
9?4421% AFGSR under proposal 93NM312, the National Science Foundation under Grant
CCR?7O1?O3, and DARPA/NSF Grant CCR-9014363. Any opinions, findings, and conclu-
sions or recommendations expressed in this publication are those of the author and do not
reflect the views of these agencies.
System (AAS) [1] is a large distributed system that must function correctly,
even if hardware components fail.
Design errors in AAS software are avoided and eliminated by a host
of methods. This paper discusses one of them--Hthe formal derivation of a
protocol from its specification--Hand how it was applied in the AAS protocol
for transferring authority to control a flight from one air-traffic controller to
another. The flight hand-off protocol we describe is the one actually used
in the production AAS system (although the protocol there is programmed
in Ada). And, the derivation we give is a description of how the protocol
actually was first obtained.
The formal methods we use are not particularly esoteric nor sophisti-
cated. The specification of the problem is simple, as is the characterization
of hardware failures that it must tolerate. Because the hand-off protocol is
short, computer-aided support was not necessary for the derivation. De-
riving more complex protocols would certainly benefit from access to a
theorem prover.
We proceed as follows. The next section gives a specification of the
problem and the assumptions being made about the system. Section 3
describes the formal method we used. Finally, Section 4 contains our
derivation of the hand-off protocol.
2 Specification and System Model
The air-traffic controller in charge of a flight at any time is determined by
the location of the flight at that time. However, the hand-off of the flight
from one controller to another is not automatic: some controller must issue
a command requesting that the ownership of a flight be transferred from
its current owner to a new controller. This message is sent to a process that
is executing on behalf of the new controller. It is this process that starts the
execution of the hand-off itself.
The hand-off protocol has the following requirements:
Pi: No two controllers own the same flight at the same time.
P2: The interval during which no controller owns a flight is brief (approx-
imately one second).
F3: A controller that does not own a flight knows which controller does
own that flight.
2
The hand-off protocol is implemented on top of AAS system software
that implements several strong properties about message delivery and exe-
cution time [1]. For our purposes1 we simplify the system model somewhat
and mention only those properties needed by our hand-off protocol.
The system is structured as a set of processes running on a collec-
tion of processors interconnected with redundant networks. The services
provided by AAS system software include a point-to-point FIFO interpro-
cess communication facility and a name service that allows for location-
independent interprocess communication. AAS also supports the notion
of a resilient process s comprising a pflmaflj process s.p and a backup process
s.b. The primary sends messages to the backup so that the backup's state
stays consistent with the primary. This allows the backup to take over if
the primary fails.
A resilient process is used to implement the services needed by an
air-traffic controller, including screen management1 display of radar in-
formation, and processing of flight information. We denote the primary
process for a controller C as C.p and its backup process as C.b. if C is the
owner of a flight f, then C.1) can execute commands and send messages
that affect the status of flight f; C.b, like all backup processes in AAS, only
receives and records information from C.p in order to take over if c.p fails.
AAS implements a simple failure model for processes [3j:
Si: Processes can fail by crashing. A crashed process simply stops exe-
cuting without otherwise taking any erroneous action.
52: if a primary process crashes, then its backup process detects this and
begins executing a user-specified routine.
Property 52 is implemented by having a failure detector service. This
service monitors each process and, upon detecting a failure, notifies any
interested process.
if the hand-off protocol runs only for a brief interval of time, then
it is safe to assume that no more than a single failure will occur during
execution. So, we assume:
53: In any execution of the hand-off protocol, at most one of the partici-
pating processes can crash.
Messages in transit can be lost if the sender or receiver of the message
crashes. Otherwise, messages are reliably delivered, without corrup-
tion, and in a timely fashion. No spurious messages are generated.
3
54:
We also can assume that messages are not lost due to failure of net-
work components such as controllers and repeaters. This is a reasonable
assumption because the processors of AAS are interconnected with redun-
dant networks and it is assumed that no more than one of the networks
will fail.
In any long-running system in which processes can fall, there must be a
mechanism for restarting processes and reintegrating them into the system.
We ignore such issues here because that functionality is provided by AAS
system software. Instead, we assume that at the beginning of a hand-off
from A to B, all four processes A.p, A.b, B.p, B.b are operational.
3 Fault-tolerance and Refinement
A protocol is a program that runs on a collection of one or more processors.
We indicate that 5 is executed on processor p by writing:
(S)atp			(1)
Execution of (1) is the same as skip if p has failed and otherwise is the same
as executing 5 as a single, indivisible action. This is exactly the behavior
one would expect when trying to execute an atomic action 5 on a fail-stop
processor.
Sequential composition is indicated by juxtaposition.
(Si)atpi (5??atp?			(2)
This statement is executed by first executing (Si ? at pi and then executing
(S2? at p2. Notice that execution of (S2? at p2 cannot assume that Si has
actually been performed. If pi fails before execution of (Si ? at pi completes,
then the execution of (Si at pi is equivalent to skip. Second, observe that
an actual implementation of (2) when Pi and p2 are different will require
some form of message-exchange in order to enforce the sequencing.
Finally, parallel composition is specified by:
cobegin (Si? atpi 11(52) atp? .11 (5n) atp? coend (3)
This statement completes when each component (S?) at p? has completed.
Since some of these components may have been assigned to processors that
fail, all that can be said when (3) completes is that a subset of the Si have
been performed. If, however, we also know the maximum number t of
failures that can occur while (3) executes, then at least n --H t of the S? will
be performed.
4
Proof Outlines
We use proof outlines to reason about execution of a protocol. A proof
outline is a program that has been annotated with assertions, each of which
is enclosed in braces. A precondition appears before each atomic action, and
a postcondition appears after each atomic action. Assertions are Boolean
formulas involving the program variables. Here is an example of a proof
outline.
Xl:
X2:
fx = 0 A y = OJ
x := x + 1
= 1 A y =
y := y $ 1
= 1 A y = 11
In this example, x = 0 A y = 0, x = 1 A y = 0, and x = 1 A y = 1 are
assertions. Assertion x = 0 A y = 0 is the precondition of Xl, denoted
pre(X1), and assertion x = 1 A y = 0 is the postcondition of Xl, denoted
post(X2). The postcondition of Xl is also the precondition of X2.
A proof outline is valid if its assertions are an accurate characterization
of the program state as execution proceeds. More precisely, a proof outline
is valid if the proof outline invariant
A((at(S) pre(S)) A (after(S) post(S)))
s
is not invalidated by execution of the program, where at(S) is a predicate
that is true when the program counter is at statement 5, and after(S) is a
predicate this is true when the program counter is just after statement 5.
The proof outline above is valid. For example, execution starting in
a state where x = 1 A y = 0 A after(X1) is true satisfies the proof outline
invariant and, as execution proceeds, the invariant remains true. Notice,
our definition of validity allows execution to begin anywher?even in
the middle of the program. Changing post(X1) (and pre(X2)) to x = 1
destroys the validity of the above proof outline. (Start execution in state
x = 1 Ay = 23Aafter(X1). The proof outline invariant will hold initially
but is invalidated by execution of X2.)
A simple set of (syntactic) rules can be used to derive valid proof out-
lines. The first such programming logic was proposed in [2j. The logic that
we use is a variant of that one, extended for concurrent programs [4].
5
Additional extensions are needed to derive a proof outline involving
statements like (?). Here is a rule for (1); it uses the predicate up(p) to assert
that processor p has not failed.
?AJS?BJ
Action at Processor: fAJ (S? at p fup(p) ?
Since execution of (S? at p when p has crashed is equivalent to a skip,
one might think that
?A? (3') atp ?(up(p) ? B) A (?up(p) ? A?			(4)
should be valid if ?AJ 5 ?BJ is. Proof outline (4), however, is not valid.
Consider an execution that starts in a state satisfying A and suppose p has
not crashed. According to the rule's hypothesis, execution of 5 would
produce a state satisfying B. If process p then crashed, the state would
satisfy ?up(p) A B. Unless B implies A, the postcondition of (4) no longer
holds.
The problem with (4) is that the proof outline invariant is invalidated
by a processor failure. The predicate up(p) changing value from true to false
causes the proof outline invariant to be falsified. We define a proof outline
to be Jault-invariant with respect to a class of failures if the proof outline
invariant is not falsified by the occurence of any allowable subset of those
failures.
For the hand-off protocol, we are concerned with tolerating a single
processor failure. We, therefore, are concerned with proof outlines whose
proof outline invariants are not falsified when up(p) becomes false for a sin-
gle processor (provided up(p) is initially true for all processors). Checking
that a proof outline is fault4nvariant for this class of failures is simple:
Fault-Invariance: For each assertion A:
(A A Aup(p)) ? A,A[up(p'):=ffilse]
p			p
where L[x:= e] stands for L with every free occurrence of x replaced by e.
4 Derivation of the Hand-off Protocol
Let CTR(f) be the set of controllers that own flight f. Property Pi can then
be restated as
6
Pi': ICTR(f)I < 1.
Desired is a protocol Xfer(A, B) satisfying
CTR(f) A Pi')
Xfer(A, B)
?B E CTR(f) A Pi')
such that Pi' holds throughout the execution of Xfer(A, B).
A simple implementation of this protocol would be to use a single
variable ctr(f) that contains the identity of the controller of flight f and to
change ctr(f) with an assignment statement:
(A ? ctr(f) A Pi')
ctr(f):= (ctr(f) --H
E ctr(f) A Pi')
fA))u (B)
This implementation is problematic because the variable ctr(f) must
reside at some site. Not only does this lead to a possible performance
problem, but it makes determining the owner of f dependent on the avail-
ability of this site. Therefore, we represent CTR(f) with a Boolean variable
C.ctr(f) at each site C, where
CTR(f): (CIC.ctr(f)).
By doing so, we now require at least two separate actions in order to
implement Xfer(A, B#one action that changes A.ctr(f) and one action
that changes B.ctr(f). Using the Action at Processor Rule, we get:
Xl:
X2:
(A c CTR(f) A PI')
(A.ctr(f):= fa1se? at A
((up(A) ? ((A ? CTR(f)) A (CTR(f) = ?))) A Pi')
(CTR(f) =
(B.ctr(f):= true? at B
((up(B) ? (B ? GTR(f))) A Pi'))
Note that pre(X2) must assert that CTR(f) = ? holds, since otherwise exe-
cution of X2 invalidates Pi'.
The preconditions of Xl and X2 are mutually inconsistent, so these
statements cannot be executed in parallel. Moreover, X2 cannot be run first
because pre(X2), CTR(f) = ?, does not hold in the initial state. Thus, X2
7
must execute after Xl. Unfortunately1 post(Xl) does not imply pre(X2); if
up(A) does not hold, then we cannot assert that CTR(f) = ? is true. This
should not be surprising: ff A fails, then it cannot relinquish ownership.
One solution for this availability problem is to employ a resilient pro-
cess. That is, each controller C will have a primary process C.pand a backup
process C.b executed on processors that fail independently. Each process
has its own copy of C.ctr(f), and these copies will be used to represent
C.ctr(f) in a manner that tolerates a single processor failure:
C.p.ctr(f) if up(C.p)
.ctrf . C.b.ctr(f) if ?up(C.p)
Since we assume that there is at most one failure during execution of
the protocol1 the above definition never references the variable of a failed
process. Replacing references to processor "A" in Statement Xl with "A.p"
produces the following:
Xla:
(A ? CTR(f) A Pi')
(A.p.ctr(f):= fame) at A.p
((up(Aj?) ? ((A ? CTR(f)) A (CTR(f) = ?))) A fl')
This proof outline is not fault-invariant, however. If A.p were to fail
when the precondition holds, then the precondition might not continue
to hold. In particular, if A E CTR(f) holds because A.p.ctr(f) is true and
A.b.ctr(f) happens to be false, then when A.p fails, A E CTR(f) would not
hold. We need to assert that A.p.ctr(f) = A.b.ctr(f) also holds whenever
pre(Xla) does. We express this condition using the following definition:
Pr. (up(A.p) A up(A.b)) ? (A.b.ctr(f) = A.p.ctr(f))
Note that if one of A.p or A.b has failed then A.p.ctr(f) and A.b.ctr(f) need
not be equal. Adding Pr to pre(Xla) gives the following proof outline,
which is fault-invariant for a single failure:
Xla:
(A E CTR(f) A Pi' A Pr)
(A.p.ctr(f):= false) atA.p
((up(A.p) ? ((A ? CTR(f)) A (CTR(f) = ?))) A Pi')
We need more than just Xla to implement Xl, however. Xla does not
re-establish Pr, which must hold for subsequent ownership transfers. This
8
suggests that A.b.ctr(f) also be updated. Another problem with Xla is
that post(Xla) still does not imply pre(X2): if up(A.p) does not hold, then
CTR(f) = ? need not hold.
An action whose postcondition implies
up(A.b) ? (?A.b.ctr(f) A (?up(A.p) ? (CTR(f) =
suffices. By our assumption, up(A.p) v up(A.b) holds, so this postcondition
andpost(X?a) will together allow us to conclude CTR(f) = ? holds, thereby
establishing pre(X2). Here is an action that, when executed in a state
satisfying pre(Xla), terminates with the above assertion holding:
Xlb:
?(A E CTR(f)) A Pi' A FrJ
(A.b.ctr(f):= fr?The? atAb
?up(A.b) ? (?A.b.ar(f) A (?up(A.p) ? (CTR(f) =
One might think that since Xla and Xlb have the same preconditions they
could be run in parallel, and the design of the first half of the protocol
would be complete. Unfortunately, we are not done yet.
The original protocol specification implicitly restricted permissable own-
ership transitions. Written as a regular expression, the allowed sequence
of states is:
(CTR(f) = ?A?? (CTR(f) = ?? (CTR(f) =
(5)
That is, first A owns the flight, then no controller owns the flight for zero
or more states, and finally B owns the flight. The proof outline above does
not tell us anything about transitions; it only tells that Pi' holds throughout
(because Pi' is implied by all assertions). We must strengthen the proof
outline to deduce that only correct transitions occur.
A regular expression (like the one above) can be represented by a finite
state machine that accepts all sentences described by the regular expression.
Furthermore, a finite state machine is characterized by a next-state transi-
tion function. The following next-state transition function ?AB characterizes
the finite state machine for (5):
?AB:
??Ay,?,?BJJ ifA E CTR(f)
f?,?B)?			ifcTR(f)=?
f?B11			ifBECTR(f)
The value of 6AB is the set of values of CTR(f) that are next allowed for the
protocol. For example, when CTR(f) = ? holds, 6AB says that a transition to
9
a state in which CTR(f) = ? holds or to a state in which CTR(f) = fBi holds
are the only permissible transitions. Note that since Pi' holds, we know
that the three cases A E CTR(f), CTR(f) = ?, B E CTR(f) are mutually
exclusive, 50 ?AB always has a unique value.
We further define 06AB to be the value of 6AB in the previous state
during the execution of the hand-off protocol1 or ffAi, ?, fBii if there is
no previous state. Our hand-off protocol only will make permissable state
transitions provided each assertion implies that CTR(f) E 06AB; that is,
provided the current owner of f is one of the owners that was acceptable
as the "next owner" in the previous state of the system.
We therefore add conjunct CTR(f) E 06AB to the assertions in the proof
outline and check to see if the stronger proof outline is valid. If it is valid,
then we can move on to implementing X2, the second part of Xfer(A, B);
otherwise, we will have reason to make further modifications.
Here is the (strengthened) proof outline with Xla and Xlb running in
parallel:
fCTR(f) E 06AB A A E CTR(f) A Pi' A Fri
cobegin
fCTR(f) E 06AB A A c CTR(f) A Pi' A Fri
Xla : (A.p.ctr(f):= ffiThe? at A.p
(CTR(f) ? 06AB A
(up(A.p) ? (A ? CTR(f) A (CTR(f) = ?)) A Pi'?
(CTR(f) C 06AB A (A E CTR(f)) A Pi' A Pr)
Xib : (A.b.ctr(f):= Jalse) atAb
(CTR(f) E ?6?? A
(up(A.b) ? (??A.b.ar(f) A (?up(A.p) ? (CTR(f) =
coend
fCTR(f) E 06AB A
(up(A.p) ? (A ? CTR(f) A (CTR(f) = ?))) A
(up(A.b) ? (?A.b.ar(f) A (?up(A.p) ? (CTR(f) =
A Pi' A Fri
Unfortunately, this proof outline is not fault-invariant. If A.p fails in a state
satisfying after(Xla) A at(Xib) then the following holds before the failure:
after(Xla) A at(Xlb) A up(A.p)
A up(A.b) A (CTR(f) =
A ??A.p.ar(f) A A.b.ctr(f) A 06AB = f?, Bi
After the failure, we have:
after(Xla) A at(Xlb) A ?up(A.p)
A up(A.b) A (A E CTR(f))
A ?A.p.ctr(f) A A.b.ctr(f) A ???? = ??, BJ
So, CTR(f) E 06AB does not hold after the failure, and the first conjunct of
post(Xla) is invalidated. One simple solution is to preclude states where
at(Xlb) A after(Xla) holds. This can be done by running the two actions in
sequenc?first Xlb and then Xla. The result is described by the following
proof outline:
Xlb:
Xla
(CTR(f) E 06AB A Pi' A Pr A A E CTR(f))
(A.b.ctr(f):= Jalse? at A.b
(CTR(f) ? 06AB A Pi' A A.p.ctr(f) A
(up(A.b) ? ?A.b.ar(f))?
(A.p.ctr(f):= faThe) at A.p
(CTR(f) E 06AB A Pi' A (up(A.b) ? ?A.b.ar(f))
A (up(A.p) ? ??A.p.ar(f)) A Pr?
therefore, according to the definitions of CTR(f) and A.ctr(f),
?CTR(f) E 06AB A A ? CTR(f) A Pi' A PrJ
What we really want to conclude in post(X?a)? however, is CTR(f) =
not just A ? CTR(f). This is easily done by strengthening the above proof
outline with the following:
POniy(A): For all controllers C: C # A: C ? CTR(f)
POniy(A) is initially true because A E CTR(f) A Pi' holds, It is not
invalidated by any assignment, because the only variables assigned to are
those of A.p and A.b. So, Poniy(A) remains true throughout the execution
ofXl.
The derivation of a protocol for X2 is basically the same, except that A
is replaced by B and false is interchanged by true. Doing so results in the
proof outline shown in Figure 1.
4.1 Implementing P3
Pwperty P3 of Section 2 is satisfied by the protocol in Figure 1 as long as
there are exactly two controllers. When there are more than two controllers,
11
Xlb
X?a:
X2b:
X2a:
?CTR(f) E 06AB A PI' A Pr A POnly(A) A A E CTR(f))
(A.b.ar(f):= fa1se? at A.b
?CTR(f) E 06AB A Pi' A POniy(A) A
A.p.ctr(f) A B ? CTR(f) A (up(A.b) ? ?A.b.ar(f))Y
(A.p.ctr(f):--H fa1se? at A.p
fCTR(f) E 0?B A Pi' A Pr A POnly(A)
AB?CTR(f)
A (up(A.b) ? ?A.b.ctr(f))
A (up(A.p) ? ?iA.p.ar(f))J
therefore, according to the definitions of CTR(f),
A.ctr(f), and POniy(B)
?GTR(f) ? 0?B A Pi' A Pr A POniy(B) A CTR(f)
(B.b.ctr(f):= true) at B.b
?CTR(f) E 0?B A Pi' A POniy(B) A ?B.p.ar(f) A
(up(B.b) ? B.b.ctr(f))1
(B.p.ctr(f):= true) at B.p
?CTR(f) E 06?B A Pi' A PrA POniy(B)
A (up(B.b) ? B.b.ctr(f)) A (up(B.p) ? B.p.ctr(f))
therefore, according to the definitions of CTR(f)
and B.ctr(f),
? 06?B A Pi' A PrA POniy(B) A B E CTR(f)?
Figure 1: Hand-off Protocol for A and B
12
a controller must query other controllers in order to determine which owns
a flight. Doing so is inefficient, so we instead consider having each con-
troller C maintain a variable C.ctriD(f) that names the owner of flight f.
As with C.ctr(f), we represent the value of C.ctrlD(f) in a manner that
tolerates a single site failure:
C.ctrlD(f):			75&ftiriI?D((?)) if up(C.p)			(6)
if ?up(C.p)
This variable can be used to implement the Boolean C.ctr(f) by defining:
C.ctr(f): (C.ctrlD(f) =
Thus, the assignment "C.ctr(f):= true" would be replaced by "C.ctriD(f) : =
C", and "C.ctr(f):= false" would be replaced by "C.ctrlD(f):= X" for any
valueX #
We can rewrite P3 as the following:
P3': (?C: (C.ctriD(f) =
(?C: (C.ctriD(f) = C) A (VC': C'.ctriD(f) =
For the protocol of Figure ?, P3' holds when at(X2b) is true because the
the antecedent is false. Furthermore, if we explicitly assign A.ctrlD(f):= B
as the assignments Xlb and Xla, then P3' holds throughout the execution,
provided C' ranges over the set ?A, B?. For the other controllers, additional
statements are needed, shown in Figure 2.
Since Z(A, B) in Figure 2 changes the values of C.ctrlD(f), it should
be executed when CTR(f) = ? holds, because otherwise its execution may
violate P3'. Thus, Z(A, B) would have to be started no earlier than after(Xia)
and terminate by at(X2a). Unfortunately, Z(A, B) may take a significant
amount of tim?even though its component statements can be executed
in parallel, the time to execute Z(A, B) will include some communication
and synchronization overhead. This extra time could make satisfying P2
hard or impossible.
Property P3' is perhaps a bit too strong. In fact, all that is really required
is that a controller be able to communicate with the process that owns a
flight. For example, C.ctriD(f) could be the start of a path of controllers,
terminating with the current owner. The scheme where C.ctrlD(f) indicates
the current owner is equivalent to requiring that this path have a length of
?. But, longer paths are also acceptable.
Let C C' denote that C.ctriD(f) = C', and let C ?N??* C1 denote the
transitive closure of ?. Using this notation, P3' can be expressed as:
ftrue?
Z(A, B): cobegin
Zp:
Th:
IIc:c??A,By: ?true?
(C.p.ctrlD(f):= B? at C.p
?up(C.p) ? (C.p.ctrlD(f) = B)J
IIc:c??A,By: ?true)
(C.b.ctrlD(f):= B? at C.b
?up(C.b) ? (C.b.ctrlD(f) =
coend
?up(C.p) ? (C.p.ctrlD(f) =
A up(C.b) ? (C.b.ctrlD(f) =
therefore, according to the definitions of C.ctrlD(f)
?VC??A,B?:C?B)Y
Figure 2: Hand-off Protocol for Controllers other than A and B
P3': (?C: C? C) ? (?C: C? CA(VC': C'? C)).
We weaken P3' as follows:
P3": (?C: C			C) ? (?C: C			C A (VC': C' ??
P3" is left invariant by the protocol in Figure 1. P3" is also an invariant
of the protocol of Figure 2 provided B B v B A initially holds. From
post(Z(A1 B)) and post(X2a), we conclude that as long as the execution
of Z(A, B) completes before another hand-off starts, F3' will hold once
Z(A, B) and the protocol in Figure 1 have both terminated. Since P3' implies
B B v B A, the system is once again in a state from which a hand-off
can be performed. Hence, Z(A, B) can begin executing at any point during
the hand-off from A to B--Hbecause its precondition, B B V B A, holds
throughout the protocol of Figure 1. And, Z(A, B) must complete before a
subsequent hand-off has started.
4.2 Implementation using Messages
So far, the protocol we have derived consists of assignment statements to
various variables that reside on separate processes. The protocol consists
14
of the three processes1 as follows:
cobegin
X?b : (A.b.ctr(f):= JaThe) atA.b
Xla: (A.p.ctr(f):= JaThe) at A.p
X2b : (B.b.ctr(f):= true) at B.b
X2a: (B.p.ctr(f):= true) at B.p
Zp:
Zb:
coend
IIc:??A,By: (C.p.ctrlD(f):= B) at C.p
IIc:???,B): (C.b.ctrlD(f):= B) at C.b
An actual Implementation would require that each assignment state-
ment be executed by the processor whose variable is being set. Further-
more, the assignment statements of the first process must be sequenced.
This sequencing will be accomplished in our implementation by processor
B.p1 since this processor starts the protocoL If B.p crashes, then B.b will take
over the sequencing. Because all assignments are constants to variables,
when taking over, B.b can simply start at the beginning of the sequenc?it
not need to know how far B.p got before faffing.
B.b does need to know when B.p has finished executing the hand-off
protocol. Otherwise, a crash of B.p might cause B.b to re-execute the hand-
off from A to B after f has been later handed off to another controller,
in which case B.b would undo that later hand-off. Hence, B.b must be
notified of the completion of the hand-off before any subsequent hand-offs
are started. We represent the fact that a hand-off from A to B is in progress
with a variable B.b.xjr, whose value is initially I.
In order to continue the implementation using messages, some further
details of the AAS system services must be given.
Communication between resillent processes uses send and receive.
If some process sends a message m to a resilient process C, then m is
enqueued at C.p if C.p has not crashed and enqueued at C.b if C.p has
crashed. Furthermore, send does not return control until the message
has been enqueued at the remote process. The remote process may
crash after enqueueing m but before delivering m, in which case m is
lost.
o+ The primary of a resilient process communicates with its backup
using log. Like send, log does not return control until the message is
enqueued by the remote process. A log that is executed when there is
no backup (for example, when the backup has crashed or when log is
executed by the backup itself) does nothing and immediately returns
control.
o+ Until the primary of a resilient process crashes, the backup delivers
only messages sent by log.
o+ When primary C.p crashes, C.b takes over by first processing any
enqueued messages sent by C.p using log. It then executes the user-
defined recovery protocol. And, finally, it receives messages sent to
We also use a variable in each process to represent the value of variables
C.p.ctrlD(f) and C.b.ctrlD(f). A simple approach would be to introduce
C.p.cwner(f) and C.b.owner(f), such that:
C.p.ctrlD(f): C.p.owner(f)
C.b.ctrlD(f): C.b.owner(f).
Doing so, however, is inefficient (as well as difficult given the AAS com-
munication primitives). Consider X?b in the hand-off protocol. To im-
plement X?b, B.p would send a message to A.b instructing it to execute
A.b.owner(f):= B. Since Xlb must complete before Xla starts, B.p cannot
start Xla before A.b completes its assignment. The result is two end-to-end
message delays.
A more efficient hand-off protocol can be implemented using the follow-
ing definitions of C.p.ctrlD(f) and C.b.ctrlD(f). Let the predicate Ec(f, X)
mean that C.b has enqueued but not yet processed a log from C.p that re-
quests the execution of C.b.owner(f):= X, and let Vc(f) be the value of X
in the most recent such log message. Then, we define:
C.p.ctrlD(f): C.p.owner(f)
C.b.owner(f) if (VX: ?Ec(f,X))
C.b.ctrlD(f):			Vc(f)			if(?X: Ec(f,X))
B.p can cause the execution of Xlb followed by Xla simply by sending
a single message to A requesting execution of owner(f):= B. Upon delivery
of this message, A.p first executes a log so A.b learns of the message. Since
log does not return until EA(f, B) holds, post(Xlb) holds when log returns.
A.p can then establlsh post(X?a) by executing C.p.owner(f):= B.
16
cobegin
?pre(X1b)J
(log "xjr:= (LA)"? at B.p
(send "owner(f):= B" to A? at B.p
(wait for "ack" from A? at B.p
?post(X1a) A pre(X2b)J
(log "owner(f):= B") at B.p
fpost(X2b) A pre(X2a)1
(B.p.owner(f):= B? at B.p
fpost(X2a)Y
(VC : C ? ?A, B): send "owner(f):= B" to C? at B.p
(log "xfr:= I"? at B.p
IIc:c#B: (when deliver "owner(f):= X") at C.p
(log "owner(f):= X"? at C.p
= A) ? post(Xlb) A (C # A) ? post(Zb)1
(C.p.owner(f):= X? at C.p
= A) ? post(Xla)
A (C # A) ? (post(Zp) A post(Zb))Y
(send "ack" to X) at C.p
lic:
lic:
coend
(when deliver "x:= v" from C.p do C,b.x:= v') at C.b
(when C.p fails do
if xfr =
then start hand-off of f from X to C') at C?b
Figure 3: Complete Hand-off Protocol
17
The complete hand-off protocol is shown in Figure 3. The assertions in
the code refer to Figures 1 and 2.
Acknowledgements Scott Stoller provided helpful comments on a draft
of this paper. We also would like to thank Mary Jodrie and Alan Moshel
for bringing the problem to our attention and for helping us develop the
requfrements.
References
[1]
E Cristian, B. Dancey and J. Dehn. Fault-tolerance in the Advanced
Automation System. In Proceedings of 20th International Symposium on
Fault-Tolerant Computing, Newcastle Upon Tyne, UK, 26-28 June 1990),
pp. 6-17.
[2] C. A. R. Hoare. An Axiomatic Basis for Computer Programming. Com-
munications of the ACM 12(1O):57?58O (October 1969).
[3]
Richard D. Schlichting and Fred B. Schneider. Fall-Stop Processors:
An Approach to Designing Fault-Tolerant Computing Systems. ACM
Transactions on Computer Systems 3(1):222--H238 (August 1983).
[4] Fred B. Schneider. On Concurrent Programming. To appear.
18
