BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1367
ENTRY:: 1993-10-14
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: A Security Architecture for Fault-Tolerant Systems
AUTHOR:: Reiter, Michael K.
DATE:: July 1993
PAGES:: 118
COPYRIGHT:: Michael K. Reiter 1993 - All Rights Reserved
ABSTRACT::
While there is considerable experience with addressing the needs for security 
and fault-tolerance individually in distributed systems, much less is 
understood about how to simultaneously address these needs in a single, 
integrated solution. Indeed, the goals of security and availability have 
traditionally been viewed as being in conflict, because replicationg data and 
services for availability makes them inherently harder to protect.

This thesis presents the design and implementation of a security architecture 
for fault-tolerant systems, including a set of results that underpin this 
architecture. We first present a methodology for balancing the aforementioned 
tradeoff between security and availability in distributed services. Using our 
techniques, a service can be replicated so that it will remain available and 
correct despite the corruption of some servers and clients by a malicious 
intruder. These results include the identification and prevention of a new 
form of attack in which an intruder effects and exploits violations of 
causality in the sequence of requests processed by the service. 

Second, we bring this replication methodology and other novel techniques to 
bear on an issue for which the conflict between security and availability is 
particularly troublesome, namely cryptographic key distribution via trusted 
services. We present authentication and time services that can securely and 
fault-tolerantly support cryptographic key distribution in a wide range of 
settings.

Third, we present the design and implementation of our security architecture, 
which employs these services. The architecture supports process groups
--a common paradigm of fault-tolerant computing--as its primary security 
abstraction, and provides tools to construct applications that are resilient 
to benign failures and malicious attacks. We discuss the integration of this 
architecture in the Horus system and focus on techniques to make group 
communication secure and efficient.

In the final contributions of the thesis, we further explore the importance of 
detecting causal relationships for security. We present a framework for 
examining attacks onaattempts to detect causal relationships. We also present 
several algorithms to prevent these attacks in some situations.
END:: CORNELLCS//TR93-1367
BODY::
A Security Architecture for
Fault-Tolerant Systems
Michael K. fleiter
Ph.D Thesis
93-1367
July 1993
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
A SECUffITY AR?CHITECTUIQE FOfQ
FAULT-TOLEkANT SYSTEMS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Michael K. Reiter
August 1993
Qc Michael K. Reiter 1993
ALL RIGHTS RESERVED
A SECURITY ARCHITECTURE FOR FAULT-TOLERANT SYSTEMS
Michael K. Reiter, Ph.D.
Cornell University 1993
While there is considerable experience with addressing the needs for security and
fault-tolerance individually in distributed systems, much less is understood about
how to simultaneously address these needs in a single, integrated solution. Indeed,
the goals of security and availability have traditionally been viewed as being in con-
flict, because replicating data and services for availability makes them inherently
harder to protect.
This thesis presents the design and implementation of a security architecture
for fault-tolerant systems, including a set of results that underpin this architecture.
We first present a methodology for balancing the aforementioned tradeoff between
security and availability in distributed services. Using our techniques, a service can
be replicated so that it will remain available and correct despite the corruption
of some servers and clients by a malicious intruder. These results include the
identification and prevention of a new form of attack in which an intruder effects
and exploits violations of causality in the sequence of requests processed by the
service.
Second, we bring this replication methodology and other novel techniques to
bear on an issue for which the conflict between security and availability is par-
ticularly troublesome, namely cryptographic key distribution via trusted services.
We present authentication and time services that can securely and fault-tolerantly
support cryptographic key distribution in a wide range of settings.
Third, we present the design and implementation of our security architecture,
which employs these services. The architecture supports processgroups a common
paradigm of fault-tolerant computing as its primary security abstraction, and
provides tools to construct applications that are resilient to benign failures and
malicious attacks. We discuss the integration of this architecture in the Horus
system and focus on techniques to make group communication secure and efficient.
In the final contributions of the thesis, we further explore the importance of
detecting causal relationships for security. We present a framework for examin-
ing attacks on attempts to detect causal relationships. We also present several
algorithms to prevent these attacks in some situations.
Biographical Sketch
Michael Reiter was born on February 9,1967, in Rochester, New York. At an early
age, his family moved to San Antonio, Texas, where he spent the majority of his
childhood. At age fifteen, he moved with his mother to Candler, North Carolina.
He later attended the University of North Carolina at Chapel Hill, where on May
14, 1989, he earned the degree of Bachelor of Science in Mathematical Sciences
with Highest Honors and Highest Distinction. He entered graduate school in the
Department of Computer Science at Cornell University in the fall of 1989. On July
28, 1990, he married Patricia L. Kelly. He earned his Master of Science degree in
computer science on August 26,1991.
iii
To TYicia
iv
Acknowledgements
More than ally other person, I am indebted to my wife, Patricia, for her infinite
patience, love, and encouragement. I have been much more of a student than a
husband during my time at Cornell. Yet, Tricia always understood, being content
just to make me laugh, or just to listen, whenever I needed it. I could not have
come this far without her.
Possibly the most regrettable part of leaving Cornell is ending my term as Ken
Birman's student. I have learned a tremendous amount from Ken, both technically
and professionally, and there is yet much more that I could learn from him. Ken
has gone far beyond the call of duty to foster my growth both as a person and
as a scientist. As an advisor, Ken is marked by his generosity, optimism, and
supportiveness. I am very appreciative of his friendship and guidance.
Two other people deserve special mention. Li Gong has acted like a second
advisor to me in many ways. Indeed, were it not for Graduate School regula-
tions forbidding it, Li would almost certainly have been a member of my thesis
committee. His expertise in the field of security was invaluable when I began my
research, and his continued support has been instrumental in the development of
my research in secure systems. In particular, the results of Chapter 5 of this thesis
were obtained with Li's input and guidance.
v
The second individual deserving a special mention is Robbert van Renesse.
Robbert's prowess in computer systems and his ever-present willingness to lend a
helping hand has made my time at Cornell much less frustrating than it could have
been. Working on the Horns project with Robbert has been an absolute pleasure
and a tremendous learning experience.
I extend my gratitude to my other committee members, Sam Toueg and Richard
Platek, for providing helpful input and advice when needed. I have especially taken
advantage of Sam when I could catch him and he has always helped with a smile.
In addition to those mentioned above, I have greatly benefited from interac-
tions with many other transient and more lasting members of the computer science
community at Cornell. Among others, these people include Carlos Almeida, Tom
Bressoud, Navin Budhiraja, Thshar Chandra, Tim Clark, Robert Cooper, Brad
Glade, Juris Hartmanis, Neel Jain, Dexter Kozen, Cliff Krumvieda, Dalia Malki,
Keith Marzullo, Gene Ressler, Aleta Ricciardi, Andre Schiper, Fred Schneider,
Patrick Stephenson, Uwe Wilhelm, and Raphael Yahalom. Carlos, Navin, Thshar,
Dalia, Patrick and Uwe deserve special praise for enduring me as an officemate.
I am also grateful to many non-Cornellians who have provided helpful informa-
tion during my thesis work, specifically Matt Blaze, Yair Frankel, Jack Lacy, and
Andrew Odlyzko.
I was fortunate to be supported by a National Science Foundation Graduate
Fellowship during my first three years at Cornell, and by a scholarship from GTE
during my last year. I also benefited from funding provided to the Isis and Horns
projects, in the form of DARPA/NASA subcontract NAG-2-593 administered by
the NASA Ames Research Center, DARPA/ONR grant N00014-92-J-1866, and
vi
grants from GTE, IBM, and Siemens, Inc. Of course, the standard disclaimers
apply: any opinions, conclusions or recommendations expressed in this thesis are
entirely my own and do not necessarily reflect the views or decisions of any of the
organizations mentioned in this paragraph.
I thank my family and my wife's family for their unswerving support through
graduate school. I especially thank my mother, whose love and caring I could
always feel. Finally and most importantly, I thank God, for granting me the
abilities and guidance to accomplish what I have.
vii
viii
Table of Contents
1 Introduction
2
3
How to securely replicate services
2.1			Introduction
2.2			State machine replication
2.3			The system model
2.4			Preserving integrity and			availability
2.4.1			Threshold signature schemes
2.4.2			The protocol
2.4.3			Discussion .			. . .
2.4.4			Performance: an example
2.5 Preserving input causality .
2.5.1			Threshold cryptosystems
2.5.2			The protocol . .
2.5.3			Discussion
2.5.4			Alternative approaches
2.6			Implementation issues
2.6.1			Atomic broadcast . . . .
2.6.2			Public encryption keys
2.6.3			Detection of corrupt processes
2.6.4			Summary . . .
2.7			R?elated work
2.8			Summary .
Fault-tolerant key distribution
3.1			Introduction
3.2			The time service			. . .
3.2.1			The algorithm . . . .
3.2.2 Comparison to alternative designs
3.3			The authentication service .			. .
3.3.1 The algorithm .
3.3.2 Comparison to alternative designs .
ix
6
6
9
10
14
15
17
19
21
25
28
29
35
36
38
39
40
41
42
42
43
44
44
46
46
51
52
53
56
3.4 Summary and future work
Secure process groups
4.1			Introduction
4.2			Programming model
4.3			Implementation .			.
4.3.1			Group keys			. .
4.3.2 The group join protocol
4.3.3 Secure group communication
4.4 Summary and discussion
4.4.1 Status of the system
4.4.2 Related work
4.4.3 Future work
59
4
5
Preventing denial and forgery of causal relationships
5.1			Introduction .			. . .			.			.			.			. .
5.2			The system model. .			.
5.3			Definition of causality			. .			.			.			. .
5.4			Causal security goals			. . .			.			.			.
5.5			Preventing denial . .			.
5.5.1			The causality			server .			. .			.			.			.
5.5.2			The conservative			approach
5.6			Preventing forgery. .			.
5.6.1			Signed vector			timestamps			.			. .			.
5.6.2			The piggybacking			algorithm
5.7			Summary and			discussion .			. .
5.7.1			Related work			. . .
5.7.2			Future work			.			.			.			. .
6 Conclusion
Bibliography
x
60
60
61
63
64
67
72
79
79
80
80
82
82
85
86
87
89
90
92
94
95
99
104
105
106
108
111
List of Tables
xi
2.1 Summary of notation
12
List of Figures
2.1
2.2
2.3
2.4
3.1
3.2
3.3
4.1
4.2
4.3
4.4
4.5
4.6
4.7
5.1
5.2
5.3
Structure of processes
New structure of processes
Response times of an example service
New structure of processes. . .			.
Protocol by which client C interacts with time service T
Protocol by which client C obtains a certificate for principal P
Design options for a fault-tolerant public key authentication service
Secure process groups
The llorus security architecture. .
VSYNC protocol by which site A joins group G, containing site B
Overview of the VSYNC group join protocol
Cryptographic operations			. 			. .
Member-to-group RPC latencies (ms)			.
Member-to-group bandwidth (kb/s)			.			.
Causality detection
The causality server CS
The piggybacking algorithm
xii
11
15
22
27
47
55
57
63
64
68
70
74
77
78
88
90
100
Chapter 1
Introduction
This thesis is a compendium of results in the area of security in distributed sys-
tems. The primary motivation behind this work was an effort that began in 1991
to construct the liorus system. Horus would be a new and improved version of
Isis, a system built at Cornell for exploiting distribution to build fault-tolerant
applications [BJ87b,BSS91]; indeed, the name Horus was taken from the son of
Isis in Egyptian mythology. One goal of Horus was to address security concerns
without compromising the fault-tolerance provided by Isis, and the results of this
thesis are the product of that effort. The thesis culminates with a description of
the design and implementation of the security architecture for Horus at which we
arrived, hence the title A Securnty Archztecture for Fault- Tolerant Systems.
There are many issues that must be addressed in the construction of a "secure
system," and we have made no attempt to address them all in the construction of
our security architecture. The focus of this work has been on security problems
unique to, or exacerbated by, distribution. Moreover, we have concentrated pri-
marily on systems issues, to the exclusion of formalisms for specifying or reasoning
about security policies or the correctness of implementations. As a result, in our
prototype implementation of the architecture, addressing intra-node security issues
and verifying the system rigorously, both of which are essential for the system to
2
be used in critical applications, have been only a secondary concern. Of primary
concern have been questions of how security can be provided in distributed systems
without degrading performance and fault-tolerance.
To avoid confusion, a note on our use of the terms "security" and "fault-
tolerance" is in order. There have been several attempts in the scientific literature
to unify these historically separate notions. For instance, it has been suggested
that security can be viewed as a subset of fault-tolerance, where the faults being
tolerated are deliberate and tend to be very complex [TH86]. Laprie similarly ar-
gues that security should be viewed as the avoidance of and tolerance to deliberate
faults, but also goes further to describe security and reliability as different aspects
of a common problem [Lap89] (see also [RD86,DR86]).
Nevertheless, in this thesis our terminology adheres to the more traditional
viewpoint that security and fault-tolerance address different concerns. The most
common form of failure in most systems is where some component has, or appears
to have, simply stopped execution (i.e., crashed); it is these "benign" failures for
which Horns facilitates tolerance, While more malignant failures are possible, they
are extremely rare in real systems and would normally be readily detectable. How-
ever, in systems where deliberate hostile attack is sufficiently likely, the malicious
corruption of some system components becomes more of a concern. It is in this
arena where security techniques are needed. Therefore, here we generally use the
term fault-tolerance to denote the ability of an application or service to remain
correct and available despite the benign fazlure (versus the malicious corruption)
of some of its components. More precise definitions will be given where needed.
Prior to this work, there was very little success in addressing the needs for secu-
rity and fault-tolerance simultaneously in distributed systems. Indeed, the goals of
security and availability in distributed systems have traditionally been viewed as
being in conflict. The reason for this is that the only generally feasible technique
for making data or services highly available is to replicate them. But, replicating
3
them also makes them harder to physically protect [T1186,HT88,LABW92]. This
conflict is particularly pertinent to services that are involved in enforcing security
policy, i.e., that are part of the trusted computing base (TCB) [DoD85] of a system.
Prudence dictates that the TCB should be kept as small and localized as possible
in order to facilitate its protection. Distribution of services in the TCB makes its
protection more difficult.
This thesis begins in Chapter 2 with a methodology for balancing this tradeoff
between security and availability in distributed services. More precisely, in Chap-
ter 2 we present a method for constructing replicated services that retain their
availability and integrity despite several servers and clients being corrupted by an
intruder, in addition to others failing benignly. We also address the issue of main-
taining a causal [Lam78b] order among client requests. We illustrate a security
breach resulting from an intruder's ability to effect a violation of causality in the
sequence of requests processed by the service, and propose an approach to counter
this attack. An important and novel feature of our schemes is that the client need
not be able to identify or authenticate even a single server. Instead, the client is
required only to possess at most two public keys for the service. We discuss the
performance of a service implemented with one of our protocols, and we argue for
the practicality of our schemes through a discussion of several issues pertinent to
their implementation and use.
The techniques presented in Chapter 2 have been of great use in our security
architecture. Like many security architectures, ours employs cryptography to pro-
tect communication from corruption or unauthorized disclosure. This, in turn,
requires that some form of secure key distribution exist. Many authentication or
key distribution protocols have been proposed to distribute cryptographic keys for
secure communication in open networks (e.g., [NS78,D881,OR87,SN888]). How-
ever, these protocols often employ trusted authentication and time services whose
corruption or failure could result in security breaches or prevent correct principals
4
from establishing secure communication. In Chapter 3, we describe how we have
employed the techniques of Chapter 2 to construct a highly available and secure
authentication service. We also describe the design and implementation of a secure
time service to support fault-tolerant key distribution. Together, these services can
securely and reliably support key distribution in a wide range of systems.
A discussion of how these services are used in our architecture is deferred until
Chapter 4, where we finally detail the security architecture that we have developed
and implemented in Horus. The security architecture supports process groups a
common paradigm of fault-tolerant computing [CZ85,BJ87b,KT91,ADKM92]--Has
its primary security abstraction, and provides tools for the construction of appli-
cations that can tolerate both benign component failures and advanced malicious
attacks. By integrating this architecture into Horus, we have secured Horus' v?rtu-
ally synchronous process group abstraction and the programming tools it supports.
Moreover, the resulting system gives us the opportunity to compare the perfor-
mance of insecure and secure operations in Horus and to illustrate how techniques
such as caching and moving costly operations off critical paths can be used to make
secure operations efficient.
The final contributions of this thesis appear in Chapter 5, where we further
explore the issue of detecting causal relationships in hostile environments. As
described above, this issue is initially raised in Chapter 2 in our discussion of
maintaining a causal order among requests to a replicated service. However, as we
will show in Chapter 5, the attack discussed in Chapter 2 is only an instance of one
type of attack on attempts to detect causal relationships. In Chapter 5, we motivate
the need to accurately detect causal relationships in a wide range of settings despite
malicious behavior, and we present a framework within which attacks on causality
detection can be examined. Then, we present several algorithms for defending
against attacks on causality detection in some situations.
Much of each chapter of this thesis is taken verbatim from previously written,
5
independent papers (specifically, [R1392,RG93,RBvR93J and revisions thereof). As
a result, each chapter is largely self-contained, and a reader need not have read
previous chapters to understand a given chapter.
Chapter 2
How to securely replicate
services
2.1 Introduction
Distributed systems are often structured in terms of clients and services. A service
exports a set of commands, which clients invoke by issuing requests to the service.
After executing a command, the service may return an appropriate response to the
client that invoked the command. In the simplest case, the service is implemented
by only one server. If this server is not sufficiently immune to failure, however,
then the service must be replicated.
As mentioned in Chapter 1, replication can introduce security risks. In partic-
ular, it is often more difficult, or at least requires more resources, to protect many
distributed servers from corruption by an intruder than it is to protect only a single
server [TH86,HT88,LABW92]. To compensate for this, a replicated service could
be designed to remain available and correct despite several servers being corrupted
by an intruder (in addition to others failing benignly). One way to do this employs
the state machine approach [Sch90] to replicating the service, in which each server
individually computes each response and sends it to the client. If the client authen-
6
7
ticates the response from each server and accepts the response sent by a majority
of servers, then it obtains the correct response if a majority of servers are correct.
This approach, however, requires more from clients in ability, computation,
and storage than not replicating the service. First, the client must always know
how many servers comprise the service and must be able to authenticate each of the
servers individually. This may be difficult if the set of servers can change over time
or if there is no trustworthy source from which the client can obtain the servers'
identities or authentication information. Second, if there are n servers, then the
client must perform 0(n) additional cryptographic computations to authenticate
replies than if the service were not replicated. Third, the client must possess a
public key for each server or a secure channel to each server, with an n-fold cost in
storage. Finally, in cases in which a client must forward the (digitally signed) replies
of the servers to other parties, as is the case, e.g., in many cryptographic protocols
(see the "push" technique of [LABw92]), the client must store and forward 0(n)
replies, instead of only one as if the service were not replicated.
In this chapter we propose a solution to these problems using a modification of
state machine replication. In its full generality, our approach can be used to imple-
ment a service with n servers so that for each request, a client accepts a response
computed by a correct server, and no other responses, provided that n > t + b
where b is the maximum number of corrupt servers and t is the maximum number
faulty and corrupt servers combined. A novel feature of our approach is that unlike
"normal" state machine replication described above, each client possesses exactly
one public key for the service (as opposed to one for each server) and can treat
the service as a single object for the purposes of authentication. This enhances
application modularity and significantly simplifies the service interface for clients,
because each client is insulated from internal security policies of the service and
details of what or how many servers comprise the service. We emphasize that the
client need not be able to identify or authenticate a single server to authenticate
8
the response of the service. Moreover, the client incurs no additional cryptographic
computation or storage costs than if the service were not replicated.
Even in a system with the above guarantees, correct clients may accept im-
proper responses from the service if an intruder has caused the correct servers to
process improper requests or to process requests in an incorrect order. In this
chapter we also discuss this issue. We focus on an attack in which an intruder
effects and exploits a violation of causality in the sequence of requests processed
by the service. We describe a way to avoid this attack that requires that each
client possess only one additional public key for the service. And, the service is
guaranteed to make progress if n > t + 2b. llowever, this approach has shortcom-
ings: e.g., while each client is required to possess only one additional public key
for the service, in many cases this additional key must be different for each client.
We discuss alternatives to our approach that eliminate this need (but introduce
others), and cryptographic advances that could improve our approach.
As will be discussed in Chapter 3, we have used our techniques to implement
an authentication service in our security architecture for fault-tolerant systems. In
our security architecture, this authentication service securely and fault-tolerantly
supports the distribution of cryptographic keys for secure communication in open
networks. In this chapter we use this service to illustrate how one of our protocols
can perform in practice. In addition, we devote one section to a discussion of
several practical issues that arise in the implementation and use of our protocols,
as well as ways to optimize our methods in practice.
The remainder of this chapter is structured as follows. In Section 2.2 we give a
brief overview of state machine replication; for more detail, the reader should see
[Sch9O]. In Section 2.3 we enumerate our assumptions about the system. In Section
2.4 we present a method of implementing services that provides the availability and
integrity guarantees outlined above. In Section 2.5, we discuss the importance of
maintaining causality among client requests and a method to counter an intruder's
9
attempts to exploit violations of causality. Section 2.6 discusses several issues
pertinent to the implementation and performance of our protocols. We outline
related work in Section 2.7 and conclude in Section 2.8.
2.2 State machine replication
A state machine consists of a set of state variables and exports a set of (possi-
bly parameterized) commands. The state variables encode the state of the state
machine, and the commands transform that state. A client of the state machine
invokes a command by issuing a request to the state machine. Each command is
implemented by a deterministic program and is executed atomically (i.e., indivis-
ibly) with respect to other commands. Commands should be executed by a state
machine in an order that is consistent with Lamport's causality relation [Lam78b].
That is, requests from the same client should be processed in the order they were
issued, and if one request could have caused another from a different client, then
a state machine should process the former before the latter. Execution of each
request results in some response (i.e., output), which we assume is returned to
the client that issued the request. Responses of a state machine are completely
determined by its initial state and the sequence of requests it processes.
State machine replication is a general method of implementing a replicated
service by simultaneously employing many state machine servers and coordinating
client interactions with them.1 If all servers are initialized to the same state, and if
all correct servers process the same sequence of requests, then all correct servers will
give the same response to any given request. By properly combining the responses
of the servers to form the response of the service, where "properly" depends on
the number of servers that may be faulty or corrupt, the service can retain its
?Whlle the state machine servers must satisfy the same specification, they need not be identical
replicas of one another, and it may be desired that they not be identical, to avoid a (possibly
deliberate) design flaw affecting all of them. Employing non-identical replicas is similar to the
use of fl-version programmin9 as applied in [Jos87,JA88].
10
availability and integrity despite some server corruptions and failures.
2.3 The system model
Our system consists of a set of processes, n of which are servers and the remainder of
which are clients. All processes communicate exclusively by passing messages over
a network. We assume nothing about this network other than what is required
to implement the communication protocols on which our schemes rely; we give
specifications of these protocols later in this section.
A process is correct in a run of the system if it always satisfies its specification.
A process may fail by prematurely halting execution (i.e., by crashing)2; a process
that fails in a system run is said to be faulty in that run. In addition, a process
may be corrupted by an intruder, in which case it can behave in an arbitrarily
malicious manner. However, corrupt processes do not have access to the internal
states of correct or faulty processes and are limited by the (conjectured) properties
of the cryptosystems and signature schemes we employ. We assume that there exist
known constants t and b b ? t, such that in any system run, there are at most
b corrupt servers and at most t corrupt and faulty servers combined. We assume
that n > t + b.
While processes fail or are corrupted as a single unit, it is convenient to view
each process as consisting of logically separate modules (see Figure 2.1). More pre-
cisely, each server consists of a communication module, a coordination module, and
an application module. The communication module is closest to the network, the
application module is furthest, and the coordination module lies in between. The
application module of a server is a state machine as described in Section 2.2. The
coordination module delivers a request (c, m), with contents m and (apparently)
from client c, to that state machine by executing deiiver((c, m?); this places (c, m?
2The results of this chapter also hold if the faulty processes include those that fail by omission
or timing failures [CASD85J.
11
at the end of a list of requests to be processed by the state machine. Executions of
deliver are made strictly sequentially; i.e., deliver is not executed until all previous
executions of deliver have completed. The coordination module also implements a
primitive respond(e, m) by which the application module can send a response m
to a client c. The response m produced by the application module is of the form
(c, m', m11?, where this response resulted from processing the request (c, m'? and
m1, is the "contents" of the response. (Alternatively, the response could include
only an identifier for the request, instead of the entire request.)
(a) Server
Application Module
- - respond(c, m)
deli\re((c,m??
Coordination Module
rec?eivt((s,m?) abcast(m) send(Q,m%
Communication Module
Network
(b) Client
Application Module
request(m)
accept(m)
Communication Module
Figure 2.1: Structure of processes
Network
The communication module implements two primitives for use by the coordina-
tion module. The coordination module can execute send(c, m) to send a message
m to a client c, which we assume eventually arrives at c if the server that executes
send(c, m) is correct. So, respond(c, m) can be implemented by a single send(c, m).
In addition, there is an atom?c broadcast primitive abcast(m) by which the coor-
12
$
Table 2.1: Summary of notation
Definition
Notation
systemparameters
n total number of servers
b maximum number of corrupt servers
t maximum number of faulty and corrupt servers combined
serverroutines
deliver((c, m?)			delivers client request (c, m? to application module
respond(c, m)			application module responds to c with m
receive((s, m?)			receives message (s, m? for coordination module
abcast(m)			if executed at s, atomically broadcasts (s, m? to servers
send(c, m)			coordination module sends m to client c
clientroutines
accept(m)			accepts response m for the application module
request(m)			if executed at c, issues request (c, m? to service
dination module of a server s can broadcast a message (s, m? to all servers. A
server's communication module receives a message (s, m?, with contents m and
(apparently) from server 8, by executing receive((s, m?); this places (s, rn') at the
end of a list of received messages that is read by the coordination module. We
assume that executions of receive are performed strictly sequentially. Messages are
received only from servers, according to an atomic broadcast protocol that satisfies
the following specification.
Receipt Validity:
1. If server 8 is correct or faulty, then a correct or faulty server executes
receive((s, m?) only if s previously executed abcast(m).
13
2. If a correct server 8 executes abcast(m), then receive((s, m?) is executed
at all correct servers.
Receipt Order: If receive((s, m?) is the i-th execution of receive at a correct or
faulty server, then receive((s, m?) is the i-th execution of receive at all correct
servers. That is, all correct servers receive the same sequence of messages,
and the sequence of messages received by a faulty server is a prefix of the
sequence of messages received by a correct server.
Options for implementing atomic broadcast are discussed in Section 2.6.1. Here
we simply note that there already exist protocols in the literature that satisfy this
specification in various models. Our protocols do not rely upon any bounds on
message transmission times or the execution speeds of processes, and so the only
such bounds required for our results, if any, are those required by the particular
atomic broadcast protocol used.
A client consists of only two modules, an application module and a communz-
cation module. The application module of a client c is some client program that
can issue a request (c, m? to the service by executing request(m). The commu-
nication module of the client implements this primitive, e.g., by digitally signing
and broadcasting (c, rn? to the entire network. Assuming that each request from a
correct client eventually reaches some correct server, servers can easily implement
an atomic broadcast protocol for client requests by, e.g., forwarding requests to
the other servers via abcast and determining the delivery order of requests by their
order of receipt at correct servers.3 That is, we assume that servers deliver only
client requests, according to a protocol that satisfies the following properties.
Delivery Validity:
1. If client c is correct or faulty, then a correct or faulty server executes
deliver((c, rn')) only if c previously executed request(m).
3This protocol can be optimized so that a client request usually results in only one server
abcast, rather than n.
14
2. If a correct client c executes request(m), then deliver((c, m?) is executed
at all correct servers.
Delivery Order: If deliver((c, m?) is the ?-th execution of deliver at a correct or
faulty server, then deliver((c, m)) is the i-th execution of deliver at all correct
servers. That is, all correct servers deliver the same sequence of requests, and
the sequence of requests delivered by a faulty server is a prefix of the sequence
of requests delivered by a correct server.
Assuming that each server is initialized to the same state, Delivery Order im-
plies that all correct and faulty servers will produce the same response (or no
response) to a given request. The communication module of a client accepts a
response m for its application module by executing accept(m). Responses are ac-
cepted strictly sequentially; i.e., accept is not executed until all previous executions
of accept have completed.
2.4 Preserving integrity and availability
Recall from Section 2.1 that our first goal is to ensure that for each request, a
client accepts a response computed by a correct server, and no other responses.
We satisfy these requirements by replacing the respond(c, m) and accept(m) rou-
tines of servers and clients, respectively, with two new routines, respond'(c, m) and
accept'(m), that will ensure this. Thus, the new structures of processes will be as
pictured in Figure 2.2. While we have replaced the respond routine with respond'
at the interface provided to the application module of each server, respond will
be used in the implementation of respond'. Similarly, accept will be used in the
implementation of accept' to accept a message at a client. As desired, the respond'
and accept' routines will ensure that the following hold.
Integrity: A correct or faulty client c executes accept(m) only if a correct server
executes respond'(c, m).
15
Availability: If a correct client c executes request(m), then c executes
accept((c, m, m?N)) for some
(a) Server
Application Module
respond'(c,m)
Coordination Module
ireceivtc?s,?m? abcast(m) send(?,m)
Communication Module
Network
(b) Client
Application Module
request(m?) -
accep?(m)
Communication Module
Network
Figure 2.2: New structure of processes
2.4.1 Threshold signature schemes
The respond' routines at the different servers will employ a (k, n) -threshold stg-
nature scheme. A (k, n)-threshold signature scheme is, informally, a method of
generating a public key and n shares of the corresponding private key in such a
way that for any message m, each share can be used to produce a partial result
from m, where any k of these partial results can be combined into a signature for m
that can be verified with the public key. Moreover, knowledge of k shares should
be necessary to sign m, in the sense that without the private key it should be
computationally infeasible to (i) create a signature for m without k partial results
16
for m, (ii) compute a partial result for m without the corresponding share, or (iii)
compute a share or the private key without k other shares.
Cryptanalytic attacks against threshold signature schemes differ from those
against conventional signature schemes in that the cryptanalyst may possess some
number of shares and be able to acquire partial results, in addition to
message/signature pairs. For our purposes, we will say, informally, that a (k, n)-
threshold signature scheme is secure if it satisfies the following properties:
1. Possession of only k --H 1 or fewer shares and of partial results for various
messages does not facilitate signing new messages. That is, if a possessor of
such information can sign a new message, then it could also sign that message
without knowledge of any shares or partial results for any messages. This
property, which is formalized in [FD92], says that the threshold signature
scheme is as secure as the conventional signature scheme on which it is built.
2. The conventional signature scheme on which the threshold signature scheme
is built prevents selective forgery under known signature attacks. That is, a
cryptanalyst cannot manage to sign messages of its choice even though it can
see signatures for various other messages (not of its choice).
Our respond' routine is not dependent upon any particular implementation of a
(k, n)-threshold signature scheme. For concreteness, however, we describe respond'
in terms of an implementation proposed in [FD92] (which is a slight variation of one
proposed in [DF92]). That implementation begins with an RSA [RSA78] public
key (e, N) and private key d, where N is the product of two safe primes and the
Carmichael function A is used in place of Euler's totient function ? to create e
from d. That is, ed 1 mod A(N), where A(N) is the smallest positive integer
such that xA(N) 1 mod N for all x E ZN*. The n shares fKiJi<i<n are generated
from d in such a way that for any set T c f1,..., n? of size k, ZiET(Ki Pi,T)
d --H 1 mod A(N), where the integers tPi,T?iET are fixed a priori and public. (Each
17
of the integers Pi,T for all i and T can be computed from a fixed set of n integers,
each of binary length O(log n), with 0(n) integer multiplications and additions.)
By defining the i-th partial result for a message m to be am,i = mKi mod N, it
follows that for any T C ?1,... , ?1 of size k, Am,T = m Hi?T(am,i)Pi?T mod N is
the proper RSA signature for m, i.e., md mod N. Variations of this scheme have
been proved to be as secure as RSA, in the sense of property 1 above [FD92].
For reasons of security and efficiency, it is often preferable to sign a message
digest of a message, as opposed to the message itself [Den84]. A message digest
function f has the properties that the message digest f(m) for any input m can be
computed efficiently, but it is computationally infeasible to produce two inputs m
and m' such that f(m) = f(m') or to produce any input m such that f(m) = D for
some prespecified message digest D. Several efficient implementations of message
digest functions have been proposed (e.g., [Riv9l]). We henceforth use f to denote
such a function.
2.4.2 The protocol
Suppose that a public key (e, N) and corresponding shares fKi?i<i<n are created
as above, with the threshold parameter k = b t 1, and distributed so that for all
i 1 ? i ? n the i-th server 5j is the sole possessor of Kj and any process can
reliably obtain (e, N), the public key of the service. We do not discuss methods for
distributing this information, although we note that all public key systems require
similar steps. The (information for computing the) integers Pi,T for all i and T can
be "hardwired" into the implementation of the servers. Then, the respond'(c, m)
and accept'(m) routines are implemented as follows.
Routine respond'(c, m) at server s?:
1. Execute abcast((f(m), a1(m),i?), where ?f(m),i = (f(m))Ki mod N is
5j'5 partial result for f(m).
18
2. Waft until a set of messages 1(sj, (f(m), ?j??ljET? IT = k, has been
received such that Af(m),T = f(m) fIjET(Oj)PiT mod N is a valid
signature for f(m) (i.e., such that (Af(m),T)? = f(m) mod N).4
3. Execute respond(c, (m, Af(m),T?).
Routine accept'(m) at client c:
1. If m is not of the form ((c, m', m11?, 5'), then return to the calling routine.
2. If accept((c, m', mm)) for some mm was previously executed at c or if
request(m') was not executed at c, then return to the calling routine.
3. If 5' is a valid signature for f((c,m',m11)) (i.e. if ?e = f((c,m',m11))
mod N), then execute accept((c, m', m11)) and return to the calling rou-
tine only after it has completed.
Theorem 2.1 If the threshold signature scheme is secure, then this protocol sat-
isfies Integrn'tu.
Proof. Because at most b ? k servers are corrupt and because each correct or faulty
server produces partial results only for responses that it computes, the assumption
that the threshold signature scheme is secure implies that the corrupt servers can
generate signatures only for (the message digests of) responses computed by a
correct server (and possibly for other, presumably useless, message digest values).
That is, the corrupt servers cannot generate an improper response that a client
will accept, and so any response that is accepted by a correct or faulty client was
computed by a correct server. ?
Theorem 2.2 This protocol satisfies Availabilitg.
4Here we assume that messages can be received during the execution of respond'; e.g., locks
required to receive messages must not be kept by threads waiting in step 2 of respond'. More
generally, we assume that the replacement of respond and accept with respond' and accept' does
not result in the violation of Receipt Validity or Order, or of Delivery Validity or Order.
19
Proof. Because n > t + b, at least b + 1 = k correct servers broadcast their partial
results for each response. Thus, by Receipt Validity, each correct server eventually
receives k correct partial results for each of its responses, and so each correct server
can correctly sign each of its responses. Since each request issued by a correct client
is delivered at all correct servers (by Delivery Validity), it follows that each correct
client will accept a response for each of its requests, assuming that a response from
a correct server eventually reaches the client. Li
2.4.3 Discussion
In theory, the most computationally expensive part of this protocol is step 2 of the
respond' routine, in which the server sorts through the partial results it receives
until it finds a T of size k such that Af(m),T is a valid signature. The server must
examine at most the first k + b = 2b + 1 partial results received (from k + b unique
servers), and at most (k+kb) subsets of partial results, because in k + b partial
results are at least k correct ones. While this search is exponential in b in the
worst case and could be costly if b is large, in most systems n (and thus b) and
the actual number of corrupt servers will typically be small. (For example, our
experience with the Isis system [BSS91J suggests that fault?tolerant services are
sometimes implemented by three to five servers, but rarely more.) Thus, while we
are pursuing optimizations to make this search less costly as a function of b (see
Section 2.5.4), we view them to be primarily of theoretical interest.
In terms of communication complexity, in a run with no failures or corruptions
the replacement of respond with respond' results in an additional n executions of
abcast, which can be executed concurrently. Therefore, the entire protocol that
begins when a client issues a request and ends when it accepts a response consists
of three communication "phases" that must be executed roughly sequentially: the
request by the client, the dissemination of partial results (n concurrent executions
of abcast), and the sending of the responses (n concurrent executions of respond).
20
This communication can be optimized in at least two ways. First, because
a client needs to receive only one correctly signed response for each request for
Availability to be satisfied, only t+1 servers need to be designated to respond to any
given request. In addition, the partial results for the response need to be broadcast
only to that set of servers. A second optimization is for servers to communicate
partial results by a weaker broadcast protocol that satisfies only Receipt Validity,
because Receipt Order is not required for the dissemination of partial results in
this protocol. Since such a broadcast is weaker than atomic broadcast, it can be
implemented at least as efficiently. The protocol for disseminating partial results
could be further weakened to satisfy only part 2 of Receipt Validity. However, doing
so would allow corrupt processes to disseminate incorrect partial results on behalf
of correct and faulty servers, thereby further complicating each server's search for
partial results that form a correct signature.
In this protocol, each server must know O(m log n) bits of public information
(in order to compute the Pi,T'5), in addition to the public key for the service and
its share of the private key. Each client needs to know only the public key of the
service and must verify only one (correctly signed) response per request. While
additional correctly signed responses may arrive at the client,they are discarded
in step 2 of accept' without performing any cryptographic computation on them.
Versus "normal" state machine replication where clients authenticate each server
individually, the primary practical weakness of our approach is that by trading
client burden for server burden, we limit the ability of our services to scale to
very large systems with many clients. Our approach was motivated primarily by
the need for a highly secure and available service--Hthe authentication service of
Chapter 3 where alternatives for dealing with issues of scale previously existed
and would often be necessary anyway.5 In this case, our approach has yielded sub-
5Having one authentication service for a very large system can be administratively impractical,
and there may not be a single authority trusted by everyone to protect it (LABW92]. Approaches
using multiple authentication services (e.g., one per administrative domain) have been employed
in systems to remedy these problems (e.g., [SNS88,LABw92]).
21
stantial benefits by insulating clients from the implementation details and internal
security policies of the service (e.g., the value of b), and by minimizing overhead on
client processors and the latencies of client protocols that use the responses of the
service. Our approach should be an attractive alternative for systems with similar
requirements.
2.4.4 Performance: an example
As an example of how the protocol of Section 2.4.2 might perform in practice, in
this section we discuss the performance of the public key authentication service
that will be described in Chapter 3. This service was a primary motivation for our
work and has been implemented using the protocol of Section 2.4.2, with some of
the optimizations discussed in Section 2.4.3. As will be discussed in Chapters 3
and 4, this service is an integral part of our larger security architecture.
Briefly, in our security architecture the authentication service is the trusted au-
thority as to which public key belongs to which principal (e.g., computer, user). To
exercise this authority, our authentication service produces public key certificates
for principals, each of which contains a principal identifier and the public key for
that principal, bound together with the private key signature of the authentication
service. Because requests for certificates commute and need not be authenticated
(any client can obtain a certificate for any other), clients issue requests to the
service with an unordered, unprotected multicast.
The performance of our service is illustrated in part (a) of Figure 2.3. The
line indicates the mean response time in seconds as a function of n, the number of
servers. For each n, we assumed that t = b = [(n --H 1)/2j and 50k = [(n t 1)121.
We have illustrated the curve for up to nine servers, although we expect that few
services will be implemented by more than five, as mentioned in Section 2.4.3. In
these tests the RSA modulus N was 512 bits. These tests were executed between
user processes over SunOS 4.1.1 on moderately?loaded 33MHz Sun 4c/25 Sparc
22
ELC workstations. The workstations spanned two 10 Mbit/s ethernets connected
by a gateway. Each data point is the mean of 40 consecutive trials.
3.8			90
3.7			80
3.6			70
3.5			60
5			ms
3.4			50
3.3			40
3.2			30
3.1			20
3			5			7			9
n
(a) The solid line shows response times of
a service implemented with the protocol of
Section 2.4.2. The poor response time and
the rapid increase in response time as a
function of n are a direct result of the cost
of the (software) modular exponentiation
routines used.
- - - state machine
replication
a
3			T
7
n
9
(b) Here the modular exponentiation rou-
tines have been removed from the servers,
to simulate performance if the servers
were equipped with hardware for this.
The dashed line shows the service imple-
mented with normal state machine repli-
cation, again assuming optimal modular
exponentiation.
Figure 2.3: Response times of an example service
The cost of performing exponentiation modulo N is the direct cause of the poor
response times in part (a) of Figure 2.3 and, in general, is the limiting factor in the
performance of our authentication service. Exponentiation modulo N is required
to construct a partial result from a share and a message, and k exponentiations
modulo N (but with exponents much smaller than N) are used to construct a
signature from k partial results. Moreover, none of these modular exponentiations
lend themselves to well-known optimizations using the Chinese Remainder Theo-
rem (see, e.g., [SV93J), as aserver cannot be allowed to know the factorization of N.
23
Part (a) of Figure 2.3 reflects the cost of performing these modular exponentiations
in software,6 and we expect that hardware support for modular exponentiation will
generally be required at the servers to achieve satisfactory performance with the
protocol of Section 2.4.2. Even with most presently available hardware [Bri9O],the
cost of modular exponentiation will continue to be the primary limiting factor in
our protocol. However, advances that promise, e.g., modular exponentiation with
a 512-bit exponent and modulus in a few milliseconds or less (e.g., [8ed88,OSA9l,
SV93]), will make this less and less the case.
In our prototype implementation of the authentication service, we have not
invested in hardware to support modular exponentiation at the servers. To com-
pensate for its poor response time, we have designed our security architecture so
that interactions with the authentication service always occur in the background,
off the critical path of any other protocol or computation (see Chapter 4). Nev-
ertheless, to gain insight into the potential performance of our service with such
hardware, we have tested the performance of our prototype with the (software)
modular exponentiation routines removed from the servers. This is obviously an
optimistic simulation of hardware performance. However, in light of the recent
advances in modular exponentiation hardware just mentioned, these tests could be
indicative of the potential performance of our service.
The results of these tests are shown in part (b) of Figure 2.3. The solid line
indicates the mean service response time in milliseconds as a function of n. For
comparison purposes, we also modified our authentication service to simulate its
performance if it used "normal" state machine replication, where each server indi-
vidually computes, digitally signs, and sends its reply to the client and the client
6In these tests we used the C implementation of modular exponentiation provided with the
RSAREF toolkit, licensed free of charge by RSA Data Security, Inc. The RSAREF toolkit
was developed to support privacy-enhanced electronic mail, not interprocess communication,
and faster software implementations of modular exponentiation are available from RSA Data
Security, Inc. and others. However, with any software implementation of which we are aware, the
cost of modular exponentiation would continue to be the primary factor limiting performance in
our service.
24
authenticates k of these replies. (Again, all modular exponentiations were removed
from the servers.) The performance of this approach is shown by the dashed line.
As illustrated in part (b) of Figure 2.3, state machine replication performed
slightly better than our approach in most cases. This is due to the round of server
communication in our approach to disseminate partial results. In the tests of Fig-
ure 2.3, partial results were disseminated using a protocol that ensures Receipt
Validity only, as suggested in Section 2.4.3. To be precise, partial results were
communicated between servers using authenticated point-to-point channels; the
cryptographic mechanisms used for authentication were implemented entirely in
software. Given our assumption of hardware for modular exponentiation, this dis-
semination could possibly be optimized by exploiting this hardware. For instance,
each server could be initialized with its own RSA private key and the public keys
of the other servers, and partial results could be authenticated using RSA digi-
tal signatures. Moreover, this opens the possibility of using hardware multicast
to disseminate partial results, as the same digitally signed partial result could be
multicast to all servers at once.
While the increase in response times as a function of n in our approach stems
from server communication, the increase for state machine replication results from
the client having to verify multiple signatures. In these tests, all public RSA expo-
nents were set to three (i.e., e = 3) for maximum efficiency in verifying signatures.
With the RSA software we employed, signature verification then cost approxi-
mately lOms per verification. This increased cost on the client processor gives us
the opportunity to make an important point: while the response time of our service
is slightly worse, the client processor is free for all but the time required to make
a request and to verify one signature.
It is risky to conclude too much from part (b) of Figure 2.3. We reiterate
that the single most important factor in the performance of both our approach
and state machine replication, namely the speed of modular exponentiation at the
25
servers, was removed from these tests. Moreover, there are several other factors--H
including the optimizations on the dissemination of partial results described above
and the existence of faster software implementations of signature verification with
e = 3 (which will cause the dashed curve to flatten somewhat)--Hthat could impact
these tests. In addition, the authentication service used in these tests is atypical
in that it does not require that client requests be delivered by an atomic broadcast
protocol. (However, since atomic broadcast is generally a requirement of both our
approach and state machine replication, the impact on both should be the same.)
Nevertheless, part (b) of Figure 2.3 does indicate that with efficient hardware
support for modular exponentiation, the protocol of Section 2.4.2 may provide
acceptable response times for some services. This, combined with the additional
features of our scheme (e.g., insulating clients from the service implementation
and removing burden from client processors), supports the hypothesis that our
approach can be a reasonable alternative to state machine replication in some
situations.
2.5 Preserving input causality
One guarantee provided in the previous section is that any response accepted at
a correct or faulty client was computed by a correct server. Even the output of a
correct server, though, may not reflect the way things "should be" if an intruder
has caused the service to deliver improper requests or to deliver requests in an
incorrect order. In general, ensuring proper responses from a correct server requires
access control, because responses computed from state variables that can be written
(directly or indirectly) by corrupt clients cannot be trusted. Access control is an
entire research area in itself and will not be discussed further here.
In tliis section we address the issue of ensuring that requests are delivered
in a correct order by correct and faulty servers. Because we assume an atomic
broadcast protocol to disseminate client requests, we concern ourselves only with
26
the requirement that servers deliver requests in an order consistent with causality
(see Section 2.2). A common method of attempting to preserve causality among
client requests is for each client to refrain from communicating between the time
it issues a request to the service and the time at which the request is delivered at
some faulty or correct server [Sch9O]. While this suffices to ensure that requests
from the same client are delivered in the order issued, this does not suffice to ensure
a causal delivery order for all requests. In particular, consider the case in which a
correct client issues a request to the service, and after seeing the request, a corrupt
server sends a message to a corrupt client. If the corrupt client subsequently issues
a request, then there is a causal relationship between the two requests. However,
it is not clear how this relationship can be detected by correct servers.
To illustrate why this may be important, we borrow an example from [RBG92]:
suppose that the service is a trading service that trades stocks, and that a client
issues a request to purchase shares of stock through this service. After discover-
ing the intended purchase, a corrupt server could collude with a corrupt client as
described above to issue a request for the same stock to the service. If the correct
servers deliver this request before that of the correct client, this request may ad-
just the apparent demand for the stock and raise the price offered to the correct
client. Thus, by allowing the causally subsequent request of the corrupt client to
be delivered before the request of the correct client, a type of "insider trading"
may occur. Moreover, access controls alone cannot naturally avoid this problem,
as the intent is that any client can request to purchase stock at any time.
In the rest of this section, we present new routines request'(m) and deliver'((c, m?)
that replace request(m) and deliver((c, rn?), respectively. Thus, if these are used
with the respond' and accept' routines of Section 2.4, processes will be structured
as in Figure 2.4. These new routines protect clients from the type of attack de-
scribed above, in the sense that any request based on information obtained from a
correct or faulty client's request (c, m? can be delivered at correct or faulty servers
27
only after (c, m). As before, we will use deliver to deliver a request in our imple-
mentation of deliver', and we will use request to issue a request in request'.
(a) Server
Application Module
respond'(c, m)
deliver'((c, m))
Coordination Module
t abcast(m) send(?, m)l
Communication Module
____________t____________
Network
(b) Client
Application Module
request'(m)
accept'(m
Communication Module
Network
Figure 2.4: New structure of processes
In the implementation of request'(m), the client c encrypts m under a public
encryption key of the service before issuing the request to the service. Then, c is
provided the following guarantee if it is correct or faulty. The reader should verify
that this guarantee prevents the aforementioned problem.7
Causality: If deliver((c', m')) is executed at a correct or faulty server s when
deliver((c, m?) has not yet been executed at s, then (c1, m') was issued before
m was decrypted anywhere (other than c).
In addition to satisfying Causality, request' and deliver' must also ensure that
client requests are delivered according to the specification of atomic broadcast
7We do not consider traffic analysis [VK83j or attacks that exploit the malleabihty of the
cryptosystem [DDN9l].
28
i.e., that Delivery Validity (with request replaced by request') and Delivery Order
still hold. The implementations of request' and deliver' that we propose satisfy
Delivery Order and part 1 of Delivery Validity with no further assumptions. To
ensure part 2 of Delivery Validity, we require n > t t 2b.
2.5.1 Threshold cryptosystems
Our deliver' routines at the servers will employ a (k, n)-threshold cryptosystem. A
(k, n)-threshold cryptosystem is, informally, a method of generating a public key
and n shares of the corresponding private key in such a way that for any message m
encrypted under the public key, each share can be used to produce a partial result
from the ciphertext of m, where any k of these partial results can be combined to
decrypt ut. Moreover, knowledge of k shares should be necessary to decrypt m,
in the sense that without the private key it should be computationally infeasible
to (i) decrypt m without k partial results for ra, (ii) compute a partial result for
m without the corresponding share, or (iii) compute a share or the private key
without k other shares.
As with threshold signature schemes, cryptanalytic attacks against threshold
cryptosystems differ from those against conventional public key cryptosystems in
that they may involve the use of partial results and some number of shares, in
addition to plaintext/ciphertext pairs. For our purposes, we will say, informally,
that a (k, n)-threshold cryptosystem is secure if it satisfies the following properties:
1. Possession of only k --H 1 or fewer shares and of partial results for various
ciphertexts does not facilitate decrypting new ciphertexts. That is, if a pos-
sessor of such information can decrypt a new ciphertext, then it could also
decrypt that ciphertext without knowledge of any shares or partial results
for any ciphertexts. This property, which is formalized in [FD92J, says that
the threshold cryptosystem is as secure as the conventional cryptosystem on
which it is built.
29
2. The conventional cryptosystem on which the threshold cryptosystem is built
is resistant to known plaintext attacks. That is, the cryptanalyst cannot
manage to decrypt given ciphertexts even though it can see the plaintext
corresponding to various other ciphertexts (but not corresponding to cipher-
texts of its choice, as would be possible in a chosen ciphertet attack).
As we will see in Section 2.5.2, we would actually prefer a threshold cryptosystem
that is based on a conventional cryptosystem able to tolerate chosen ciphertext
attacks. However, the above definition is in accordance with the security of all
implementations of threshold cryptosystems thus far proposed, in the sense that
all proposed implementations are based upon conventional cryptosystems that are
known to be vulnerable to chosen ciphertext attacks.
Because the acts of signing a message and decrypting a message are oper-
ationally identical in the RSA signature scheme and cryptosystem, one imple-
mentation of a (k, n)-threshold cryptosystem can be obtained directly from the
(k, n)-threshold signature scheme described in Section 2.4.1. Messages would be
encrypted under the public key (e, N) of the service in the usual manner, and
the i-th partial result for an encrypted message m (mI)e mod N would be de-
fined precisely as in Section 2.4.1--Hi.e., am,i mKi mod N. Then, m' Am,T
m Hi?T(am,i)Pi?T mod N for any T of size k. Other implementations of thresh-
old cryptosystems have been proposed, based upon both the RSA and ElGamal
[ElG85] cryptosystems [DF9o,Lll91].
2.5.2 The protocol
Suppose that we are using the RSA threshold cryptosystem described above and
that we have the initial conditions assumed in Section 2.4.2: the i-th server 5j is
secretly given sole possession of K?,the cryptosystem threshold parameter k = bt 1,
any process can reliably obtain the public key (e, N) of the service, and all servers
know (a priori) Pi,T for all i and T. The basic idea of our protocol is that each
30
client c encrypts the contents m of its request with the public key of the service,
in an attempt to force k servers to cooperate to decrypt it. Then, each correct
or faulty server refrains from broadcasting its partial result for the ciphertext of
m until the delivery sequence through (c, m) is fixed locally. In this way, once a
corrupt server collects k partial results for the ciphertext of m, the sequence of
requests through (c, m) has been fixed at some correct or faulty server, and thus
at all correct servers, and no requests can be placed before (c, rn? in the delivery
sequence at correct or faulty servers.
This protocol preserves Causality iff each server requires k partial results for the
ciphertext of m to decrypt that ciphertext. Even under the assumption that this
cryptosystem is secure, however, this unfortunately is not the case with this or any
implementation of a (k, n)-threshold cryptosystem proposed thus far. The problem
is that our protocol as described above allows a corrupt server to mount chosen
ciphertext attacks, against which neither the RSA nor the ElGamal cryptosystem
(nor any proposed threshold cryptosystem based upon them) is resistant. In our
setting, a corrupt server can see at any time how any ciphertext m of its choosing is
decrypted, by simply requesting that a corrupt client c issue (c, m? as an apparently
legitimate request to the service. The corrupt server can then collect k partial
results for m to see the plaintext to which m decrypts.
Methods of using chosen ciphertext attacks against the RSA and ElGamal
cryptosystems are well-known. Here we illustrate one method, originally due to
Moore (see [Den84]), by which a corrupt server can decrypt the ?SA ciphertext
m (ml)e mod N in a correct client's request (c, m) without waiting to receive k
partial results for m.
1. The corrupt server chooses an arbitrary x and computes y xe mod N; i.e.,
yd mod N.5
8The obvious form of this attack, in which x			1 mod N, can be made unproductive for
the corrupt server: if each client signs the contents of its request before encrypting it and servers
deliver a request on behalf of the client whose signature is on the decrypted contents (and not
31
2. Via a corrupt client, the corrupt server issues a request with contents
ym mod N to the service.
3. The corrupt server collects k partial results for vm mod N and forms
(ym)d mod N.
4. The corrupt server computes x?l = y?d mod N, and then
x?l(?m)d = y?dydmd = md = (mt)ed = m' mod N
(NB: If xdoes not have an inverse mod N, then the corrupt server can factor
N and break tfle crvDtosvstem. because gcd(x,N) is a prime factor of N.)
Similar attacks are possible with the threshold cryptosystems in [DF9O,LH91].
Ideally, we would like to find a threshold cryptosystem based on a conven-
tional public key cryptosystem that is tolerant of chosen ciphertext attacks. As
already mentioned, however, no such threshold cryptosystem has been proposed,
and even conventional public key cryptosystems that are tolerant of such attacks
(e.g., [DDN91]) are impractical. Therefore, such attacks must be prevented. (We
also describe a way to detectthese attacks in Section 2.6.3, which may suffice in
some situations.) A simple way to prevent these attacks is to have each client use
a separate public encryption key for the service. That is, the public key for the
l-th client c1 would be a pair (e1,N1), and each server Sj would be given a share
Kj,? for use when cooperating to decrypt a request from client c?. This prevents
chosen ciphertext attacks against the keys and ciphertexts of correct and faulty
clients, because any request that arrives from a corrupt client will be decrypted
using the shares of the key for that client, and not a correct or faulty one. The
practical ramifications of this design, as well as ways to improve it in practice, are
discussed in Section 2.6.
necessarily the client that issued the request), then by issuing a request containing m,the corrupt
server only expedites the delivery of the correct client's request.
32
The one remaining problem in our protocol is that corrupt servers can pro-
vide incorrect partial results that may disrupt decryption. Therefore, it must be
possible for a server to determine when it has properly decrypted a request. To
facilitate this, clients are required to send with each request a message digest of
the request. (See Section 2.4.1 for a description of message digests.) We assume
that the included message digest uniquely identifies, but does not disclose, the
contents of the encrypted message, and that the joint use of message digests and
the threshold cryptosystem does not reveal information that a cryptanalyst could
use to circumvent the properties of either one. The validity of this assumption
obviously depends on the chosen implementation of message digests.
Then, the request' and deliver' routines execute as follows.
Routine request'(m) at client ci:
1. Create the RSA ciphertext m' = mel mod N1 and message digest D =
f(m) of m.
2. Execute request((m', D?).
Routine deliver'((c1, m?) at server s?:
1. If m is not of the form (m', D?, then return to the calling routine.
2. If this is the h-th execution of deliver', then execute abcast((h, ami,i?),
where aml,i = (m?)Ki?l mod N! is $j'5 partial result for m'.
3. Wait until a set of messages ?(s?, (h, ?j??ljET? IT = k, is received such
that (i) each element (sj, (h, c??? is the first message received of the
form (Sj,(h,%")? for any %`, and (ii) f(AmI,T) = D where Am1,T
lljET(?)PiT mod N1. But, wait only until messages of the form
(s?, (h, ???? are received from k + b unique servers s?. If at this point
there is no set ?(sj, (h, oj')')Yj??, IT = k, satisfying (i) and (ii), return
to the calling routine.
33
4. Execute deliver((c1, Am',T?) and return to the calling routine only after
it has completed.
Theorem 2.3 This protocol satisfies Delivery Order.
Proof. By Delivery Order with deliver replaced with deliver', all correct servers
execute the same sequence of deliver' calls. Moreover, by Receipt Order, all correct
servers receive the same sequence of messages. Thus, all correct servers use the
same partial results to attempt to decrypt a request and take the same action on
each execution of deliver'. That is, the correct servers execute the same sequence of
deliver calls. In addition, because the sequences of deliver' and receive executions
at a faulty server are prefixes of those at a correct server, a faulty server takes
the same action in any execution of deliver' that a correct server does, or never
executes deliver again. E]
Theorem 2.4 This protocol satisfies part 1 of Delivery Validity with request re-
placed by request'
Proof. Suppose that c is correct or faulty, and that a correct or faulty server
executes deliver((c, m?). By construction, this could be executed only from an
execution of deliver'((c, (m', D')?) where f(m) = D. By Delivery Validity with
deliver replaced with deliver' deliver'((c, (m', D??) could be executed only if c
executed request((m', D?). Finally, since D uniquely identifies m, c must have
previously executed request'(m). E]
Theorem 2.5 If n > t + 2b, then this protocol satisfies part 2 of Delivery Validity
with request replaced by request'
Proof. If n > t + 2b, then for each execution of deliver' at correct servers, at least
2b + 1 = k + b partial results are broadcast by and, therefore, received at all correct
servers. So, at a correct server, each execution of deliver' eventually completes
and returns to the calling routine; i.e., the service makes progress. Then, Delivery
34
Validity with deliver replaced with deliver' implies that if a correct client c executes
request'(m), then deliver?((c, (m', D??) is eventually executed at all correct servers,
where m' is the ciphertext of m and D = f(m). Since any set of ktb partial results
for m' from k + b different servers contains at least k partial results from correct
and faulty servers, m' can be decrypted and m can be delivered. ?
From the above theorems, we immediately have the following.
Theorem 2.6 This protocol satisfies the specification of atomic broadcast (for
client requests) if n > t + 2b.
We now prove that the above protocol satisfies Causality.
Theorem 2.7 If the threshold cryptosystem is secure, then this protocol satisfies
Causality.
Proof. Suppose that a correct or faulty client c executes request'(m). If c does
not fail for sufficiently long, then c executes request((m", D?), where m" is the
ciphertext of m and D = f(m). Suppose that deliver((c', m'?) for some request
(c', m'? is executed at a correct or faulty server 5 when deliver((c, m?) has not yet
been executed at 5.
We claim that deliver((c', m'?) is executed at 5 when deliver'((c, (m", D??) has
not yet been executed at 5. To see why, suppose otherwise, and consider the exe-
cution of deliver'((c, (m", D??). First suppose that k partial results are received in
step 3 of deliver' that enables 5 to decrypt m" and execute deliver((c, m?). This
contradicts the assumption that deliver((c', m'?) is executed at 5 when deliver((c, rn?)
has not yet been executed at 5. Now suppose that no such k partial results are re-
ceived in step 3 of deliver'. In this case, 5 must not receive k + b partial results from
k+b unique servers, and so deliver'((c, (m", D??) never completes. This contradicts
the assumption that deliver((c', m'?) is executed at 5, because deliver'((c, (m", D??)
must complete before deliver' can be executed again. Therefore, deliver((c', m'?)
35
must be executed at s when deliver'((c, (m", D))) has not yet been executed at 8.
Then, by Receipt Order and Delivery Order with deliver replaced with deliver',
if any correct or faulty server executes deliver'((c, (mtt, D??), then it must have
previously executed deliver( (c', m'?).
If the threshold cryptosystem is secure, then the earliest point at which m"
could be decrypted anywhere is sometime after some correct or faulty server broad-
casts its partial result for m", in deliver'((c, (m", D?)). If no correct or faulty server
ever broadcasts its partial result for m", then Causality is trivially satisfied. Oth-
erwise, let s' be the first correct or faulty server to broadcast its partial result for
m". Since deliver((c', m'?) is executed at St before deliver'((c, (m", D??), it follows
that deliver((c', m'?) is executed at 8? before m" is decrypted anywhere. So, (c', m'?
must have been issued before m" was decrypted anywhere. El
2.5.3 Discussion
In a run with no failures or corruptions, the replacement of deliver with deliver'
results in an additional n executions of abcast, which can be executed concurrently.
Thus, when this protocol is used to disseminate client requests and the protocol
of Section 2.4.2 is used to sign responses, the total message complexity is 2n t 1
broadcasts (n of which can satisfy Receipt Validity only; see Section 2.4.3) plus
the number of responses, structured in four communication phases.
As in the protocol of Section 2.4.2, step 3 of deliver`is potentially expensive,
because it may require a server to sort through (k+kb) subsets of partial results to
be able to decrypt a request. In fact, a corrupt client can force each server to sort
through (k+kb) subsets by sending a ciphertext and digest that do not "match."
As before, we rely on a small b and a small number of corrupt servers in the
common case to reduce the expected cost of decrypting requests from correct or
faulty clients. A bad request from a corrupt client can be detected in a reasonable
amount of time if b is small, and then subsequent requests from that client can be
36
ignored. In Section 2.5.4, we discuss a cryptographic advance that could eliminate
the exponential cost of this search as a function of b, although we reiterate that we
do not view this as essential for our protocol to be of use in real systems.
Assuming that service responses are verified at all clients with a single public
key but that each client that desires Causality has an additional public encryption
key for the service, each client possesses at most two public keys for the service.
Each server now possesses one share and modulus per client desiring Causality, in
addition to the share and public key with which it cooperates to sign responses
and the O(n log n) bits of public information needed for computing the Pi,T'5, (The
same Pi,T'5 can be used for all instances of the signature scheme and cryptosystem.)
A more general treatment of detecting causal relationships in hostile environ-
ments can be found in Chapter 5. In the framework presented there, the attack
addressed by the protocol of Section 2.5.2 is termed causal denzal. Chapter 5 also
presents a class of attacks called causal forgery, but in the context of this chap-
ter, preventing causal denial is sufficient to satisfy the causal requirements on the
delivery of requests to the servers. While the protocol of Section 2.5.2 does not
fully prevent denial in the sense of Chapter 5, we believe that the Causality guar-
antee should suffice for virtually all applications. The interested reader should see
Chapter 5 for a detailed description of these attacks and measures to address them.
2.5.4 Alternative approaches
In this chapter we have concentrated on techniques that do not require clients to
be able to authenticate servers individually or to be aware of the internal security
policies of the service. If we can require these, however, or if we can strengthen
other parts of our system model or the cryptosystems available to us, then other
alternatives for preserving Causality become possible.
For instance, suppose that a client knows b and has a public encryption key
for each server. Then, one alternative to the protocol of Section 2.5.2 employs a
37
(k, n)?secret sharing scheme [Sha79j, which is a scheme for breaking a secret m
into n shadows such that any k shadows can be used to reconstruct m but fewer
than k cannot. In request'(m) a client could use such a scheme to break m into n
shadows with k = b + 1, encrypt the i-th shadow under the public key of the i?th
server, and issue the encrypted shadows and a message digest of m as its request.
To enable servers to detect chosen ciphertext attacks, each client could include its
identity with each shadow before encrypting the shadow. Then, if in deliver' each
correct server broadcasts the shadow that it can decrypt iff the identity with the
shadow matches the requesting client (and a "bad shadow" message otherwise),
Causality can be achieved. While this approach places a larger burden on each
client, it does offer the advantage that all clients can use the same public keys for
the servers. Moreover, some improvements might be possible if the secret sharing
scheme is verifiable (e.g., [Fel87]), in the sense that servers can detect if the client
or other servers provide incorrect shadows.
Other options become available if the entire system of clients and servers is
synchronous, i.e., if there are known bounds on all message transmission times and
processes' rates of execution. In this case, maximum latencies for (deterministic)
protocols between clients and servers can be predicted. Thus, alternatives that
employ multiple interactions with clients become possible, because an unresponsive
client is detectable to the servers and need not hinder progress of the service. For
instance, to make a request a client could issue only a message digest of the request
to the servers. Once the client knows that deliver' has been initiated for its request
at a correct server (which it knows because, e.g., there is a known maximum latency
until this occurs, as in [CASD85,GSTC90]), the client could then provide its actual
request. The correct servers would deliver the request only if the message digest of
the request matches the digest with which deliver' was initiated. If the digests do
not match or if the servers receive no actual request within some maximum latency
period, deliver' would return without delivering a request. While this solution is
38
attractive in that it entirely avoids the use of cryptosystems, it only guarantees
Causality for correct (but not faulty) clients: if a client failed while sending its
actual request, there is a risk that only corrupt processes would obtain the actual
request and the correct servers would return from deliver' without delivering the
request. This would then allow a corrupt client to issue a causally subsequent
request that would be delivered before the faulty client's.
Verifiable secret sharing schemes, mentioned above, bring to mind the notion
of a vernfiable threshold cryptosustem, which to our knowledge was first proposed
by Yair FYankel. In addition to the properties discussed in Section 2.5.1, such a
scheme would enable servers to verify that partial results from other servers were
created correctly. To our knowledge, no implementation of such a scheme has been
proposed. However, in principle this detection capability could enable us to make
several improvements to the protocol of Section 2.5.2, such as ensuring that part
2 of Delivery Validity holds even if t + b ? n ? t + 2b, disseminating partial results
by a broadcast that satisfies Receipt Validity only, and eliminating the (worst
case) exponential growth as a function of b of the time to search for partial results
that decrypt a request in deliver'. Thus, like threshold cryptosystems tolerant of
chosen ciphertext attacks (see Section 2.5.2), verifiable threshold cryptosystems
are a cryptographic advance that could possibly improve our techniques.
2.6 Implementation issues
In this section we comment on three practical issues that bear on the implemen-
tation and use of our protocols. We do not discuss all relevant issues; several
important problems, such as the periodic changing of keys or the distribution of
key shares, have purposely been omitted from discussion because they are similar
to problems encountered in other cryptographic systems and can be solved using
known techniques. Instead, our comments here are restricted to issues that may
not be pertinent to most other cryptographic protocols.
39
2.6.1 Atomic broadcast
Like state machine replication [Sch9Oj, our techniques assume the existence of an
atomic broadcast protocol, whose specification is given in Section 2.3. It is well-
known, however, that it is impossible to find a deterministic solution to consensus,
and thus atomic broadcast, in an asynchronous system that can suffer even a
single failure [FLP85]. Thus, to implement atomic broadcast, either the underlying
system must be made synchronous or a randomized protocol must be used.
Atomic broadcast protocols have already been devised for synchronous sys-
tems (e.g., [CASD85,GSTC90,Sch9O]) and have been implemented in some efforts
(e.g., [SES+92]). With such a protocol, only the subsystem of servers must be
synchronous, because an atomic broadcast protocol for client requests can be im-
plemented using an atomic broadcast protocol for the servers alone, as outlined
in Section 2.3. Thus, this approach might be feasible in a situation where the
servers can communicate by a highly reliable and predictable network, but where
the clients are more widely dispersed and unable to communicate with the servers
by such predictable means. Moreover, if the subsystem of servers is synchronous,
then the protocol of Section 2.5.2 can easily be modified to satisfy Delivery Validity
evenift+6< n < t+2b.
There has been less work on atomic broadcast protocols for asynchronous sys-
tems. However, in a private communication in May 1992, T. D. Chandra described
a method to transform any solution to asynchronous consensus into a solution to
asynchronous atomic broadcast. Since many (randomized) solutions to consensus
exist for asynchronous systems (see [CD89J), this translation yields atomic broad-
cast protocols for asynchronous systems.
Several systems have circumvented the impossibility of deterministic, asyn-
chronous atomic broadcast in models where (only) failures can occur by employing
an appropriate membership protocol (e.g., [MPS92,Ric92J). A membership proto-
col is used to remove a process from participation in the broadcast protocol if it
40
appears to be faulty. While this risks the exclusion of a correct process from the
broadcast protocol, it eliminates the factor that makes atomic broadcast impos-
sible in asynchronous systems, namely that it is impossible to determine whether
a process has actually failed or is only very slow. We are presently investigating
membership protocols appropriate for use in our system model that would enable
us to achieve a variation of atomic broadcast that is sufficient for our purposes.
We should also note that for services in which requests commute and input
causality is not a concern, atomic broadcast is not necessary. For instance, this
was the case in the authentication service discussed in Section 2.4.4.
2.6.2 Public encryption keys
Recall that the protocol of Section 2.5.2 requires that each client have a different
public encryption key for the service, in order to prevent a corrupt server from
mounting chosen ciphertext attacks against correct or faulty clients' ciphertexts
or keys. In general, having separate cryptosystem parameters for each client may
require that each client be individually "registered" with the service before using
it, so that the client can obtain a public key for the service and the administrator
of the service can provide to the servers the shares necessary to decrypt this client's
requests. While this could limit the settings in which our protocols can be used, we
believe that this is not unreasonable if maintaining causality among client requests
is a substantial concern.
Nevertheless, in practice it may be possible to relax the requirement that each
client have its own public encryption key for the service. For instance, if a group of
clients mutually trust one another to not participate in chosen ciphertext attacks
against one another, then the members of that group could use the same public
encryption key for the service. To illustrate this, recall the trading service example
of Section 2.5. Assuming that a broker will not cooperate in chosen ciphertext
attacks against other brokers in the same firm, all brokers in a single firm could
41
use the same public encryption key for the service, i??ulting iii ct ?i"iplification in
key management and a reduction in storage requirements at the servers. Another
way in which the requirement that each client have its own public encryption key
could be relaxed is described in Section 2.6.3.
2.6.3 Detection of corrupt processes
One resource that we have not fully exploited in our protocols is the ability to
detect corrupt processes based on messages received from those processes. Such
techniques have been used, e.g., in numerous Byzantine agreement protocols to
isolate corrupt processes. In practice, methods to detect and quarantine corrupt
processes could be used to deter such processes from mounting attacks and to
improve overall system performance.
One example of how a corrupt client can be detected was given in Section 2.5.3:
if a correct or faulty server executes deliver'((c, m)) but this routine returns with-
out delivering a request, then c must be corrupt. In particular, the chosen cipher-
text attack illustrated in Section 2.5.2 would lead to the detection of the corrupt
client, because that client could not compute f((ym)d mod N) without knowing
(ym)d mod N (and thus deliver' would return after the third step). Moreover, this
test would detect any client issuing any ciphertext for which it did not know the
corresponding plaintext. If this threat of detection is believed to be sufficient to
deter chosen ciphertext attacks, then all clients could be allowed to use the same
public encryption key for the service.
Our protocols could also benefit from techniques to detect corrupt servers. For
instance, in Section 2.5.4 we sketched several improvements to our protocols that
would result from the ability to detect when a server has provided an incorrect
partial result. Other detection techniques and ways to exploit them are topics of
ongoing research and will not be discussed further here.
42
2.6.4 Summary
While the requirements that an underlying atomic broadcast protocol be available
and that each client use a different public encryption key may seem to reduce
the practicality of our protocols, in many real systems these will not constitute
substantial restrictions. We have outlined approaches to implementing atomic
broadcast, as well as methods for relaxing the requirement that each client have its
own public encryption key for the service. We believe that, when combined with
techniques to detect and isolate corrupt processes, our protocols can be applied in
many types of systems to achieve fault-tolerant and secure service implementations.
2.7 Related work
This work was largely inspired by [Gon89a], which presents a replicated, shared
key authentication service. The authentication service described there allows two
principals to establish a secret, shared encryption key provided that n > t t
b. The method discussed in the present work cannot immediately be applied to
construct such a service, because of the additional secrecy requirements. However,
as discussed in Section 2.4.4, our method has been used to construct an analogous
public key authentication service. Moreover, unlike our scheme, the scheme in
[Gon89a] requires each client to authenticate each server individually.
Using the state machine approach to construct services tolerant of some server
corruptions was first considered in [Lam78a]. Since then, other authors have fo-
cused on secure replication of data. Secure data replication using quorum methods
is considered in [HT88] for the case in which both data integrity and secrecy are
important. In these schemes, however, clients are again expected to be able to au-
thenticate data repositories individually. In [Rab89], a space-efficient information
dispersal algorithm is developed to facilitate the provision of data integrity and
availability. The scheme decomposes a file F into n pieces, each of size Fl/i, such
that any 1 pieces suffice to reconstruct F.
43
2.8 Summary
We have presented a ntethod for securely replicating services using a modification
of the state machine approach. Using our technique, a service can be replicated
as n servers, n > t t b, so that a client will always accept a response computed
by a correct server, and will not accept other responses. We have also addressed
the issue of maintaining causality among client requests. We illustrated a security
breach resulting from an intruder's ability to violate causality, and we presented an
approach to counter this problem. This approach requires that n> t t 2b to guar-
antee that the service makes progress. An important feature of our methods is that
they free the client of the responsibility of authenticating servers. This is achieved
by employing two recent advances in cryptography, namely threshold cryptosys-
tems and threshold signature schemes. We have argued for the feasibility of our
techniques through a discussion of several issues pertinent to their implementation
and use, and through an example that illustrates their performance.
Chapter 3
Fault-tolerant key distribution
3.1 Introduction
In open networks, an intruder can attempt to initiate spurious communication in
two ways [VK83]: it can try to initiate communication under a false identity, or it
can replay a recording of a previous initiation sequence. Many authentication or
key distribution protocols have been proposed to protect against these attacks (e.g.,
[NS78, DS8l, 0R87, SNS88]). These protocols allow principals (e.g., computers,
users) initiating communication to verify each others' identities and the timeliness
of the interaction. Most also arrange for the involved principals to share a secret
cryptographic key by which subsequent communication can be protected, or to
possess each others' public keys, by which communication can be protected or a
shared key can be negotiated.
Authentication protocols typically employ a trusted service, commonly called
an authentication service [NS78], to counter the first type of attack. In shared key
protocols, the authentication service normally shares a key with each principal and
uses these keys to distribute other shared keys by which principals communicate.
In public key protocols, the authentication service usually has a well-known public
key and uses the corresponding private key to certify the public keys of principals;
services of this form are also called certification authorities [BAN89j.
44
45
A predominant technique to detect replay attacks in authentication protocols
is to incorporate into each protocol message the time at which the message was
generated; the message is then valid for a certain lifetime, beyond which it is
considered a replay if received [DS81]. Timestamp-based replay detection has been
used in several systems (e.g., [SNS88,TA91]) and is often preferable to challenge-
response techniques [NS78], because it results in fewer protocol messages and less
protocol state. However, using timestamps requires that all participants maintain
securely synchronized clocks. In practice, clock synchronization is usually achieved
via a time service, as in [GZ84,Mil89].
The dependence of authentication protocols on authentication and time services
raises troubling security and availability issues. First, the assurances provided by
authentication protocols directly rely on the security of these services, and thus
these services must be protected from corruption by an intruder. Second, the un-
availability of these services may prevent principals from establishing secure com-
munication, or even open security "holes" that could be exploited by an intruder.
For instance, the unavailability of a time service could result in clocks drifting far
apart, thereby exposing principals to replay attacks. To increase the likelihood of
these services being available, they could be replicated. However, as already noted
in Chapters 1 and 2, in many environments this is dangerous, because replicating
services makes them inherently harder to protect.
In this chapter we present techniques to reconcile the conflict between secu-
rity and availability in these services. By using replication only when necessary,
and introducing novel replication techniques when it was necessary, we have con-
structed these services to be easily defensible against attack. And, the transient
unavailability of even a substantial number of servers does not hinder key distribu-
tion between principals or expose protocols to intruder attacks. Client interactions
with the services are simple and efficient, and the services can be used with many
different authentication protocols.
46
We present our time service in Section 3.2 and our authentication service in
Section 3.3. The use of these services in the Horus security architecture is discussed
in Chapter 4.
3.2 The time service
The security risks of clock synchronization failures in authentication protocols are
well-known [DS8i,Gon92], and the need for a secure time service has been recog-
nized in several systems (e.g., [Mil89,BM9O]). We claim, however, that the case for
a highly available time service is not as clear. It is true that an extended period of
unavailability might cause principals to have increasingly disparate views of real
time. But, this in itself need not result in security weaknesses or inhibit communi-
cation too quickly. In evidence of this, the algorithm we propose by which clients
estimate real time allows key distribution to proceed securely even during a lengthy
unavailability of the time service. This has allowed us to explicitly not replicate
our time service so that it will be easier to protect, and to achieve resilience to a
time service unavailability through the client algorithm for estimating time.
We describe this algorithm in Section 3.2.1 and discuss alternatives to our
approach in Section 3.2.2. As will be discussed in Section 3.2.2, the algorithm of
Section 3.2.1 is heavily influenced by previous work in clock synchronization. As
such, its contribution mainly lies in how clock synchronization techniques can be
adapted for use in our setting to achieve simple, fail-safe time estimation in key
distribution protocols with an easily defensible, centralized time service.
3.2.1 The algorithm
Clients interact with our time service by the simple RPC-style protocol shown in
Figure 3.1. We assume that the time server possesses a private key KT whose
corresponding public key is well-known. (There is a similar shared-key protocol.)
At regular intervals, a client queries the time service with a nonce identifier N
47
[NS78], a new, unpredictable value. When the time server receives this request,
it immediately generates a timestamp T equal to its current local clock value and
replies with fN, TlKT, i.e., the nonce and the timestamp, both signed with KT.
The client considers the response valid if it contains N and can be verified with
the public key of the time service.
C?T: N
T?C: ?,T?T
Figure 3.1: Protocol by which client C interacts with time service T
The method by which a client uses this response rests on the following additional
assumptions:
1. The client has access to a local hardware clock H that measures the length
t --H t' of a real time interval [t', t] with an error of at most p(t --H tt) where p
is a known constant satisfying 0 ? p ? 1. That is,
(1 --H p)(t --H t') < 11(t) --H 11(t') < (1 + p)(t --H t'). (3.1)
If p is estimated too optimistically so that the actual drift rate of the client
is outside of [--Hp, p], then the client may be subject to replay attacks or
may subject others to replays of its messages. So, the value of p should be
estimated conservatively; e.g., a p of i0--H5 should be sufficiently conservative
for most types of quartz clocks [Cri89] 1
2. The time server's clock is perfectly synchronized to real time. We could incor-
porate time server drift into our formulas, although for all practical purposes,
1Keith Marzullo has suggested the possibility of dynamically measuring p on a per-client basis
[Mar93]. However, we do not pursue this here.
48
perfect synchronization can be achieved by attaching a WWV receiver or a
very accurate clock to the time server's processor via a dedicated bus. Alter-
natively, the time service could be viewed as defining time in the system.
3. There are known, minimum real-time delays min1 and min2 experienced,
respectively, between when a client initiates a request to the time service and
when the time server receives that request, and between when the server reads
its local clock value and the response is verified as authentic at the client.
In our implementation, min2 is substantially longer than mini, because it
includes the delays for signing and verifying the response.
Under these assumptions, the following theorem holds:
Theorem 3.1 Immediately after a client receives and verifies a response from the
time service, the client can characterize the current real time t by:
f ? [T + min2, T + r/(1 --H			--H mini],			(3.2)
where T is the timestamp in the response and r is the round trip time measured
by the client, beginning when it sent the request and ending at time t) i.e., after it
verified the response.
Proof. Let mini + ?i and min2 + ?2 be the real time delays experienced, re-
spectively, between when the client sent the request and the server received it,
and between when the server read its local clock and the client had verified the
response as authentic. Then, R mini + min2 + Ol + ?2 is the real round trip
time. Since Ci, 02 > 0, we have that 0 < 02 < 1? --H mini --H min2, and so after the
client verifies the response, real time t T + min2 + C2 satisfies
E [T+min2, T+R--Hmini]. (3.3)
By (3.1), it follows that R ? r/(1 --H p), and combining this with (3.3) gives us the
desired result.
49
Combining (3.1) and (3.2), the client can characterize any later time t > t by:
where
and
t ? [L(t), U(t)],			(3.4)
L(t) = (H(t) --H H(t))/(1 + p) + T + min2
U(t) = (H(t) --H HU))/(1 --H p) + T + r/(1 --H			--H mini.
To estimate the time, the client uses either L(t) or U(t), depending on which
is more conservative. In particular, to detect replays of authentication protocol
messages, principals use the following rules for estimating time:
1. When timestamping an authentication protocol message to allow others to
detect a later replay of that message, the sender sets the message timestamp
to T =
2. A recipient accepts an authentication protocol message with timestamp T as
valid at time t only if T + A > U(t), where A is the predetermined lifetime
of the message.
The benefit of this scheme is that it is fail-safe [5575], in the following sense:
Theorem 3.2 An authentication protocol message with lifetime A sent by a (cor-
rect) client at time t will never be accepted by another (correct) client after time
t+A.
Proof. Suppose a client sends an authentication protocol message at time t. The
timestamp T = L(t) for the message satisfies T < t. Now consider a recipient
at time t + A, where A is the lifetime of the message. Since at the recipient,
t + A < U(t + A), it follows that T + A ? U(t + A). Thus, the message will be
rejected as invalid. ?
50
Because the interval (3.4) grows wider with time, each client periodically resyn-
chronizes with the time service in order to narrow its interval. A successful resyn-
chronization results in new values of H(f), r and T for the calculation of U(1) and
L(t). Resynchronization attempts can fail, however, when the round trip time r for
the attempt exceeds some timeout value. When this happens, the client continues
to attempt to resynchronize with the service at regular intervals, while retaining
the values of T, r and H(t) obtained in the last successful resynchronization to cal-
culate L(t) and U(t). So, if the service becomes unavailable, clients' intervals will
continue to widen. If the service is unavailable for too long, eventually principals'
values of U(t) will exceed their values of L(t) by the protocol message lifetimes,
and all messages will be perceived as expired immediately upon creation.
While this bounds the amount of time that the system can operate without
the time service, calculations in our system indicate that this bound is not very
tight. For example, consider two principals P1 and P2, each of whose clocks is
characterized by p = iO--H5, and suppose for simplicity that the values of f and T
corresponding to the last resynchronization for each prior to a time service crash
are the same. Moreover, suppose that min1 = mim2 = 0 and that the value of r
obtained by P2 in its last resynchronization is .5 seconds. Then, even if the clocks
of P1 and P2 drift apart at the maximum possible rate i.e., the clocks of P1 and
P2 are as slow and as fast as possible, respectively, while still satisfying (3.1) it
will be over 20.4 hours from t before the value of U(t) at P2 exceeds the value of
L(t) at P1 by 30 seconds, a relatively short message lifetime in comparison to that
suggested in [DS8l]. In addition, since clocks usually do not drift apart at the
maximum possible rate, tests in our system show that a time service unavailability
can typically be tolerated for much longer. These results lead us to believe that
the system, if tuned correctly, should be able to operate without the time service
for sufficiently long to restart the time service, even if the restart requires operator
intervention. More comprehensive testing of this hypothesis is currently underway.
51
3.2.2 Comparison to alternative designs
We derived our algoritlim from that presented in [Cri89] for implementing a time
service. The primary difference between ours and that in [Cri89] lies in how clients
use the interval (3.2). In the latter, the client uses the midpoint of (3.2) as its
estimate of the time at time t? as this choice minimizes the maximum possible
error, and the client estimates future times as an offset, equal to the measured
time since the last resynchronization, from this midpoint.2 However, like any
other clock synchronization algorithm in which each client maintains a single clock
value, this algorithm is not fail-safe: e.g., if the midpoint of (3.2) were too low, then
the client's future estimates of the time would tend to be low, and thus expired
messages may be incorrectly accepted. We feel that our approach, which is fail-safe,
is better for our purposes.
A reasonable alternative to not replicating our time service is to replicate it
in such a way that it provides a correct service despite some server failures and
corruptions. For instance, a client could use the robust averaging algorithm of
[Mar90] to obtain an interval of bounded inaccuracy containing real time from
a set of n time servers, if fewer than [n/3j servers are faulty or corrupt. This
approach might be attractive if clients are highly transient and thus a time service
unavailability will prevent large numbers of clients from initially synchronizing
with the service at client startup. However, this is unlikely to be the case in most
systems, where time service clients are sites that do not tend to reboot frequently.
Moreover, a replicated time service places a larger burden on the administrator
of the service than does ours, as the administrator must protect multiple servers,
instead of only one, to ensure the integrity of the service. For these reasons and the
additional costs of replication (e.g., authenticating and maintaining multiple time
servers), we feel that a replicated time service is difficult to justify for our purposes.
2This is a simplification of the algorithm in [Cri89j; the actual algorithm also takes measures to
ensure that client clocks are continuous and monotonic. These features, however, are unimportant
for our purposes.
52
Also discussed in [Mar90] are approaches to evaluating a predicate on a physical
state variable despite the impossibility of accurately measuring that variable. It is
observed that given a range of values that is known to contain the actual physical
value, safe evaluation of the predicate may require that all values in the range
satisfy the predicate, or that only so?e value in the range does. Our approach
of estimating time conservatively with the endpoints of (3.4) can be viewed as an
instance of the former approach, where the physical state variable being measured
is time, the range containing time is (3.4), and the predicates of interest relate time
to timestamps in authentication protocol messages.
Numerous other approaches to clock synchronization have been proposed (see,
e.g., [SWL9o]), but for brevity, we do not discuss them all here. Unlike ours,
however, most assume upper bounds on message transmission times or employ
greater distribution, thereby increasing the number of components that must be
protected in the system. Moreover, to our knowledge none provide a fail-safe
algorithm for estimating time in authentication protocols. We thus feel that our
approach is unique in providing this property with relatively few requirements.
3.3 The authentication service
Like those in [TA9l,LABW92], our authentication service is the public key type,
that creates public key certificatesfor principals. A certificate ?P, T, Kp?1)KA con-
tains the identifier P of the principal, the public key Kp?1 of the principal, and the
expirationtime T of the certificate, all signed by the private key KA of the authen-
tication service. A principal uses certificates to map principal identifiers to public
keys, by which those principals (who presumably know the corresponding private
keys) can be authenticated; the details are discussed in [LABW92]. In general, a
principal can request a certificate for any principal from the authentication service.
The need for security in such an authentication service is obvious: as the undis-
puted authority on what public key belongs to what principal, the authentication
54
As indicated above, a natural choice for k is k = [n/2 + ii, so that a majority of
correct servers ensures both the availability and the integrity of the service.
Our technique employs a threshold signature scheme. Informally, a (k, n)
threshold signature scheme is a method of generating a public key and n shares of
the corresponding private key in such a way that for any message m, each share
can be used to produce a partial result from m, where any k of these partial results
can be combined into the private key signature for m. Moreover, knowledge of k
shares should be necessary to sign rn, in the sense that without the private key it
should be computationally infeasible to
1. create the signature for m without k partial results for m,
2. compute a partial result for m without the corresponding share, or
3. compute a share or the private key without k other shares.
The replication technique does not rely on any particular threshold signature
scheme. For our authentication service, we have implemented one in [FD92], which
is based upon RSA [RSA78].
Given a (k, n)-threshold signature scheme, we build our authentication service
as follows. Let A = fAS1,... , ASni be the set of servers. We first choose a
threshold value k and create n shares from the private key KA of the authentication
service. Each authentication server ASj, when started, is given the i-th share of
KA, its own private key KASi?th? public key corresponding to the private key
KAs?? of each server ASj, and the public keys for all principals. It is also given the
public key of the time service to synchronize its clock as in Section 3.2.1.
The protocol by which clients obtain certificates from the authentication service
is shown in Figure 3.2. A client C requests a certificate for a principal P by sending
the identifier for P and a timestamp T to the servers. The purpose of T is to give
the servers a common base time from which to compute the expiration time of the
55
certificate;3 we discuss how C chooses T below. When each server ASi receives the
request, it extracts T and tests if T is no more than its current value of L(t). If
this is the case, it produces its partial result pri(P, T + A, Kp1) for the contents
(P, T + A, Kp1) of P's certificate, where A is the predetermined lifetime of the
certificate. ASi then sends pr?(P, T + A, Kp1) to the other servers, signed under
its own private key. (Alternatively, partial results can be sent over point-to-point
authenticated channels as in Section 2.4.4, rather than being authenticated by
digital signatures.) When ASi has authenticated k --H 1 other partial results from
which it can create the certificate ?P,T + A, KP?11KA, it sends the certificate to
C. C accepts the first properly signed certificate for P with an expiration time
sufficiently far in the future (see below), and ignores any other replies.
C?A: P,T
(Vi) ASj A: (P, T + A, prj(P, T + A, KP?1)1KAsi
(Vi) AS, H C fP,T+ A,KP?1?KA
Figure 3.2: Protocol by which client C obtains a certificate for principal P
It is easy to see why this protocol provides the Integrity and Availability guar-
antees stated above. Informally, Integrity holds because if only fewer than k servers
are corrupted by an intruder, then the corrupt servers do not possess enough shares
to sign a certificate; i.e., they need the help of a correct server. Availability holds
because if at least k servers are correct, then the correct servers possess enough
shares to sign a certificate and can do so using this protocol.
Because each correct server produces a partial result only if T is no more than
3In a prior version of this protocol, each server used its value of L(t) when the request was
received as the base to compute the expiration time. This version was more sensitive to clock
drifts and variances in request delivery times. We are grateful to Brad Glade for suggesting the
inclusion of the timestamp in the first message of this protocol.
56
its value of L(t), where t is the time at which it receives the request, any certificate
produced from its partial result has an expiration timestamp of at most t + A.
A principal accepts a certificate as valid at some time t only if the certificate
expiration time is greater than the principal's value of U(t), which ensures that the
certificate expiration time has not been reached. So, like authentication protocol
messages (see Section 3.2.1), a certificate will never be considered valid for longer
than its intended lifetime.
A client's choice for T is constrained by two factors. On the one hand, for a
certificate to be produced, each of k different servers must find T to be at most
L(t), where t is the time at which the server receives the request; so, choosing T
too high prevents a certificate from being produced. On the other hand, since the
certificate's expiration time is T + A, the client shortens the effective lifetime of
the certificate by choosing T too low. So, a client should choose T to be close to,
but less than, what it anticipates will be the correct servers' values of L(t) when
they receive the request.
In practice, it works well to have a client, when sending a reqwest at time t,
to set T to its own value of L(t) minus a small offset ? > 0, and to increase 6
on subsequent requests if prior attempts to obtain a certificate failed. Because an
unavailability of the time service will generally cause clients' values of L(t) to drift
from those of the servers, during a lengthy unavailability a client may need to set
6 to several seconds to obtain a certificate, at the cost of reducing the effective
lifetime of the certificate by that amount. However, since certificate lifetimes are
typically at least several minutes, this would normally reduce the effective lifetime
by only a small fraction.
3.3.2 Comparison to alternative designs
As previously mentioned, we are not the first to notice the conflict between security
and availability in the construction of authentication services. In [Gon93], Gong
57
(usually			ODOl			Asi
CDC?			As?n
(a) Offline certification au-
thority (CA). CA deposits
long-lived certificates in
an online, replicated cer-
tificate distribution center
(CDC).
(b) State machine replica-
tion. Client authenticates
certificate from each server
and accepts key that oc-
curs in a majority of cer-
tificates.
Client
(c) Transparent state ma-
chine replication. Client
authenticates single cer-
tificate that could have
been created only by a ma-
jority of servers.
Figure 3.3: Design options for a fault-tolerant public key authentication service
proposed a method for dealing with this tradeoff in shared key authentication ser-
vices such as Kerberos [SNS88J. Lampson, et.al., [LABW92] described a different
solution that is appropriate for a public key authentication service similar to ours,
which they call a certification authority.
In the latter solution, which is also implemented in SPX [TA91], the certification
authority is not replicated and can even be taken offline, to make it easier to
protect (Figure 3.3a). To reduce the impact of its limited availability, it produces
long-lived certificates that are stored in and distributed from an online certificate
distribution center (CDC), which can be replicated for high availability [TA91].
Because certificates are long-lived, however, there must be some way to revoke
them securely. For this reason, certificates are obtained only from CDC replicas,
so if necessary, a certificate can be revoked by deleting it from all replicas. That is,
a client accepts a certificate only if both the highly secure certification authority
and a CDC replica endorse it. A disadvantage of this scheme, noted in [LABw92],
58
is that the corruption of a CDC replica could delay the revocation of a certificate.
This problem could be addressed by using our technique described in Chap-
ter 2 to securely replicate the CDC. However, our approach presented in Section
3.3.1 of securely replicating the authentication service itself (Figure 3.3c) addresses
this problem more directly. Since the authentication service is online and highly
available, it can refresh certificates frequently and create them with short lifetimes.
Thus, the window of vulnerability between the disclosure of a principal's private
key and the expiration of the principal's certificates can be greatly shortened, mak-
ing revocation of existing certificates less crucial. If the ability to revoke certificates
is still desired, however, our authentication service could easily be adapted to pro-
duce certificate revocation lists. And, of course, once the disclosure of a principal's
private key is discovered, the principal's public key can be removed from the au-
thentication servers so that no more certificates containing it are produced.
Our technique also has advantages over state machine replication [Sch90] (Fig-
ure 3.3b) of the authentication service (or the CDC of [TA91,LABW92]). First,
our approach requires less state at the client: the client needs only a single public
key for the service, and need not maintain secure channels to the servers or retain
any other replies but the first properly signed one. The last of these is especially
beneficial if a principal obtains and forwards its own certificate to its partners in
cryptographic protocols, as in the "push" technique described in [LABW92]. If
the service were implemented using state machine replication, the principal would
need to collect and forward a number of certificates equal to the size of a majority
of the servers. Second, our technique requires the client to authenticate less com-
munication, as the client can ignore all replies after the first one it accepts as valid.
Third, in our approach the configuration of the authentication service is largely
transparent to clients, and so servers can be added or removed more easily.
There is, however, at least one disadvantage of our scheme with respect to the
others mentioned here: due to the round of server communication, a client is likely
59
to wait longer for a response in our scheme. As will be illustrated in Chapter
4, though, in many situations communication with the authentication service can
be performed in the background, off the critical path of any other protocol or
computation, and in advance of any actual need for a certificate. (The certificate
so obtained is then cached until the need for it arises.)
3.4 Summary and future work
In this chapter we have considered the need for authentication and time services
to securely and fault-tolerantly support cryptographic key distribution. In our ap-
proach, we have chosen to replicate the authentication service, because by making
the authentication service highly available, we gain greater flexibility in choosing
certificate lifetimes and thus less reliance on a secure revocation mechanism than
approaches that use a nonreplicated service. To compensate for the security risks
of this replication, the service is built to tolerate a minority of server corruptions.
The time service is not replicated, so that it is easier to protect. Moreover, its
unavailability does not result in security breaches or hinder clients that continue
to operate correctly. In fact, it could be temporarily taken offline for further
protection if the need arises. While techniques exist for replicating time services
so that some number of server corruptions could be tolerated, we have found that
the additional costs of replication are difficult to justify.
An area for future work is adapting the services described in this chapter to
very large systems. The services in their current form cannot scale to very large
systems, for both security and performance reasons. In a large system, the ser-
vices may be overwhelmed, and there may not be a single authority trusted to
protect them. To alleviate this, an instance of these services could be employed
per administrative domain, as in [SN888,LABW92]. An alternative deployment of
the authentication service would be to place each domain in charge of a different
server. These directions hold many possibilities for future results.
Chapter 4
Secure process groups
4.1 Introduction
As discussed ill Chapter 1, the authentication and time services of Chapter 3 have
been implemented as part of our security architecture for fault-tolerant systems.
The architecture was developed in an effort to integrate security into the Horus
system. Like its predecessor Isis [BJ87b,BSS91J, llorus is a system for building
fault-tolerant applications using the process group model of computation [Bir93].
While the security architecture has been implemented in liorus, we expect that
its integration into Isis will be straightforward, and in principle it could also be
applied ill other group-oriented systems such as V [CZ85J, Amoeba [KT9i], and
Transis [ADKM92].
In this chapter, we discuss the design, implementation and performance of the
architecture in Horus. We begin by describing the programming model of secure
process groups that is supported by the security architecture, with an emphasis
on the security guarantees that augment the Horus process group abstraction. We
then discuss several aspects of the secure group implementation: the cryptographic
keys used in a group, the group join protocol, and secure group communication. As
group communication constitutes the vast majority of activity in most Isis appli-
cations today (and thus, we expect, in most Horus applications in the future), we
60
61
also compare the performance of group communication in the secure and insecure
versions of the system.
4.2 Programming model
The basic abstraction provided by llorus is the process group, which is a collection
of processes with an associated group address. Groups may overlap arbitrarily, and
processes may create, join and leave groups at any time. Processes communicate
both by point-to-point methods (e.g., RPC) and by group mu1t?cast, i.e., by mul-
ticasting a message to the entire membership of a group of which it is a member.
Horus further supports the model of v?rtual synchrony [BJ87a], so that message
deliveries and notification of group membership changes (i.e., changes to the group
mew) appear reliably and in a consistent order at all group members, even when
failures occur.
The security architecture makes the Horus programming model robust against
malicious attack, while leaving the model itself unchanged. First, during group
joins, the group and the joining process are mutually authenticated to one another.
More precisely, the group members are informed of the site from which the process
is attempting to join, as well as the owner of the process according to that site. Any
effort by an intruder to replay a previous join sequence or to forge the apparent
site from which a request is sent will be detected. And, the joining process knows
that responses apparently from the group are actually from the group that it is
trying to join.
Second, a group member must explicitly grant each group join before the join
is allowed to proceed. If the group members elect to not admit the joiner, they can
deny the request, in which case the joiner will not be admitted. In particular, if the
requesting site is not trusted by the group members to have properly authenticated
the owner of the process requesting to join, the members may choose to deny the
request.
62
Third, the integrity of these mechanisms and the Horus abstractions, as well as
the authenticity of group communication, are guaranteed within each group that
has admitted no processes on corrupted sites, i.e., sites at which an intruder has
tampered with the hardware or operating system. (The requirement that a group
not admit processes on corrupt sitesdoes not imply that member processesneed not
be trusted, as an untrustworthy process could admit a corrupted site to the group.)
Secrecy can also be requested by group members, in which case their messages will
be encrypted before being sent, in an effort to prevent their disclosure to a network
intruder. Secure point-to-point communication is also supported both within and
outside of groups, although here we discuss only secure group communication.
The programming model thus presented to the programmer is one in which
each process group can be viewed as a "fortress", where admission to this fortress
is regulated by the group members themselves (see Figure 4.la). Provided that the
members admit only processes on trustworthy sites, the normal Horus semantics
are guaranteed within the group. Moreover, these guarantees are achieved with
minimal changes to the process group interface. So, tools and applications designed
for the Horns interfaces should be able to be ported to secure groups easily.
A setting to which this style of secure group is particularly well suited is one in
which a fault-tolerant service must be provided to a larger, untrustworthy system
against which the service must protect itself. Such a service could be composed of a
single secure group located on a small "island" of trustworthy sites. Alternatively,
in a larger service where greater internal control is also required, the service could
be implemented using many secure groups, arranged to enforce security policies
within the service and to limit the damage to the overall service from the corruption
of a site (see Figure 4.lb). While the groups could span sites with different levels
of trustworthiness, each group is only as secure as the least secure site or process
it contains.
63
= process;			= secure group)
(a) A process requesting to join is au-
thenticated to the group members, who
either grant or deny the request. Inside
the group, communication is protected
cryptographically from active, and if re-
quested, passive network attacks.
insecure sites
(b) Applications can be built from many
secure groups to enforce internal security
policies and to limit damage from site cor-
ruptions. A group can span sites secured
to different levels, but is only as secure as
the least secure site admitted.
Figure 4.1: Secure process groups
4.3 Implementation
The security architecture as implemented in Horus is shown in Figure 4.2. On
each site, the core Horus functionality is implemented in a transport layer entity
called MUTS and a session layer entity called vSYNc, both of which will reside in
the operating system kernel on most platforms [vRBC+92]. The purpose of MUTS
is to provide reliable, sequenced multicast among sites; VSYNC then implements
the process group and virtual synchrony abstractions over this service. Horus will
also provide user-level libraries and tools to facilitate primary-backup computa-
tions, replicated data, etc. While security mechanisms are included at all levels of
the system, the core functionality of the security architecture exists in MUTS and
V5YNO. These layers have been augmented to distribute and use group keys.
64
Autflentication
Service
Security Services			ser Address 5 ace
Time			Group Key			User Processes
Service			Service			Access
Control			Toolkit :
Figure 4.2: The Horus security architecture
4.3.1 Group keys
Group keys form the foundation of security within each process group. The group
keys are a pair of cryptographic keys that are replicated at each site in the group.
While the security dangers of replication also apply to group keys, we view this
replication as acceptable at this level of the system for two reasons. First, the
user has complete control over where the group members, and thus the group
keys, reside. This gives the user the opportunity to determine a prudent degree
of replication for each group, depending on the nature of the application and the
environment in which it will run. Second, the damage resulting from imprudent
replication of a group's keys is limited, because the disclosure of a group's keys
corrupts only that group, not the entire system. Group keys are managed in the
VSYNC layer and are never held by user processes.
The first of the group keys is a key to a symmetric, or shared-key, cipher. This
key is called the communication key of the group, because it is used by MUTS and
V5YNC LO UoHIIuull1ca?e securely with other group members. Group communication
is not protected under the communication key directly. Rather, the communica-
65
tion key is used by MUTS to establish authenticated connections within the group,
which preserve the authenticity and order of all internal group communication, and
by VSYNC to distribute keys for the encryption of user messages. This indirect use
of the communication key is done to limit the amount of communication protected
under, and thus the amount of exposure of, the communication key, which is in-
tended to be used for the entire lifetime of the group. In our implementation, the
communication key consists of two DES [DES77j keys (112 bits), and encryption
is performed with it using the triple encryption technique of [Tuc79].
The second of the group keys is an R?SA private key whose corresponding public
key is incorporated into the group address. This private key is called the ?dentifl-
eat?on key (or "group private key") of the group, because it is used to identify, or
authenticate, sites and processes in the group to outsiders that possess the group
address. Authentication of group members is necessary whenever a process needs
to communicate with a group member, e.g., because the group is providing a service
that the process desires. Again, identification keys are held by V5YNO, although
group members can obtain signatures on messages through a VSYNC interface. Like
the communication key, the identification key is intended to be used for the lifetime
of the group. In the present implementation, the size of group private keys is a
compile-time constant (we typically use 512-bit RSA moduli), but in the future we
intend to allow the user to choose from several sizes when each group is created.
The group keys for a group are created by a user-level service on the site where
the group is created. This service, called the ?roup key servzce, generates sets of
group keys in the background and caches them in VSYNC. This is done to remove
the costly generation of an RSA key pair from the critical path of group creations.
(With the implementation we presently use, generation of a 512-bit RSA modulus
on a 33MHz Sparc ELC usually costs at least several seconds.) When a local process
requests to create a group, VSYNC removes a set of group keys from its local cache
and associates them with the group. VSYNC also creates and returns the group's
66
address, which contains, among other things, the public key and a location hint for
the group (i.e., a transport address). Latency of a group creation is minimal unless
the VSYNC cache is empty, which could happen, e.g., if the creation is requested
shortly after the site is booted and before the group key service has produced a set
of group keys.
Because other processes must use this group address to contact the group, the
user process would typically distribute this address in some fashion. How this is
done is up to the process, which has available to it secure group and point-to-point
communication that it can use for this purpose. However, this distribution will
often occur through the Horus narne service, a user-level, replicated, distributed
service that implements a hierarchical name space, much like a file system. In this
case, the user process (communicating over an authenticated channel) registers the
group address under a name in the name space, from which other processes can read
the address. To prevent an intruder from altering the address (and thus the public
key) for a group, we have enhanced the name service to enforce access controls that
restrict what processes may write addresses to a name. A more detailed discussion
of the name service and access control is presented in [RBG92j.
We should note that this use of the name service requires that it be protected
from tampering.1 Since it plays a central role for many applications, it might be
appropriate to replicate it using the techniques of Chapter 2 or [Sch9OJ. However,
we have not yet done this in the present implementation, and in fact, the cost of
doing so may be prohibitive for many general purpose uses. We hope to explore
alternative implementations of name services to address these issues in the future.
We stress, however, that applications are not bound to use any particular name
service or even to name their groups at all. In addition, we permit multiple name
1The need to protect the name service can be reduced by storing signed group addresses
(much like certificates) in it. Our name service will support this possibility but will not require
it, because doing so implies that for each group, there must be some well-known principal(s)
trusted by all other principals to certify the group address of that group. Instead of requiring
this for all groups, we have chosen to leave this matter of policy up to each application.
67
services within our architecture, opening the possibility that highly secure name
services could exist side-by-side with less secure name services.
To minimize the risk of exposing a group's keys, only those sites that need
them i.e., the sites of current group members--Hshould possess them at any given
time. Thus, if a site leaves a group, it destroys its copy of the group keys for that
group in an effort to prevent the subsequent corruption of this site from disclosing
the group keys to an outsider. In the event of a site failure, we rely on the loss of
volatile storage to erase the keys from memory.
4.3.2 The group join protocol
Once a process has obtained the group address for a group, it can contact the
group directly to request a service, or it can request to join the group. In the
former case, the process can become a client of the group so that it can securely
and efficiently communicate with group members. Clients are a user-level notion
managed by user-level protocols; i.e., VSYNC does not support clients of groups,
except to provide generic secure channels over which clients and group members
communicate.
To join a group, however, VSYNC must be directly involved to obtain the group
keys. In preparation for group joins, each site A is booted with the public keys of
the authentication and time services of Chapter 3, and its own private key K?; the
authentication service has the corresponding public key KA?1. VSYNC periodically
synchronizes with the time service. It also obtains a certificate CERTA for its
site from the authentication service and refreshes this certificate periodically, well
before its value of U(t) (see Section 3.2.1) surpasses the certificate expiration time.
When a process on site A desires to join a group, it provides to VSYNC the
group's address. It can also specify a message m to be sent to the group members.
VSYNC then follows the protocol shown in Figures 4.3 and 4.4. In Figure 4.3,
encryption and signature with key K are denoted by [`]K and t?JK, respectively.
68
(A message is "signed" with a shared key K by encrypting a one-way hash of the
message with K [SG92].)
A H B: CERTA, ?G, T, U, [RP, RK]Kal, HRKlKA
B A: fA, N, CK e RP, [KG]cKlRK
A?B: INiRK
Figure 4.3: VSYNC protocol by which site A joins group G, containing site B
1. Site A creates a request containing
o+ a global identifier & for the group that is obtained from the group
address;
o+ a timestamp T equal to the site's current value of L(t) (see Section
3.2.1);
o+ the user-id U of the apparent owner of the requesting process (although
more generally U could be a representation of the user that the group
members could verify themselves);
o+ a new, secret 112-bit reply pad fiP and 56-bit DES reply key RK, both
encrypted under the public key of the group (i.e., [RP, RK]K?l) and
o+ the user-specified message m, encrypted under RK if the user process
so requested (e.g., because m is a password or capability for admission
to the group).2
2Long-term capabilities or passwords for admission to the group should not be protected
under RK, because otherwise admission to the group, and thus CK, could be obtained simply
by breaking the relatively weak key RK. If RK is needed to protect long-term information, it
should be of strength comparable to that of CK.
69
All of this information is signed with KA and preceded by A's certificate. A
sends this message to the group using the location hint in the group address.
Safety of this protocol does not rely on the hint being accurate. If it is in-
correct, however, the requesting process must obtain a more current location
hint for the group, e.g., through the name service mentioned in Section 4.3.1
For the rest of this discussion, we assume that A's message reaches a site in
the group.
2. When a (particular) site B in the group receives this message, it authenti-
cates the requesting site by checking the authentication service's signature
on the certificate, extracting the public key from the certificate, and then
checking the signature on the request. Moreover, it verifies the timeliness of
the message by comparing the timestamps in the request and the certificate
to its current value of U(t), as described in Sections 3.2.1 and 3.3.1, respec-
tively. If the message is determined to be valid, B uses the group private key
KG to decrypt RK and, if necessary, RK to decrypt m. Then, it delivers an
upcall to a local member, indicating the group G, site A, apparent owner U,
and message m.
3. The local member can take any measures including, e.g., executing protocols
with other group members or the requesting process, to determine whether
the request to join should be granted. In particular, if the member does not
trust A to have authenticated the owner of the requesting process properly,
then it should possibly not allow the process to join. Eventually it informs
VSYNC of its decision.
4. If access is granted, B returns the group keys to A as shown in Figure 4.3:
the group communication key CK is encrypted with the reply pad RP by
bitwise exclusive-or (?), and the group private key Kc is encrypted under
CK (which is done off the critical path of this protocol). B also includes the
70
y
L
There are two reasons that RP and RK are used to encrypt the group commu-
nication key and to sign B's response, respectively, versus using KA?1 and KG for
these operations. First, the cryptographic operations with J?P and RK are more
efficient than the corresponding operations with the RSA keys. Second, in the
event that A leaves the group and is later corrupted, the use of RP prevents this
corruption from disclosing CK. That is, if the group keys were communicated to
Figure 4.4: Overview of the VSYNC group join protocol
(b) In this case, access is granted. The
group keys are securely sent to A in mes-
sage 2, in parallel with a group synchro-
nization protocol. After the keys are ac-
knowledged (message 3), the new group
view is installed.
c
New view
installed
y
identity of A and a nonce N in the message, which it signs with RK
5. After A authenticates B's reply and notes its own identity in the message,
it acknowledges the keys by returning N signed with RK. Note that an
attempt by a network intruder to replay a message in place of B's reply will
be detected because RK is a new (i.e., "fresh") key.
Roe?;???st #M?
Key			-
distribution
protocol			-			-
Request
to join
Access
denied
(a) A process on site A requests to join
the group containing (processes on) sites
B, C, and D. A sends the request to
B, which delivers the request to its local
member. Here, the member denies access,
and B replies accordingly.
A			1
71
A encrypted under (only) KA?1, the corruption of A would reveal KA and thus CK
if the intruder had previously recorded the protocol by which A joined the group.
Using RP to encrypt CK prevents this because after the join protocol completes,
A and B destroy RP and any state that could be used to reconstruct it.
Two points are worth emphasizing about our use of the authentication and time
services. First, neither service is on the critical path of the group join protocol.
This is more notable in the case of the authentication service, because in most
systems, the authentication service (or, in [TA91,LABW92], the CDC) is on the
critical path of authentication protocols. Second, the transparency of replication in
the authentication service simplifies the protocol. If, e.g., state machine replication
were used, each site would need to maintain certificates from a majority of servers
to prepend to its requests. This would also result in a substantial computational
overhead for authenticating these certificates.
The latency of a group join in the present implementation, measured over
SunOS 4.1.1 on moderately loaded 33MHz Sparc ELC workstations, is approxi-
mately 2.26 seconds on average. This cost is independent of group size, except
for very large groups. Over 90% of this cost can be attributed to the modular
exponentiation routines of the (software) RSA implementation we used3: with the
modular exponentiation operations removed, the cost of a group join drops to
195ms on average. Clearly hardware support for modular exponentiation would be
of value for applications in which group membership is very dynamic. However,
experience with Isis (see [BSS9lJ) leads us to believe that most Horus applications
will employ largely static groups, and so we do not expect that most applications
will require such hardware to use the security architecture effectively.
3In these tests we used the C implementation of modular exponentiation provided with the
RSAREF toolkit, licensed free of charge by RSA Data Security, Inc. The RSAREF toolkit
was developed to support privacy-enhanced electronic mail, not interprocess communication
and faster software implementations of modular exponentiation are available from RSA Data
Security, Inc. and others. However, with most software implementations of which we are aware,
the cost of modular exponentiation would continue to be a limiting factor in the performance of
the group join protocol.
72
4.3.3 Secure group communication
As discussed at the beginning of this section, all communication within a secure
group is protected cryptographically from tampering by a network intruder. And,
if requested, group members' messages will be encrypted before being sent on
the network. The mechanisms for performing these functions are decoupled in
our system: verification of message authenticity is performed by MUTS, whereas
encryption to prevent the release of message contents is performed in vSYNc.
The decision to decouple the mechanisms for preserving authenticity and se-
crecy is supported by several factors. First, maximum benefit from the authenticity
mechanisms is achieved by incorporating them at the lowest layer of the system,
within MUTS. This enables the VSYNC protocols to rely on the abstractions pro-
vided by MUTS, because all MUTS messages, and thus the MUTS abstractions, are
protected from tampering by a network intruder. On the other hand, because only
user messages need be encrypted,4 the encryption mechanisms are incorporated at
a much higher level, in VSYNC. This enables the encryption mechanisms to exploit
the abstractions provided by the lower layers in its algorithms. Finally, decoupling
these mechanisms allows us to use faster algorithms for each.
The algorithms for preserving authenticity and secrecy both rely on the crypto-
graphic strength of a one-way hashfunct?on. Informally, a one-way hash function
f has the properties that it is computationally infeasible to produce two inputs mi
and m2 such that f(m1) = f(7n2) or to produce any input m such that f(m) = h
for a given, prespecified value h. In recent years, several fast one-way hash func-
tions have been proposed; our implementation presently offers the use of either
MD4 [Riv9lj or MD5 [RD91]. Both of these functions process inputs of arbitrary
length in 64-byte blocks, maintaining a 16-byte state between blocks, to produce a
16-byte hash value. MD5 is conjectured to be stronger than MD4, but in our tests
is also approximately 30% slower.
4We make 110 effort to address traffic analysis attacks [VK83j.
73
MUTS packet authentication. Messages, or more accurately, MUTS packets, are
authenticated on a per-connection basis. A MUTS connection is a logical data path
from one MUTS instance to others. Only the instance that opened the connection
can send data across it, although recipients on a connection also acknowledge
packets over that connection.
The first packet sent on a connection includes a connection key, encrypted
under the communication key of the group in which the connection is opened. (This
packet also contains a timestamp to allow the recipients to detect replay attacks, as
described in Section 3.2.1.) The connection key consists of a 16-byte initialization
vector (IV) and a suffix whose size is a compile-time constant (typically 16 bytes).
To send a subsequent packet on the connection, the sender initializes the inter-
nal state of the hash function f to IV and then applies f to the packet and the
suffix. (Alternatively, the IV can be replaced by a secret prefix that is processed by
f before the packet [Tsu92J.) The result is placed in the packet header, as shown
in Figure 4.5a. The recipient of a packet verifies it by copying the hash value out
of its header, clearing the hash field, applying f (initialized with the IV) to the
packet and the suffix, and comparing the result to that copied from the packet. If
the hash values match, then the recipient considers the packet authentic. Replays
of packets are detected using sequence numbers in the packet headers.
To our knowledge, this form of message authentication was first proposed in
[Tsu92], although a variation of this approach was employed in [GMD91] and a
similar use of one-way hash functions in authentication protocols was proposed in
[Gon89b]. Its security relies on the assumption that it is infeasible to determine the
initialization vector and the suffix from packets and their hash values, and that the
secrecy of the initialization vector and the suffix prevents a network intruder from
forging proper hash values on packets. A substantial advantage of this approach
over techniques that use encryption is speed, because fast one-way hash functions
are typically much faster than encryption functions (e.g., MD4 in software is over
74
0			packet			suffixi
2L
hash
value
packet
(a) A MUTS packet is "signed" by append-
ing to it a secret suffix and applying a one-
way hash function f, initialized with a se-
cret initialization vector (TV).
k3k2kl\?...c3c2ci
m?m?mi?
(b) VSYNC encrypts a message by XOR-
ing it with a key stream that is produced
by applying a one-way hash function f to
a counter. The counter is incremented be-
tween applications.
Figure 4.5: Cryptographic operations
three times as fast as the fastest software DES implementations [LABw92]).
VSYNC message encryption. Because MUTS assures authentic, sequenced mul-
ticast in gwups containing no corrupt sites, the encryption algorithm we employ
ill VSYNC to encrypt user messages need not attempt to			protect the authenticity
of those messages. This allows us to use faster encryption algorithms that are
generally not suitable for protecting message authenticity in open networks. The
form of cipher we have chosen is a synchronous stream cipher [DH79j.
In a stream cipher, the ciphertext of a message m			....... is obtained by
enciphering the i-th element m? with the i-th element kj of a secret key stream
....... A stream cipher is said to be synchronous if the key stream is generated
independently from the message stream. In our implementation, each m? and kj
is a single bit. The key stream is generated using the counter method of [DH79j:
16 bytes of the key stream are generated by applying a one-way hash function
75
f to a 16-byte integer counter, and the counter is then incremented before the
next application of f to obtain the next 16-bytes of the key stream (see Figure
4.5b). Provided that the counter is initialized to a secret, unpredictable value and
f is sufficiently strong, knowledge of one portion of the key stream will not reveal
other portions. That is, the key for producing ....... is the initial value of the
counter. The i-th bit Cj of the ciphertext is simply the exclusive-or of ki and m?
(i.e., Cj = kj ?
We use this cipher as follows. On each message installing a new group view
containing n sites, the VSYNC instance sending the message includes a list of n
new, unpredictable 16-byte integers encrypted under the communication key for
the group. Each site in the group decrypts this list and initializes n integer counters
to the n values in the list. The i-th site in the group encrypts a message to
the group using the key stream generated from the i-th counter. Similarly, sites
decrypt messages from the i-th site by exclusive-oring the next portion of the i-th
key stream against the received message.
One requirement in using a synchronous stream cipher is that the key streams
of the sender and receivers remain synchronized with one another. We have im-
plemented these encryption mechanisms at a level within V5YNC that allows u?
to exploit some process group semantics for this purpose. In particular, the level
at which encryption is performed within VSYNC ensures that a message encrypted
while the group is in one view will be decrypted at all recipients in the same view.
This makes view changes an appropriate time to reset the integer counter for each
site in the group to a new value, which should be done occasionally to make crypt-
analysis more difficult. In addition, the level at which encryption is performed also
ensures that messages from the same site are decrypted in the order they were
encrypted, which automatically keeps the key stream for each site synchronized at
all sites in the group between view changes.
An advantage of synchronous stream ciphers over other encryption methods is
76
that they allow much of the work for encryption to be done in the background. In
our system, we precompute and cache portions of the key stream for each site in
a group so that the key stream is immediately available for use. This reduces the
encryption of a message smaller than the cache to simply an exclusive-or over the
length of the message. The size of the cache for each site in a group is presently a
compile-time constant, although in the future we intend to make this size dynam-
ically adjustable.
Performance. Preliminary performance figures for group communication are
pictured in Figures 4.6 and 4.7. In each graph, the line labeled "N" indicates
performance when no security mechanisms are employed. The line labeled "A"
indicates performance when packets are authenticated by MUTS. The lines labeled
"E2" and "E5" indicate performance when packets are authenticated by MUTS and
the sender's message is encrypted by vSYNc. In the E2 tests, both the sender and
the receivers maintained a 2 kilobyte (kb) cache of precomputed bits for encryp-
tion and decryption; in the E5 tests, these caches were of size 5kb. These tests
used MD4 for f and were performed between user processes on moderately loaded
33MHz Sun 4c/25 Sparc ELC workstations running SunOS 4.1.1.
In Figure 4.6 are average latencies for member-to-group RPC interactions. In a
member-to-group RPC, one sender multicasts a single message to the membership
of the group, and all members acknowledge (with a null message). The latency is
the measured time at the sender between sending the initial message and receiving
all acknowledgements. Each data point in Figure 4.6 is the mean latency of 100
consecutive member-to-group RPCs performed as quickly as possible.
Figure 4.7 indicates member-to-group bandwidth, i.e., how much data can be
pushed from a single member to the other group members per second. Each data
point was obtained by performing a 1 megabyte (Mb) member-to-group RPC and
dividing 1Mb by the time required for this RPC to complete.
Two items are worth noting about the graphs in Figure 4.6. First, in part (a)
77
N = no security; A = authenticated only;
E2 = authenticated & encrypted, 2kb cache;
= authenticated & encrypted, 5kb cache
55			19-
45			17-
35			15-
ms			ms 13-
25
11-
15			9-
7-
A
N
I I
012345			1			2			34
message size (kb)			no. of recipients
(a) Latency as a function of message (b) Latency as a function of group
size. Group of size of two.			size. Message of size lkb.
Figure 4.6: Member-to-group RPC latencies (ms)
the rapid rise of E2 relative to E5 beginning after the 2kb message is due to the fact
that in the E2 tests, the sender and receiver each precomputed only 2kb of the key
stream. So, while their cache contents were sufficient to encrypt and decrypt the
2kb message, the 3kb message exhausted their caches and forced both to generate
parts of the key stream before sending or delivering the message. Similarly, in
Figure 4.7 the large impact on bandwidth due to encryption is partly due to the
immediate exhaustion of the sender's and receivers' caches when encrypting and
decrypting very large (in these tests, 1Mb) messages.
The second item of interest regards the graph in part (b) of Figure 4.6. While
not surprising, it is still interesting to note that group size has virtually no effect on
the cost of message encryption or packet authentication. The increase in latency
as a function of group size is primarily a result of sending the message to increas-
78
650
550
450
kb/s 350
250
150
50			E2
1			2			34
no. of recipients
N, A, and E2 are the same as in Figure 4.6.
Figure 4.7: Member-to-group bandwidth (kb/s)
ingly many destinations, and if hardware multicast were available, this increase
should virtually disappear. At the time of this writing, however, we have not yet
experimented with hardware multicast.
In addition to hardware multicast, we plan to pursue other optimizations to
the performance of group communication. We are continuing to optimize the cryp-
tographic mechanisms of our implementation. Moreover, efforts will begin soon to
incorporate flow control mechanisms into MUTS to improve performance. At the
present time, MUTS provides little flow control, and so in the tests of Figures 4.6
and 4.7, we had to use small packet sizes and frequent acknowledgements to pre-
vent packets from being dropped by the operating system. We anticipate that flow
control mechanisms will improve the performance of MUTS transport substantially.
Another important aspect of our effort to achieve better performance is integrating
Horus into microkernel-based operating systems.
79
4.4 Summary and discussion
In this chapter we have presented our security architecture for fault-tolerant sys-
tems. An integral part of this system is an authentication service and a time service,
described in Chapter 3, that securely and fault-tolerantly support key distribution.
Together these services are used to support a secure process group abstraction.
This abstraction provides a means to replicate applications in a protected fash-
ion. Authentication of group members and protection of group communication are
achieved through the use of group keys. These are distributed to processes during
the group join protocol, which employs the authentication and time services of
Chapter 3. The mechanisms for ensuring the authenticity and secrecy of group
communication have been decoupled to allow the use of faster algorithms for each.
Preliminary performance results are encouraging.
By integrating our architecture into the Horus system, we have secured Horus'
virtually synchronous process group abstraction. In addition, changes to the Horus
process group interfaces due to the security mechanisms are minimal, consisting
only of additional routines to grant or deny join requests and additional options to
some routines to indicate when communication should be encrypted. As a result,
applications and group programming toolkits designed for the Horus interfaces
should be able to be ported easily to work over secure groups and thenceforth
can be relied upon in a secure group that has not admitted a corrupt site or pro-
cess. Such toolkits planned for Horus will facilitate primary-backup computations,
data replication, automatic synchronization among group members, client-server
computations, and so forth.
4.4.1 Status of the system
The Horus system, including the security architecture described in this chapter,
is intended to be released to the public later this year. At the time of this writ-
ing, all of the mechanisms described in Chapter 3 and Sections 4.2 and 4.3 have
80
been fully implemented, with the exception of the name service discussed briefly in
Section 4.3.1. We presently have a preliminary name service running with limited
functionality. A name service with the described functionality is intended for de-
velopment in the near future. In addition to working on this and other components
of the system not described here, we are working to improve the performance of
the system as discussed in Section 4.3.3.
4.4.2 Related work
To our knowledge, consideration of security issues unique to group-oriented systems
first occurred in the design of the V kernel [CZ85j. V supports a notion of process
groups, but with weaker semantics than that of Horus. While V is not a security
kernel and does not support, e.g., key distribution or secure communication, V
does make efforts to restrict group membership for security reasons. In principle,
the security architecture proposed here could be integrated with V to further these
efforts, although not without changes to some algorithms for using group keys.
There have also been attempts to address security issues in systems that sup-
port fault-tolerant computing using approaches other than process groups. One
example is the Strongbox extension to the Camelot distributed transaction process-
ing facility [TY9t]. Strongbox provides mechanisms for mutually authenticating
clients and servers and for encrypting communication between them.
4.4.3 Future work
There are several issues that we have not attempted to address in our architecture
but that should be addressed in systems in which this technology is employed; some
of these may be pursued in the future. Examples include securely booting sites
and user authentication. To aid in booting sites we have built a boot server, which
provides to each site its initial keys using a secret boot key that must be provided to
the site by an operator. (Alternatively, these initial keys can be stored on the site
81
in non-volatile memory [LABW92].) However, we do not take measures to verify
the integrity of the operating system or Horus when the machine is booted; for one
approach to doing this, see [LABW92J. We view these issues, and other intra-node
issues such as protected address spaces and local inter-process communication, as
more general operating system security issues that have not been goals of this work.
We have also not attempted to address the issue of user authentication. How-
ever, the mechanisms described in this paper facilitate several solutions to this
problem. For instance, the authentication service of Chapter 3 and our boot ser-
vice could easily be extended to provide certificates for users and users' private keys,
respectively, similar to the Certificate Distribution Centers and Login Enrollment
Agent Facility of SPX [TA91]. The secure channels provided by our architecture
also facilitate simpler password-based user authentication mechanisms.
Chapter 5
Preventing denial and forgery of
causal relationships*
5.1 Introduction
In a distributed system, it is often important to detect the causal relationships
between events, where event e1 is causally before event e2 if e1 happened before
e2 and could possibly have affected the occurrence of e2 [Lam78b]. Causality has
been recognized as fundamental to distributed computing and forms the basis for
event orderings in many distributed systems and distributed service implementa-
tions (e.g., [PBS89,HW90, Sch90, BSS91,LLSG92]). For instance, several systems
implement communication primitives that deliver messages in an order consistent
with the causal relationships among the messages (i.e., among the events in which
the messages were sent). This causal ordercan be seen as an extension of FIFO
order to a setting with multiple senders and receivers, and is especially useful in
systems that exploit asynchronous communication for performance [BCG91].
In Chapter 2 we illustrated how the corruption of a server in a replicated service
*oc 1993 IEEE. Reprinted, with additions and minor changes, from: M. K. Reiter and L. Gong.
Preventing denial and forgery of causal relationships in distributed systems. In Proceedin?s of
the 1993 IFEE Symposium on Research in Security and Privacy, May 1993, pp. 30--H40.
82
83
could lead to a violation ofcausal order in the sequence ofrequests processed by
the correct servers; the result was aviolation ofsecurity policy. In this chapter we
further explore the importance of detecting causal relationships to security. We
argue that detecting causalrelationships amongevents can be important forsecu-
rity in awide range ofsettings, in that it may be essential to the implementation
ofa security policy that a process be able to determine iftwo events are causally
related, and ifso, how.
Theviewofcausalitythat we takeisvery different from that taken by previous
treatments of causality in the security literature. Previous studies of causality
and security have occurred in the context of multi-level information flow, where
one goal is, informally, to prevent events at higher-level objects from causally
preceding events at, and thus carrying information to, lower-level objects. That
is, in previous works, causal relationships have been viewed as something to be
avoided in order to achieve noninterftrence [GM82].
In contrast, we claim that because ofthe fundamental role ofcausality in dis-
tributed systems, the accurate detection (but not elimination) ofcausal relation-
ships can be crucial to security in distributed systems. This was first illustrated
in [RBG92] by the following example of "insider trading," avariation ofwhich we
also employed in Chapter 2; suppose that a trader issues a request to a trading
service to purchase shares ofstock, and then as a result ofan (indirect or direct)
interaction with another trader, the other trader infers that this request has been
made. Ifthelatter trader is ableto submit arequest tothe tradingservicein such
a way that the two requests appear to be concurrent, the service could be fooled
into processingthe latter trader's request first. The result could be, e.g., that the
request ofthe latter trader could increasethe apparent demand forthe stock, and
thus the price offered to the former trader. To prevent this insider trading, the
tradingservicemust recognizethat the request ofthelattertraderiscausallyafter
that ofthe former, and should process that ofthe former first.
84
As another example of the importance of causality detection to security, con-
sider a scenario in which a company announces to the trading network that it is
merging with another company. Suppose that a trader with inside information of
this merger requested to buy large quantities of the company's stock prior to the
announcement but, to avoid suspicion, attempted to make it appear that the re-
quest was initiated causally after the announcement. If the trading service accepts
that the request was initiated causally after the announcement, then the insider
trading may go undetected.
More generally, because of the fundamental importance of causality to so many
distributed algorithms, the conversion of these algorithms for use in a hostile en-
vironment necessarily relies upon the accurate detection of causal relationships
despite malicious behavior. The above examples show that the type of causality
detection required to implement a security policy can differ from one policy to
the next. As illustrated in the first trading example above, a security policy may
require that if a causal relationship exists, then it is detected. On the other hand,
in the second example, security relies on an inverse requirement, namely that if
a causal relationship is detected, then it should actually exist. Thus, depending
on the security policy, it may be important that a principal not be able to deny
existing causal relationships or to claim nonexistent ones without being detected.
In this chapter we formalize possible security goals with respect to causality and
present simple algorithms to attain these goals in some situations. This work is a
major generalization and improvement of the discussion of causality in [RBG92j,
in two ways. First, this work presents a general framework in which attacks on
causality can be examined; in this framework, we were able to identify attacks
that are not considered in [RBG92]. Second, we present new algorithms to counter
these attacks.
The rest of this chapter is structured as follows. In Section 5.2, we describe the
assumptions that we make about the system. In Section 5.3, we formally define
85
the notion of causality. In Section 5.4 we formalize our security goals with respect
to causality. In Sections 5.5 and 5.6 we describe several algoritlims for reaching
these goals. We summarize and describe related and future work in Section 5.7.
5.2 The system model
We assume a system consisting of a set ? = ....... , PnJ of pwcesses that are
spatially separated and that communicate exclusively via a completely connected,
point-to-point network.1 We often denote processes with the letters P, Q, R and S
when subscripts are unnecessary. Processes that behave according to their design
specification are said to be co?rect Processes may be cor?upted and thereafter
may exhibit arbitrarily malicious behavior, limited only by the assumptions stated
below.
The execution of each process is modeled as a sequence of indivisible events.
There are two types of events that can be executed by processes: sending a message
m to a process, denoted by send(m), and receiving a message m from a process,
denoted by receive(m). (Internal computations are not explicitly modeled.) Mes-
sages are identified by their send events and not their contents; e.g., messages with
the same contents sent in different events are different messages for our purposes.
We assume that each process receives only messages that are sent to it (or by it;
see below). In particular, communication channels between correct processes are
authenticated and protect the integrity and the secrecy of communication, so that
corrupt processes cannot tamper with or receive this communication. In addition,
all communication between corrupt processes is modeled with explicit sends and
receives, regardless of its actual form (e.g., signals via a covert channel). We also
assume that channels between correct processes provide FIFO delivery using, e.g.,
a standard sequence number mechanism [vK83j.
1The results of this chapter can be extended for multicast communication, although multicast
complicates the algorithms and discussion with little benefit. Thus, for simplicity we treat only
point-to-point communication here.
86
Many algoritlims for detecting causality in benign environments utilize assump-
tions of synchronized clocks or bounded message transmission delays (e.g., [Sch9O]).
However, we do not assume that correct processes maintain synchronized clocks,
or that message transmission times between correct processes or execution speeds
of correct processes are bounded. That is, the system is totally asunchwnous.
Finally, to simplify the following discussion, it is convenient to stipulate that
at each process, the event send(m) is immediately followed by receive(m), with no
other events occurring between these two. So, a message is received by its sender
and (possibly) by its intended destination.
5.3 Definition of causality
We use the notion of causality formulated by Lamport in [Lam78b]. As described
in Section 5.1, one event is causally before another if it could have affected that
other event. More precisely, suppose we define the "one-step" causality relation
as the smallest relation satisfying the following conditions:
1. If events e1 and e2 are executed consecutively at the same process, then
el			e2.
2.			For any m, send(m)			receive(m).
Then, the causality relation is simply the transitive closure of ?.
In this chapter, we will be concerned with causal relationships among messages,
where two messages are causally related precisely as the events in which they were
sent. So, if send(m1) send(m2), then we say that m1 is causally before m2 and
m2 is causally after mi. We will often use "m1 H m2" as an abbreviation for
"send(m1) H send(m2)"
Finally, we define a causal chazn to be a sequence of events el, .2.... ,ei such
that el			e2			...			el. Note that el			e2 if and only if there exists a causal
chain beginning with ei and ending with ?2
87
5.4 Causal security goals
In Section 5.1, we discussed several examples in which the detection of causal rela-
tionships was important for security. In this section we attempt to more carefully
formulate security goals with respect to causality. We introduce two notions, denial
and forgery, that capture the ways in which efforts to detect causal relationships
between messages can fail due to malicious or accidental behavior, and discuss
how these notions relate to the examples of Section 5.1. Sections 5.5 and 5.6 are
devoted to preventing denial and forgery, respectively.
Since there is a version of denial and forgery for each causality detection algo-
rithm, when defining these notions it is convenient to abstract all such algorithms
as a predicate C on pairs of messages. That is, we assume that a process deter-
mines if message mi is causally before message m2 by evaluating C(mi, m?). If
C(m1, m?) evaluates to true, then the process "believes" that m1 m2; otherwise,
it "believes" that mi 74 m2, where 74 is the complement of H Thus, C has the
following desired behavior:
C(mi, m2) =			true			if mi m2,
false			otherwise.
A correct process P generally need not be able to evaluate C on all pairs of messages,
but should be able to compute C(mi, m2) if both receive(m1) and receive(m2) are
executed at P. (Recall that if a process executes send(rn), then it also executes
receive(m).) In the remainder of this chapter, we will concern ourselves with
predicates of this form only.
Given C, we can now define the notions of denial and forgery, which can occur
due to malicious or accidental behavior, if c is not robust to such behavior.
Denial: A causal relationship is denied (with respect to C) if there exist messages
mi and m2 such that m1 m2, but at some correct process C(m1, m2) is
false.
88
Forgery: Acausalrelationshipisfo?ed(withrespecttoC)ifthereexist messages
m1 and m2 such that mj 74 m2, but at some correct process C(m1,m?) is
true.
We have already seen examples of how denial and forgery can result in security
problems. For instance, reconsider the trading examples in Section 5.1, which are
represented pictorially in Figure 5.1. In the first example, the second trader Q
attempts to deny that its request m2 is causally after P's request m1 as a result
of its interacting with F (possibly through other processes s). If the attempt is
successful, the trading service R may fail to recognize that m1 should be serviced
before m2. The second example illustrates the dangers of forgery: the trading
service R should not interpret the request ?2 from the trader Q to be causally
after the announcement mi from the company P when in reality it is not.
C(m1,m2) =
Process R detects causal relationships be-
tween messages with the predicate C.
Figure 5.1: Causality detection
The next two sections of this chapter are devoted to finding algorithms to
prevent denial or forgery in various situations. In general, to prevent denial it
must be the case that
89
D: If mi m2, then C(mi, m?) is true at any correct recipient of m1 and m2.
On the other hand, the prevention of forgery requires that precisely the converse
hold:
F: If C(mi, m?) is true at any correct recipient of m1 and m2, then m1 m2.
In order to rule out trivial solutions that provide no causal information, we also
require that our algorithms satisfy the following property in addition to preventing
denial and/or forgery:
Elf there exists a causal chain e1,... , ej such that el = send(m1), e? = send(m2),
and for each j ? ?l,. .. , 1?, e? was executed at a correct process, then at any
correct recipient of m1 and m2, C(rn1, m?) is true and C(m?, mi) is false.
Property E requires that a causal chain traversing only correct processes be rec-
ognized. E serves to rule out some trivial algorithms that provide no causal in-
formation, such as "C(m1, m?) = false for all mi and m2" (which satisfies F) and
"C(m1, m?) = true for all rn1 and m2" (which satisfies D).
In Sections 5.5 and 5.6, we concentrate on finding algorithms to satisfy E al-
ways, and D or F if the sender of m1 (in the statement of D and F) is correct.
In Section 5.5, we present two algorithms that satisfy E and that satisfy D if the
sender of mi is correct. Then, in Section 5.6, we present two algorithms that
satisfy E and that satisfy F if the sender of mi is correct. What can be done to
satisfy D and/or F when the sender of m1 is corrupt is an open problem. However,
the algorithm in Section 5.6.2 also satisfies a property with only a slightly weaker
consequent than F, even if both the senders of mi and m2 are corrupt. We suspect
that this property may suffice in some situations.
5.5 Preventing denial
In this section we discuss two methods for preventing denial attacks. More pre-
cisely, the algorithms discussed in this section ensure that if a correct process R
90
receives messages mi and m2, where the sender of mi is correct and mi m2
then C(mi, m?) is true when evaluated at R. So, in the example of Figure 5.1,
these protocols ensure that if mi is causally before m2, then Q cannot "backdate"
m2 to appear causally before or concurrent with mi.
5.5.1 The causality server
Our first solution employs a trusted causality serven Intuitively, the causality
server acts as an intermediary between all pairs of processes in the system. Each
correct process directly communicates with (i.e., sends messages to or receives
messages from) only the causality server, via an authenticated, FIFO channel that
protects the integrity and secrecy of communication. For one process to send a
message to another process, the former sends it to the causality server. For each
process R, the causality server forwards messages destined for 1? to R, in the order
in which the server receives those messages (see Figure 5.2).
The causality server CS forwards messages
destined for a process to that process in the
order in which it receives those messages.
Figure 5.2: The causality server CS
This simple causality server ensures that if processes detect causal relationships
with the predicate
91
true if m1 is received before m2,
C(mi,m?) =
false otherwise,
then it is not possible for a corrupt process to deny the causal relationships that
its messages have with causally prior messages from correct processes.
Theorem 5.1 This algorithm satisfies E and satisfies D if the sender of m1 is
correct.
Proof.
D: Suppose there are messages m1 and m2 such that mi H m2 and the sender
of m1 is correct. Also suppose that R is a (correct) recipient of m1 and m2.
If R is the sender of m1 (i.e., R sent mi to another process), then because m1
is received at R immediately after it is sent, R receives m1 before m2. Now
suppose some other process sends mi to R. Because the channel from the
sender of mi to the causality server is FIFO, mi must arrive at the causality
server before any message m such that m1 H m. So, m1 is forwarded to
(and thus is received by) R before any such message destined for R, and in
particular, before m2.
E: Suppose there exists a causal chain el...., ei such that el = send(m1),
e1 = send(rn2), and for each j ? ?1,..., l?, e? is executed at a correct process.
By the argument for D, C(m1, m?) is true at any correct recipient of m1 and
m2. Then, because if nil is received before m2 then ?2 is received after nil,
C(m2,m1) is false.
A warranted concern with the use of a causality server is performance: this
scheme results in twice as many messages being transmitted over the network than
without the causality server, and the server may become a traffic bottleneck in the
92
system. However, the degree to which a causality server would become a bottleneck
might be less than at first expected, because the causality server has very little
processing to do on each message it receives and forwards. In fact, in a likely
implementation it would simply need to decrypt the message, appropriately check
and attach channel sequence numbers (to implement FIFO order), re-encrypt the
message, and forward it. Supposing that encryption and decryption can be done
in hardware, the performance impact seen by processes could be tolerable.
A second concern with this scheme is that it introduces a single point of failure,
namely the causality server, into the system, because all communication would
cease if the causality server failed. This problem can be addressed using known
replication techniques (e.g., the techniques of Chapter 2 or [Sch9Oj), albeit at an
additional cost to performance.
5.5.2 The conservative approach
An alternative approach to the use of a causality server is for each process P to
delay sending a message to its destination until all messages that P previously
sent to other destinations have been received at those destinations.2 In general, a
sender can be informed of the receipt of its messages by acknowledgements. These
acknowledgements would occur as part of a lower layer protocol, and would not
result in additional process events or be delayed like messages.3 Processes again
detect causal relationships with the predicate
C(rn1, m?) =			true			if rnl is received before rn2,
false			otherwise.
Theorem 5.2 This algorithm satisfies E and satisfies D if the sender of mi is
correct.
2A further condition is required if multicast communication is used (see [BSS9l]). However,
as stated in Section 5.2, we restrict ourselves in this chapter to point-to-point communication.
3These acknowledgements could be viewed as introducing additional causal relationships.
However, since acknowledgements carry no application-specific information, these relationships
are unlikely to be of interest in most settings and thus are omitted from the present discussion.
93
Prnof
D: Suppose there are messages mi and m2 such that m1 rn2 and the sender
of mi is correct. Also suppose that R is a (correct) recipient of mj and m2.
If R is the sender of mi (i.e., R sent mi to another process), then because mi
is received at R immediately after it is sent, R receives m1 before m2. Now
suppose some other process sends m1 to R. If the same process also sends
m2, then R receives mi first because channels between correct processes are
FIFO. Otherwise, m2 can be sent only after mi is received at R, because the
sender of m1 does not communicate to destinations other than R untll R has
received mi.
E: Suppose there exists a causal chain e1,. .. , ej such that el = send(m1),
= send(m2), and for each j ? ?1,..., 1?, Cj is executed at a correct process.
By the argument for D, C(mi, m?) is true at any correct recipient of mi and
m2. Then, because if m1 is received before m2 then m2 is received after mi,
C(m2, mi) is false.
This approach, sometimes called the conservatzve approach, has been used by
several systems to detect causal relationships in benign environments (e.g., [Sch9O,
BSS91]). It is especially attractive in our setting because a correct process can
singlehandedly prevent corrupt processes from "backdating" their messages to
wrongly appear causally prior to or concurrent with its own. That is, it need
not rely on a third party for this guarantee. Moreover, this solution introduces no
bottleneck or single point of failure into the system.
Communication performance achieved with the conservative approach can vary
widely, depending on the particular communication patterns exhibited by pro-
cesses. Because a process delays sending a message to a destination only when it
does not know of the receipt of a message it previously sent to a different destina-
94
tion, processes can achieve the full performance benefits of asynchronous communi-
cation when streaming messages to a single destination. However, when processes
send to many different destinations in quick succession, the communications are
essentially reduced to synchronous remote procedure calls.
From a security point of view, the most significant disadvantage of the conser-
vative protocol is the potential for denial-of-service attacks. A corrupt process can
prevent a sender of a message from being able to send to any other destinations by
refusing to acknowledge any messages sent to it. (This form of "attack" can occur
even in benign environments if a process simply crashes.) Different policies can be
implemented to deal with this problem, and which is best depends on the partic-
ular system and application. One approach is implemented in the Horus system,
which uses a version of the conservative protocol adapted for multicast commu-
nication [BSS9l, RBG92]. In Horus, a trusted, fault-tolerant service called the
failure detector declares processes faulty when they appear so, thus removing them
from the system view [RB91,R?ic92].4 The result is that a process that attempts
denial-of-service attacks by refusing to acknowledge messages will eventually be
considered faulty and ignored by all correct processes in the system. In particular,
any process waiting for acknowledgements from such a process would be allowed to
proceed with sending to other processes without jeopardizing causality detection.
5.6 Preventing forgery
In this section we present two algorithms that satisfy F if the sender of m1 is
correct. That is, they ensure that if a correct process R receives m1 and m2, the
sender of mi is correct, and mi 74 m2, then C(m1, m2) is false when evaluated
4Actually, the use of the failure detector in the secure version of Horus is not this simple. Its
fault-tolerance requirements and need to be protected make it subject to the tradeoff between
security and availability discussed in Chapter 1. Moreover, it is not directly amenable to the
techniques of Chapter 2. Thus, the use of the membership protocol of [Ric92] on a per process
group basis might be preferable to avoid the reliance of all groups on a single failure detection
service. A membership protocol that is tolerant of corruptions and that could facilitate a single,
system-wide failure detector is a topic of ongoing research (see Section 2.6.1).
95
at R. As discussed in Section 5.4, satisfying F under only the assumption that
the sender of m2 is correct is an open problem. However, the second algorithm
presented here does satisfy a property with only a slightly weaker consequent than
F, even if both the senders of mi and m2 are corrupt. We believe that especially
in the case in which the sender of m2 is correct, this property may suffice for some
applications.
These algorithms use a digital signature scheme. We assume that each process
Pj holds a private key Kj with which it can sign information so that any other
process can verify the information's origin and authenticity. Information m so
signed is denoted fmlKj
5.6.1 Signed vec;tor timestamps
Our first algorithm originates from a technique introduced in Lamport's paper on
causality [Lam78b], where he described an algorithm using logical clocks to detect
causal relationships among messages (in benign environments). In his technique,
each process Pj maintains a logical clock tj that assigns a value t?[e] to each event e
executed at Pj, according to the following constraint known as the clock condition:
Ti: For any events e1 and e2, if el e2, then t?[e1] ? t?[e2].
(The notation "t?[e]" implies that Pj executed c.)
In Lamport's algorithm, each logical clock tj was implemented by an integer
counter and "?" was normal integer less-than (<); thus, it was not possible to at-
tain the converse of the clock condition, as well. Later, however, several researchers
(e.g., [Mat89]) extended the notion of logical clocks to that of vector clocks and
defined a new relation "?" on them so that the converse condition could also be
satisfied:
T2: For any events e1 and e2, if t?[e1] ? tj[e2], then el e2.
96
In the algorithm in [Mat89], each process Pj maintains a vector clock tj =
(tt1?, t2j, , t??, where n is the total number of processes in the system and for each
k ? [1,. , n?, t2k. is a nonnegative integer. Vector clock values t --H (t1 ,
and t = (t?,... , are ordered according to the following relation: t ? t iff for all
k E ?1,. .. ,nJ, tk ? t?, and there exIsts a k E [1,... ,?1 such that tk < t?. The
algorithm to satisfy Ti and T2 is as follows:
1. When process Pj begins execution, tj is initialized to all zeroes.
2. Process Pj increments tt; before executing each event.
3. If send(m) is executed by process Pi, then the timestamp Tm = ti is sent
with m ti[send(m)] is defined to be tj.
4. If receive(m) is executed by process Pj, then for all k ? [1,..., n?, Pj sets
t3k. to max[tjk.,Tmky, where Tmk is the k-th component of Tm. t5[receive(m)] is
then defined to be tj.
Because the timestamp on a message m sent by Pj is Tm = t?[send(m)] (by
step 3), this algorithm can be seen as using the following predicate to determine
the causal relationship between two messages ?i and rn?:
C(m1, m?) =			true			if Tm1 ? Tm2,
false			otherwise.
In our system model, this algorithm does not suffice to prevent processes from
forging causal relationships, because a corrupt process can easily manipulate com-
ponents of vector timestamps. For instance, in Figure 5.1, Q could easily fabricate
a timestamp Tm2 to make m2 wrongly appear causally after m1.
We thus propose a technique to prevent this. In our approach, processes main-
tain vector clocks as before. However, each process Pj digitally signs the i-th
component of each timestamp it includes with a message, and this signed value is
then propagated by other processes in the i-th components of the timestamps they
97
include with their messages. So, when a process Pj executes send(m), it includes
with m a vector timestamp of the form
Tm = (?41Kl,ft%?1K2,.. ?ttt?1Kn'),
where for each k # i, tt%klKk was received by Pj in a previous receive event. The
requirement that each (nonzero) component of a vector timestamp be signed by
the corresponding process prevents corrupt processes from inflating components of
correct processes.
More precisely, the algorithm executes as follows:
1. When process Pj begins execution, tj is initialized to all zeroes.
2. Process Pj increments t? before executing each event.
3. If send(m) is executed by process Pj, then the timestamp Tm = .......,
is sent with m, where for each k E f1,..., ni,
Tmk =			0			iftk? --H0,
?tki1Kk			otherwise.
4. If receive(m) is executed by process 1), and for all k E f1,..., ?i, Tmk is
properly signed by ?k or is zero, then for all k ? ?1,..., ni, 1) sets t3k. to
maxft3k.,Tmk?, where Tk --H 0 if Tmk --H 0, and Tmk = fTmk?K? otherwise. Then,
for each k ? f1,...,n? such that t3k > 0, I) saves ?tJk1Kk, which it either
received as Tmk or already had prior to this event.
If some nonzero Tmkis not properly signed by ?k,then because communication
channels between correct processes protect the integrity of communication,
this message must be from a corrupt process and is therefore ignored.
Note that each Tmk can always be computed by a correct process Pj in step
3 of this algorithm, because if k # i and tkj $ 0, then by step 4, Tmk = fttklKk
98
was received and saved by Pj in a previous receive event. Processes detect causal
relationships between messages with the same predicate as before, adjusted for the
signatures:
true			ifTrn1 ? Tm2 andVk c ?1,...,ni,each
C(m1,m2)=			ofTrnkiandTrnk2 is signed by Pk or isO,
false			otherwise,
where Tm = (i???,. . .
Theorem 5.3 This algorithm satisfies E and satisfies F if the sender of mi is
correct.
Proof.
E: Suppose there is a causal chain e1...., e1 such that el = send(m1), e1 =
send(m2), and for each j E f1,..., l?, e? is executed at a correct process.
By construction, each component of Tm1 and Tm2 is properly signed or zero,
and Vk ? 11,...,nl, Tmki < Tmk2 by step 4. Moreover, if the sender of m2
is Pj, then Tm?i ? Tm?2 by step 2. So, by the definition of "?" for vector
timestamps, Tmi ? Tm2 and Tm2 $ Tmi.
F: Suppose that a correct process R receives mi and m2, where the sender
Pj of mi is correct, and that C(m1,m2) is true at R. Then, ?1 ? Th2
Moreover, by step 2 of the algorithm, Tm?i > 0, and so Tm?2 must be signed
by Pj. If Pj sent m2, then mi H m2 because T& <Tm?2 implies that Pj sent
m2 after mi. If another process sent m2, then there must be a causal chain
by which Tm?2 traveled from Fj to the sender of m2. Because Pj released
only with mi or a causally subsequent message, it follows that mi H m2.
In this algorithm, if Pj is correct, then corrupt processes cannot inflate times-
tamps' i-th components above their proper values, because the signatures for the
99
inflated values are not predictable before Pj releases them. Thus, this technique
is similar to the use of nonce identifiers [NS78], in that causal relationships are
established by the presence of "new," unpredictable, and verifiable values (i.e., the
signed components) in messages. However, our algorithm is more powerful because
any process can verify each value, and not just the process that issued it. This
technique also has other beneficial features; in particular, it requires no centralized
servers, and communication can proceed completely asynchronously.
The primary weakness of this algorithm is its ability to scale. As n becomes
large, signed vector timestamps could consume significant network bandwidth.
Techniques similar to some of those described in [BSS9i] for compressing times-
tamps in benign systems are appropriate for use in our system model but will not
be discussed here. A second threat to scale is that the cost of computing and
verifying signatures could be significant if n is large. However, a signature scheme
with a fast verification algorithm could lessen this burden, because in this use,
signatures will typically be verified more frequently than they are created.
5.6.2 The piggybacking algorithm
Our second algorithm for satisfying F if the sender of mi is correct is based on a
piggybacking technique that, to our knowledge, was first used in an early version
of the Isis system to detect causal relationships in benign settings [BJ87b].5 This
algorithm is more costly than that in Section 5.6.1. However, it is interesting
because it also satisfies the following property (which is slightly weaker than F),
even if both the senders of mi and m2 are corrupt:
F': If C(mi, m?) is true at any correct recipient of m1 and m2, then there exists
a message m3 with the same contents as m1 such that m? m2.
5We are grateful to Thshar Chandra for suggesting the idea of piggybacking, which lead us to
the algorithm of this section.
loo
Note that this property does not ensure that mi H m2, but only that some
message identical to mi causally precedes m2. While F' holds with no assumptions
on the senders of mi and m2, it is primarily of interest in the case in which the
sender of m2 is correct. In this case, F' can substantially limit what a corrupt
process can choose for the contents of mi once m2 is sent (if C(rni, m?) is to be
true). Moreover, we will describe additions to our algorithm that place even greater
restrictions on the contents of mi.
Intuitively, the algorithm is very simple. When a process P sends a message m,
it "piggybacks" on (i.e., includes with) m a set Hm of all messages that P received
in the past and the messages piggybacked on them. This is illustrated in Figure
5.3, where P sends m1 to R and then m to Q, and then Q sends m2 to R. A
process that receives two messages rni and m2 considers mi to be causally before
m2 only if (a message with the same contents as) mi appears in Hm2.
Here P sends mi to R and then sends m to Q,
piggybacking m? on rn. Then, Q piggybacks
mi and m on message m2 to R.
Figure 5.3: The piggybacking algorithm
More precisely, the algorithm executes as follows:
1. Each process Pj maintains a set hj that is initially empty.
`01
2. If Pj executes send(m), Hm = hj is sent with m.
3. If 1) executes receive(m), it sets hj to
hj U Hm U tml.
Processes detect causal relationships as follows:
C(m1, m?) =			true			if rrt1 ? Hm2,
false			otherwise.
Here, "ml E Hm2,, means that a message with contents identical to mi appears in
Hm2.
While the algorithm already satisfies F', additional measures must be taken to
satisfy E and to satisfy F if the sender of mi is correct. To satisfy F under only
the assumption that the sender of mi is correct, it must not be possible for the
sender of m2 to include (the contents of) mi in Hm2 unless mi m2. That is,
m1 must be unpredictable. In addition, to satisfy E, the contents of messages sent
by correct processes must be unique. To see why, suppose there exist messages mi
and m2 such that nij causally precedes m2 by means of a causal chain traversing
correct processes only. If the sender of m2 had previously sent a message whose
contents were identical to those of m2, then this message could appear in Hmi,
thus causing C(m2, mi) to be true at a correct recipient of mi and m2.
One way to make correct processes' messages unique and unpredictable is for
the k-th message m from Pj to I) to be constructed in the form "?i,j, k : datalKi"
where data denotes the data to be sent in the message (not including Hm). Speci-
fying i, j and k in the message makes the message contents unique, and including
the signature makes the message contents unpredictable. Then, we can prove
Theorem 5.4 This algorithm satisfies E and F', and satisfies F if the sender of
mi is correct.
Proof.
102
E: Suppose there exists a causal chain ...... , e? such that e1 = send(m1),
= send(m2), and for each j c ?1,..., 1?, e? is executed at a correct process.
By construction, mi E Hm2; so, C(mi, m?) is true. In addition, since the
signature in m2 first appeared when m2 was sent and because m2 74 m1, m2
could not be in Hm1.
F': Suppose that a correct process R receives ?i and m2, and that C(m1, m?)
is true at R. Then, mi ? Hm2. Consider any causal chain e1...., el of
maximum length such that el = send(m) for some m, e? = receive(m2) at
R, and mi E Hm. Such a chain exists because, e.g., the chain send(m2)
receive(m2) satisfies these requirements. Then, there is some message m'
identical to m1 such that receive(m') was executed at the sender of m before
send(m).6 So, m' H m2.
F; Suppose that a correct process R receives m1.and m2, the sender P of m1
is correct, and C(m1,m2) is true at R. Then, mi Ei Hm2. If P sent m2, then
because each message sent by P is unique, mi ? Hm2 implies that mj m2.
If another process sent m2, then m1			m2 because the contents of mi cannot
be predicted by the sender of m2.
As mentioned earlier, F' is of interest primarily when the sender of m2 is correct
(and thus does not cooperate with the sender of mi to forge causal relationships).
To see why, suppose that a corrupt process P intends to send a message mi,
m1 74 m2, 50 that C(mi, m?) is true at a correct common destination R. ?? dictates
that P choose the contents of mi from those messages m? such that m? m2. If
m2 has not yet been sent, P could try to predict its possible choices for mi and
send these to the sender of m2. Once m2 is sent, however, P's choices are limited.
6Strictly speaking, the sender of m, if corrupt, could have created m' and included m' in Hm,
without receiving m'. For all practical purposes, however, this can be modeled as it sending m'
to itself (and thus receiving m') before sending m.
103
Moreover, by adding some additional checking to our algoritlim, we can further
narrow the choices available for the contents of m1. Note that after receiving m1
(on the channel from P) and m2, R can detect if
o+ the sender and receiver listed in m1 are not P and 1?, respectively,
o+ m1 is the k-th message that R received from P but the sequence number
listed in mi is not k,
o+ mi is not properly signed by P, or
o+ there are multiple (non-identical) messages in Hm2 listing the same sender,
receiver, and sequence number as mi and bearing P's signature.
Suppose that R defines C(mi, m?) to be false if any of these hold (and thus P is
corrupt), even if mi ? Hm2. Then, once m2 is sent, P has at most one choice
for the contents of each message mi it sends on its channel to R that will make
C(m1,m2) true at R.
Several improvements to this algorithm can be made in practice. First, instead
of piggybacking Hm 011 each message m, a process need only piggyback those mes-
sages in Hm not piggybacked on a prior message to the same destination. If the
destination maintains messages piggybacked from each sender, then Hm can be
reconstructed when m is received. Second, a message need not be transmitted sep-
arately if it will eventually reach its destination piggybacked on another message,
although this delays the former message to be received no earlier than the latter.
A third improvement (that is incompatible with the second) uses message di-
gests to limit the size of piggybacked messages. A message digest algorithm (e.g.,
[Riv91]) produces a fixed length message digest from an input of arbitrary length,
in such a way that it is computationally infeasible to produce any input having
a prespecified target message digest, or to produce two inputs having the same
message digest. So, for all practical purposes, a message digest uniquely identifies
104
an input. Using a message digest algorithm f, the algorithm can be modified to
limit the length of piggybacked messages as follows:
1. Each process Pj starts with hj initially empty.
2. If Pj executes send(m) and m is Pj'5 k-th message to I), then Hm = hi and
Dm = ti,j,k : f(m)lKi are sent with m.
3. If 1% executes receive(m), it sets hj to
hjUHmUfDml.
The predicate to detect causal relationships becomes
true			if fi,j,k : f(mi))Kj e Hm2, where
C(mi,m?) =			mi is of the form ?i,j,k : data1Ki?
false			otherwise.
The four previously mentioned checks on m1 and Hm2 can also be employed in
this new algorithm. Moreover, it is not difficult to verify that while the item Dm
created during a send(m) must contain a signature, the message m does not. So,
this algorithm can be optimized further by not requiring a signature on m and
removing the third of the four additional checks enumerated above.
Other possible improvements include garbage collecting messages from the hi's
(at the cost of sacrificing E in some cases), when causal relationships involving
those messages are no longer of interest.
5.7 Summary and discussion
In this chapter we have attempted to formalize the problems with detecting causal-
ity in hostile environments and to provide algorithms to overcome these problems
in some situations. In particular, we have introduced two new notions--Hdenial
105
and forger?that capture the ways in which causality can be mistakenly not de-
tected or detected. We have presented two algorithms for preventing denial and
two algorithms for preventing or limiting forgery in some situations.
We became aware of the importance of detecting causality in hostile environ-
ments during the initial stages of our work to integrate security into the Horus
system. Because Horus employs a variation of the conservative protocol of Section
5.5.2 (see [BSS91]), causal denial is already prevented in the system in some cases.
One direction for future work might be to implement others of our algorithms in
the system for comparison purposes and to provide greater defense against attacks
on causality detection. Other directions for future work are discussed in Section
5.7.2.
5.7.1 Related work
In a private communication in February 1993, Doug Tygar notified us of some
closely related work being done by him and Sean Smith [ST93]. In their work,
they have similarly identified the detection of causal relationships to be important
in hostile environments and have pursued ways of detecting causal relationships
despite malicious behavior. In particular, they independently discovered a protocol
very similar to that of Section 5.6.1 of this chapter (see [ST93J).
Despite these similarities, however, the works are also substantially different.
Smith and Tygar arrived at a different formulation of the problem of detecting
causal relationships and, to our knowledge, developed only an approach to pre-
venting (what we call) forgery, namely the technique of signed vector timestamps
described in Section 5.6.1 of this work. On the other hand, while we have studied
causality detection only as an essential aspect of preserving zntegrn'ty in distributed
systems, they also raise consideration of the ramifications of causality detection on
secrecu requirements. In particular, in [ST93] they introduce notions of forward
confinement and backward confinement, which characterize requirements that in-
106
formation about a process not leak to other processes executing events causally
after or before its own, respectively. The interested reader is referred to [ST93] for
further discussion of these issues.
5.7.2 Future work
There are several directions for future work that we hope to pursue. The most
obvious is to find new and better algorithms to detect causal relationships. In
particular, what can be done toward satisfying D or F if the sender of mi can
be corrupt should be examined more closely. Less general algorithms that exploit
knowledge of communication patterns are also of interest, especially if applicable
to large classes of distributed algorithms.
We also hope to examine further uses of causality for security. For example,
consider the following use of causality to detect freshness, a property studied ex-
tensively by the security community. A message is fresh in a run of a protocol if its
contents have not appeared in another message sent before this run of the protocol
began [BAN89,AT91]. One way to detect freshness is to use causality: if a message
can be verified to be causally after a fresh message, then it too should be considered
fresh. One common technique for detecting freshness, namely challenge-response
interactions [NS78J, is an instance of this method. In this technique, P challenges
Q with a new, unpredictable nonce identifier, which Q must include in its response
to P. The appearance of the nonce identifier in the response convinces P that the
response was computed causally after P's message and thus that the response is
fresh. This technique could be generalized using the techniques of Section 5.6 to
enable a process other than the challenger to verify the freshness of the response.
In the future we hope to examine other uses of causality detection.
Another direction for future research is to explore the degree to which pat-
terns of communication must be restricted to prevent denial and forgery in certain
situations. It is interesting to note that both of our algorithms for preventing de-
107
nial synchronize communication: they eliminate all executions in which there are
messages m1 and m2 such that the sender of mi is correct, mi m2, and yet
m2 is received before mi at a correct common destination. On the contrary, nei-
ther of our algorithms for preventing forgery restrict patterns of communication at
all. We suspect that these are not properties of our algorithms alone, but suggest
requirements inherent in the problems.
Finally, another difficult problem is how a process P can determine whether
it has received all messages sent to it that are causally prior to a certain received
message. Such determinations are necessary if, e.g., P must deliver received mes-
sages to an application in an order consistent with the causal relationships among
them (e.g., [Sch90,BSS91]). The algorithms of Section 5.5 ensure that all causally
prior messages have been received if all such messages are sent by correct pro-
cesses, although this does not necessarily hold if a causally prior message is sent
by a corrupt process.
Chapter 6
Conclusion
In this thesis we have considered problems involved in providing security and fault-
tolerance simultaneously in distributed systems. One premise of this work, ex-
pressed previously in the literature, is that the goals of security and availability
are in conflict, because by replicating data and services for availability, one also
makes them harder to protect. This conflict is particularly troublesome for security-
critical data and services that underpin the security mechanisms of a distributed
system.
One of the primary contributions of this thesis is the design and implemen-
tation of a security architecture for fault-tolerant systems. The tradeoff between
security and availability is addressed in two ways in our architecture. At the level
of user applications, the architecture provides a framework that enables the user
to balance this tradeoff for each application individually. The framework consists
of secureprocessgroupswithin which a user can efficiently replicate applications in
a protected fashion, as described in Chapter 4. Authentication and access control
mechanisms enable the group members to prevent untrusted processes from join-
ing. And, provided that the members admit only processes on trustworthy sites,
the members will enjoy secure communication and correct group semantics among
themselves. Thus, the conflict between security and availability at this level of the
108
109
system is passed to the user, by providing the authentication and access control
mechanisms necessary to enable him or her to control where and how widely each
application is replicated.
The second level at which this conflict is addressed is in the core security
services that underlie these abstractions, and indeed the security of all process
groups. These services are an authentication service and a time service, which
were described in Chapter 3. By using replication only when necessary, and by in-
troducing novel replication techniques when it was necessary, we constructed these
services to be easily defensible against attack. And, the transient unavailability
of even a substantial number of servers does not hinder key distribution between
principals or expose protocols to intruder attacks. While these services are used
in our architecture to support the distribution of group keys during the group join
protocol, they could be employed in many settings to achieve fault-tolerant key
distribution.
The method by which the authentication service was replicated is actually an
instance of a more general replication technique that is applicable to a wide range of
services. This technique, which we described in Chapter 2, can be used to replicate
a service so that it will remain correct and available despite the corruption of some
servers and clients. Moreover, this technique has advantages over previous work
in this area, in that it does not require clients to authenticate servers individually
and it defends against attempts by an intruder to exploit violations of causality in
the sequence of requests processed by the service.
The accurate detection of causal relationships can be important for security in
other settings, as well. In some cases, it may be essential to the implementation
of a security policy that if a causal relationship exists, then it is detected. In
others, it may be more important that if a relationship is detected, then it actually
exists. In Chapter 5, we provided a framework for expressing these requirements
and algorithms to achieve them in some situations.
110
Looking to the future, our work brings to mind many directions for further in-
vestigation. Those most closely related to the technical content of this thesis have
already been described in preceding chapters. More generally, however, the prob-
lem of addressing security and fault-tolerance simultaneously can be viewed as an
instance of the more general problem of combining different computer system de-
pendability [Lap89J requirements (e.g., safety, security, reliability). We believe that
combining such requirements in real systems holds many exciting opportunities for
future research.
Bibliography
[ADKM92]
[AT9l]
[BAN89]
Y. Amir, D. Dolev, 5. Kramer, and D. Malki. Transis: A commmuni-
cation sub-system for high availability. In Proceedings of the Twenty-
Second International Symposium on Fault- Tolerant Computing, pages
76--H84, July 1992.
M. Abadi and M. R. Thttle. A semantics for a logic of authentication.
In Proceedings of the Tenth Annual ACM Symposium on Principles of
Distributed Computing, pages 201--H216, August 1991.
M. Burrows, M. Abadi, and R. Needham. A logic of authentica-
tion. Technical Report 39, Digital Equipment Corporation Systems
Research Center, February 1989.
[BCG9l] K. P. Birman, R. Cooper, and B. Gleeson. Design alternatives for
process group membership and multicast. Technical Report 91-1257,
Department of Computer Science, Cornell University, December 1991.
[Bir93]
[BJ87a]
[BJ87b]
[BM9OJ
K. P. Birman. The process group approach to reliable distributed
computing. To appear in Communications of the ACM? 1993.
K. P. Birman and T. A. Joseph. Exploiting virtual synchrony ill dis-
tributed systems. In Proceedings of the Eleventh Symposium on Op-
erating Systems Principles, pages 123--H138, November 1987.
K. P. Birman and T. A. Joseph. Reliable communication in the pres-
ence of failures. ACM Transactions on Computer Systems, 5(1):47--H76,
February 1987.
5. M. Bellovin and M. Merritt. Limitations of the Kerberos authen-
tication system. Computer Communication Review, 20(5): 119--H132,
October 1990.
E. Brickell. A survey of hardware implementations of RSA. In
G. Brassard, editor, Advances in Cryptology--HCRYPTO `89 Proceed-
[Bri9O]
111
112
[BSS91]
[CASD85]
[CD89]
[Cri89]
[CZ85]
[DDN9l]
[Den84]
[DES77J
[DF9O]
[DF92]
[D1179]
ings, Lecture Notes in Computer Science 435, pages 368--H370. Springer-
Verlag, 1990.
K. P. Birman, A. Schiper, and P. Stephenson. Lightweight causal and
atomic group multicast. ACM Transactions on Computer Systems,
9(3):272--H314, August 1991.
F. Cristian, II. Aghili, R. Strong, and D. Dolev. Atomic broadcast:
From simple message diffusion to Byzantine agreement. In Proceedings
of the Fifteenth International Symposium on Fault- Tolerant Comput-
ing, pages 200--H206, June 1985. A revised version appears as IBM
Research Laboratory Technical Report RJ5244 (April 1989).
B. Chor and C. Dwork. Randomization in Byzantine agreement. Ad-
vances in Computer Research, 5:443--H497,1989.
F. Cristian. Probabilistic clock synchronization. Distributed Comput-
ing, 3(3):146--H158, 1989.
D. R. Cheriton and W. Zwaenepoel. Distributed process groups fn
the V kernel. ACM Transactions on Computer Systems, 3(2):77--H107
May 1985.
D. Dolev, C. Dwork, and M. Naor. Non-malleable cryptography. In
Proceedings of the 23rd Annual ACM Symposium on Theory of Com-
puting, pages 542--H552, May 1991.
D. E. Denning.
cryptosystems.
1984.
Digital signatures with RSA and other public-key
Communications of the ACM? 27(4):388--H392, April
Data encryption standard. National Bureau of Standards, Federal In-
formation Processing Standards Publication 46, Government Printing
Office, Washington, D. C., 1977.
Y. Desmedt and Y. Frankel. Threshold cryptosystems. In G. Brassard,
editor, Advances in Cryptology--HCRYPTO `89 Proceedings, Lecture
Notes in Computer Science 435, pages 307--H315. Springer-Verlag, 1990.
Y. Desmedt and Y. Frankel. Shared generation of authenticators
and signatures. In J. Feigenbaum, editor, Advances in Cryptology--H
CRYPTO `91 Proceedings, Lecture Notes in Computer Science 576,
pages 457--H469. Springer-Verlag, 1992.
W. Diffie and M. E. Hellman. Privacy and authentication: An in-
troduction to cryptography. Proceedings of the IEEE, 67(3):397--H427,
March 1979.
113
[DoD85] Department of Defense. Department of Defense trusted computer sys-
tem evaluation criteria, DOD 5200.28--HSTD, December 1985.
[DR86]
[DS81]
[E1G85]
[FD92]
[Fel87]
[FLP85]
[GM82]
[GMD91]
[Gon89a]
J. E. Dobson and B. Randell. Building reliable secure computing
systems out of unreliable insecure components. In Proceedings of the
1986 IEEE Symposium on Security and Privacy, pages 187--H193, April
1986.
D. E. Denning and G. M. Sacco. Timestamps in key distribution
protocols. Communications of the ACM, 24(8):533--H536,August 1981.
T. ElGamal. A public key cryptosystem and a signature scheme based
on discrete logarithms. IEEE Transactions on Information Theory,
IT-31(4):469--H472, July 1985.
Y. Frankel and Y. Desmedt. Distributed reliable threshold multisig-
nature. Technical Report TR--H92-04--H02, Department of EE & CS,
University of Wisconsin at Milwaukee, April 1992.
P. Feldman. A practical scheme for non-interactive verifiable secret
sharing. In Proceedings of the 28th Annual Symposium on Foundations
of Computer Science, pages 427--H437, October 1987.
M. J. Fischer, N. A. Lynch, and M. 5. Paterson. Impossibility of
distributed consensus with one faulty process. Journal of the ACM?
32(2):374--H382, April 1985.
J. A. Goguen and J. Meseguer. Security policies and security models.
In Proceedings of the 1982 IEEE Symposium on Security and Privacy,
pages 11--H20, April 1982.
J. M. Galvin, K. McCloghrie, and J. R. Davin. Secure management of
SNMP networks. In I. Krishnan and W. Zimmer, editors, Integrated
Network Management, H, pages 703--H714. Elsevier Science Publishers
B.V. (North-Holland), April 1991.
L. Gong. Securely replicating authentication services. In Proceedings
of the Ninth International Conference on Distributed Computing Sys-
tems, pages 85--H91, 1989.
[Gon89b] L. Gong. Using one-way functions for authentication. Computer Com-
munication Review, 19(5):8--H11, October 1989.
[Gon92] L. Gong. A security risk of depending on synchronized clocks. ACM
Operating Systems Review, 26(1):49--H53, January 1992.
114
[Gon93]
[GSTC9O]
[GZ84]
[HT88]
[HW90]
[JA88]
[Jos87]
[KT91]
[LABW92]
L. Gong. Increasing availability and security of an authentication ser-
vice. To appear in IEEE Journal on Selected Areas in Communica-
tions, 1993.
A. Gopal, R. Strong, 5. Toueg, and F. Cristian. Early-delivery atomic
broadcast. In Proceedings of the Ninth Annual ACM Symposium on
Principles of Distributed Computing, pages 297--H309, August 1990.
R. Gusella and 5. Zatti. TEMPO A network time controller for a
distributed Berkeley UNIX system. In Proceedings of the USENIX
Summer Conference, pages 78--H85, June 1984.
M. P. Herlihy and J. D. Tygar. How to make replicated data se-
cure. In C. Pomerance, editor, Advances in Cryptology--HCRYPTO
`87 Proceedings, Lecture Notes in Computer Science 293, pages 379--H
391. Springer-Verlag, 1988.
M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition
for concurrent objects. A CM Transactions on Programming Languages
and Systems, 12(3):463--H492, July 1990.
M. K. Joseph and A. Avizienis. A fault tolerance approach to com-
puter viruses. In Proceedings of the 1988 IFEE Symposium on Security
and Privacy, pages 52--H58, April 1988.
M. K. Joseph. Towards the elimination of the effects of malicious logic:
Fault tolerance approaches. In Proceedings of the 10th NBS/NCSC
National Computer Security Conference, pages 238--H244, September
1987.
M. F. Kaashoek and A. 5. Tanenbaum. Group communication in the
Amoeba distributed operating system. In Proceedings of the Eleventh
International Conference on Distributed Computing Systems, pages
222--H230, May 1991.
B. Lampson, M. Abadi, M. Burrows, and E. Wobber. Authentication
in distributed systems: Theory and practice. ACM Transactions on
Computer Systems, 10(4)265--H310, November 1992.
[Lam78a] L. Lamport. The implementation of reliable distributed multiprocess
systems. Computer Networks, 2:95--H114, 1978.
[Lam78b] L. Lamport. Time, clocks, and the ordering of events in a distributed
system. Communications of the ACM 21(7):558--H565, July 1978.
115
[Lap89]
J. C. Laprie. Dependability: A unifying concept for reliable computing
and fault tolerance. In T. Anderson, editor, Dependability of Resilient
Computers, chapter 1, pages 1--H28. BSP Professional Books, 1989.
[LH9l] C. 5. Laih and L. Harn. Generalized threshold cryptosystems. In
Proceedings of ASIA CRYPT `91,1991.
[LLSG92] R. Ladin, B. Liskov, L. Shrira, and 5. Ghemawat. Providing high
availability using lazy replication. ACM Transactions on Computer
Systems, 10(4) :360--H391, November 1992.
[Mar90] K. Marzullo. Tolerating failures of continuous-valued sensors. ACM
Transactions on Computer Systems, 8(4):284--H304, November 1990.
[Mar93]			K. Marzullo. Personal communication, February 1993.
[Mat89]
[Mi189]
[MPS92]
[NS78]
[OR87]
[OSA9l]
F. Mattern. Virtual time and global states in distributed systems. In
Proceedings of the International Workshop on Parallel and Distributed
Algorithms, pages 215--H226. Elsevier Science Publishers B. V. (North-
Holland), 1989.
D. L. Mills. Network Time Protocol (version 2) specification and im-
plementation. RFC 1119, Network Working Group, September 1989.
5. Mishra, L. L. Peterson, and R. D. Schlicting. A membership pro-
tocol based on partial order. In J. F. Meyer and R. D. Schlicting,
editors, Dependable Computing for Critical Applications 2, pages 309--H
331. Springer-Verlag, 1992.
R. M. Needham and M. D. Schroeder. Using encryption for authenti-
cation in large networks of computers. Communications of the A CM,
21(12):993--H999, December 1978.
D. Otway and 0. Rees. Efficient and timely mutual authentication.
ACM Operating Systems Review, 21(1):8--H11, January 1987.
B. Orup, E. Svendsen, and E. Andreasen. VICTOR: An efficient RSA
hardware implementation. In I. B. Damgard, editor, Advances in
Cryptology--HEUROCRYPT `90 Proceedings, Lecture Notes in Com-
puter Science ?73, pages 245--H252. Springer-Verlag, 1991
L. L. Peterson, N. C. Buchholz, and R. D. Schlichting. Preserving
and using context information in interprocess communication. ACM
Transactions on Computer Systems, 7(3):217--H246, August 1989.
[PB589]
116
[Rab89]
[RB91]
[RB92]
[RBG92]
[RBvR93]
[RD86]
[RD91]
[RG93j
[Ric92]
[Riv91]
[RSA78j
M. 0. Rabin. Efficient dispersal of information for security, load bal-
ancing, and fault tolerance. Journal of the ACM, 36(2):335--H348, April
1989.
A. M. Ricciardi and K. P. Birman. Using process groups to imple-
ment failure detection in asynchronous environments. In Proceedings
of the Tenth Annual ACM Symposium on Principles of Distributed
Computing, pages 341--H351, August 1991.
M. K. Reiter and K. P. Birman. How to securely replicate services.
Technical Report 92-1287, Department of Computer Science, Cornell
University, June 1992.
M. K. Reiter, K. P. Birman, and L. Gong. Integrating security in a
group oriented distributed system. In Proceedings of the 1992 IEEE
Symposium on Research in Security and Privacy, pages 18--H32, May
1992.
M. K. Reiter, K. P. Birman, and R. van Renesse. A security architec-
ture for fault-tolerant systems. Technical Report 93-1354, Department
of Computer Science, Cornell University, June 1993.
B. Randell and J. E. Dobson. Reliability and security issues in dis-
tributed computing systems. In Proceedings of the Fifth Symposium
on Reliability in Distributed Software and Database Systems, pages
113--H118, January 1986.
R. L. Rivest and 5. Dusse. The MD5 message-digest algorithm, July
1991.
M. K. Reiter and L. Gong. Preventing denial and forgery of causal
relationships in distributed systems. In Proceedings of the 1993 IEEE
Symposium on Research in Security and Privacy, pages 30--H40, May
1993.
A. M. Ricciardi. The Group Membership Problem in Asynchronous
Systems. Ph.D. dissertation, Cornell University, November 1992.
R. L. Rivest. The MD4 message digest algorithm. In A. J. Menezes
and 5. A. Vanstone, editors, Advances in Cryptology--HCRYPTO `90
Proceedings, Lecture Notes in Computer Science 537, pages 303--H311.
Springer-Verlag, 1991.
R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining
digital signatures and public-key cryptosystems. Communications of
the ACM, 21(2):120--H126, February 1978.
117
[Sch9O]
[Sed88]
[SES+92]
[5G92]
[Sha79]
[5NS88]
[5575]
[5T93]
[5V93]
[SWL9o]
F. B. Schneider. Implementing fault-tolerant services using the state
machine approach: A tutorial. ACM Computing Surveys, 22(4):299--H
319, December 1990.
H. Sedlak. The RSA cryptography processor. In D. Chaum and W. L.
Price, editors, Advances in Cryptology: Proceedings of EUROCRYPT
`87, Lecture Notes in Computer Science 304, pages 95--H105. Springer-
Verlag, 1988.
5. K. Shrivastava, P. D. Ezhilchelvan, N. A. Speirs, 5. Tao, and
A. Tully. Principal features of the VOLTAN family of reliable node ar-
chitectures for distributed systems. IEEE Transactions on Computers,
41(5):542--H549, May 1992.
5. G. Stubblebine and V. D. Gligor. On message integrity in cryp-
tographic protocols. In Proceedings of the 1992 IEEE Symposium on
Research in Security and Prnvacy, pages 85--H104, May 1992.
A. Shamir. How to share a secret. Communications of the ACM?
22(11):612--H613, November 1979.
J. G. Steiner, C. Neuman, and J. I. Schiller. Kerberos: An authentica-
tion service for open network systems. In Proceedings of the USENIX
Winter Conference, pages 191--H202, February 1988.
J. II. Saltzer and M. D. Schroeder. The protection of information in
computer systems. Proceedings of the IEEE, 63(9):1278--H1308,Septem-
ber 1975.
5. Smith and J. D. Tygar. Signed vector timestamps: A secure proto-
col for partial order time. Technical Report CMU-CS-93-116, School
of Computer Science, Carnegie Mellon University, February 1993.
M. Shand and J. Vuillemin. Fast implementations of PSA cryptogra-
phy. In Proceedings of the 11th IEEE Symposium on Computer Arith-
metic, July 1993.
B. Simons, J. Lundelius Welch, and N. Lynch. An overview of clock
synchronization. In B. Simons and A. Spector, editors, Fault- Tolerant
Distributed Computing, Lecture Notes in Computer Science 448, pages
84--H96. Springer-Verlag, 1990.
J. J. Tardo and K. Alagappan. SPX: Global authentication using
public key certificates. In Proceedings of the 1991 IEEE Symposium
on Research in Security and Privacy, pages 232--H244, May 1991.
[TA9l]
118
[TH86]
[Tsu92]
[Tuc79j
[TY91]
[VK83]
[vRBC+92j
R. Thrn and J. Habibi. On the interactions of security and fault-
tolerance. In Proceedings of the 9th NBS/NCSC National Computer
Security Conference, pages 138--H142, September 1986.
G. Tsudik. Message authentication with one-way hash functions. In
Proceedings of IEEE INFOCOM `92, pages 2055--H2059, May 1992.
W. Tiichman. Heliman presents no shortcut solutions to the DES.
lEFE Spectrum, 16(7):40--H41, July 1979.
J. D. Tygar and B. 5. Yee. Strongbox. In J. L. Eppinger, L. B. Mum-
mert, and A. Z. Spector, editors, Camelot and Avalon, A Distributed
Transaction Facility, chapter 24, pages 381--H400. Morgan Kaufmann,
San Mateo, California, 1991.
V. L. Voydock and 5. T. Kent. Security mechanisms in high-level
network protocols. ACM Computing Surveys, 15(2):135--H171, June
1983.
R. van Renesse, K. Birman, R. Cooper, B. Glade, and P. Stephen-
son. Reliable multicast between microkernels. In Proceedings of
the USENIX Microkernels and Other Kernel Arohitectures Workshop,
April 1992.
