## Load Balancing Schemes for High-throughput Distributed Fault-Tolerant Servers

Roy Friedman and Daniel Mosse
TR96-1616
December 1996

Clusters of workstations, connected by a fast network, are emerging as a viable architecture for building high-throughput fault-tolerant servers. This type of architecture is more scaleable and more cost-effective than a tightly coupled multiprocessor and may achieve as good a throughput. Two of the most important issues that a designer of such clustered servers must consider in order for the system to meet its fault-tolerance and throughput goals are the load balancing scheme and the fault-tolerance scheme that the system will use. This paper explores several combinations of such fault tolerance and load-balancing schemes, and compare their impact on the maximum throughout achievable by the system, and on its survivability. In particular, we show that a fault-tolerance scheme may have an effect on the throughput of the system, while a load-balancing scheme may affect the ability of the system to override failures. We study the scaleability of the different schemes under different loads and failure conditions. The validation of our schemes is done using data taken from emulations of an intelligent networking coprocessor of a telephone switch, which follows, for example, the SS7 signaling protocol.

## Dynamic Light-Weight Groups

Katherine Guo and Luis Rodrigues
TR96-1612
October 28, 1996

The virtual synchrony model for group communication has proven to be a powerful paradigm for building distributed applications. In applications that require the use of a large number of groups, significant performance gains can be attained if these groups share the resources required to provide virtual synchrony. A service that maps user groups onto instances of a virtually synchronous implementation is called a Light-Weight Group Service. This paper discusses the usage of Light-Weight Group protocols in dynamic environments, where mappings cannot be defined a priori and may change over time. We show that it is possible to establish mappings that promote sharing and, at the same time, minimize interference. These mappings can be established in an automated manner, using heuristics applied locally at each node. Experiments using an implementation in the Horus system show that significant performance improvements can be achieved with this approach.

## A Dynamic Light-Weight Group Service

Luis Rodrigues, Katherine Guo, Antonio Sargento, Robbert van Renesse, Brad Glade, Paulo Verissimo and Kenneth Birman
TR96-1611
October 11, 1996

The virtual synchrony model for group communication has proven to be a powerful paradigm for building distributed applications. Implementations of virtual synchrony usually require the use of failure detectors and failure recovery protocols. In applications that require the use of a large number of groups, significant performance gains can be attained if these groups share the resources required to provide virtual synchrony. A service that maps user groups onto instances of a virtually synchronous implementation is called a Light-Weight Group Service. This paper proposes a new design for the Light-Weight Group protocols that enables the usage of this service in a transparent and dynamic manner. As a test case, the new design was implemented in the Horus system, although the underlying principles can be applied to other architectures as well. The paper also presents performance results from this implementation.

## The Hierarchical Daisy Architecture for Causal Delivery

Roberto Baldoni, Roy Friedman and Robbert van Renesse
TR96-1610
September 26, 1996

In this paper, we propose the hierarchical daisy architecture, which provides causal delivery of messages sent to any subset of processes. The architecture provides fault tolerance and maintains the amount of control information within a reasonable size. It divides processes into logical groups. Messages inside a logical group are sent directly, while messages that need to cross logical groups' bounderies are forwarded by servers. We proof the correctness of the daisy architecture and discuss possible optimizations.

## Software for Reliable Networks

Kenneth P. Birman and Robbert van Renesse
Scientific American, May, 1996

The failure of a single program on a single computer can sometimes crash a network of intercommunicating machines, causing havoc for stock exchanges, telephone systems, air-traffic control and other operations. Two software designers explain what can be done to make networks more robust.

## Horus, a flexible Group Communication System

Robbert van Renesse, Kenneth P. Birman and Silvano Maffeis
Communications of the ACM, April 1996.

The emergence of process-group environments for distributed computing represents a promising step towards robustness for mission-critical distributed applications. Process groups have a natural'' correspondence with data or services that have been replicated for availability, or as part of a coherent cache. They can been used to support highly available security domains. And, group mechanisms fit well with an emerging generation of intelligent network and collaborative work applications.

Robbert van Renesse
Proceedings of the 1996 ACM SIGCOMM Conference
Stanford, September 1996

Layering of protocols has been advocated as a way of dealing with the complexity of computer communication. It has also been criticized for its performance overhead. In this paper, we present some insights in the design of protocols, and how these insights can be used to mask the overhead of layering, in a way similar to client caching in a file system. With our techniques, we achieve an order of magnitude improvement in end-to-end message latency in the Horus communication framework. Over an ATM network, we are able to send and deliver messages of varying levels of semantics in about 85 microseconds, using a protocol stack of four layers that were written in ML, a high-level functional language.

## World Wide Failures

Werner Vogels
Proceedings of the ACM SIGOPS European Workshop, Connamoran, Ireland, September 1996

The one issue that unites almost all approaches to distributed computing is the need to know whether certain components in the system have failed or are otherwise unavailable. When designing and building systems that will need to function at a global scale, failure management will need to be considered a fundamental building block. This paper describes the development of a system-independent failure management servcies, which allows systems and applications to incorporate accurate detection of failed processes, nodes and networks without the need for making compromises in their particular design.

## A Framework for Protocol Composition in Horus

Robbert van Renesse, Kenneth P. Birman, Roy Friedman,
Mark Hayden, and David A. Karr
August 1995

The Horus system supports a communication architecture that treats protocols as instances of an abstract data type. This approach encourages developers to partition complex protocols into simple microprotocols, each of which is implemented by a protocol layer. Protocol layers can be stacked on top of each other in a variety of ways, at run-time. First, we describe the classes of protocols that can be supported this way. Next, we present the Horus object model that we designed for this technology, and the interface between the layers that makes it all work. We then present an example layer that implements a group membership protocol. Next, we show how, given a set of required properties, an appropriate stack can be constructed. We look at an example stack of protocols, which provides fault-tolerant, totally ordered communication between a group of processes. The work contributes a standard framework for protocol development and experimentation, provides a high performance implementation of the virtual synchrony model, and introduces a methodology for increasing the robustness of the protocol development process.

## Trading Consistency for Availability in Distributed Systems

Roy Friedman and Ken Birman
TR96-1579
April 8, 1996

This paper shows that two important classes of actions, non left commuting and strongly non commuting, cannot be executed by concurrent partitions in a system that provides serializable services. This result indicates that there is an inherent limitation to the ability of systems to provide services in a consistent manner during network partitions.

## Deciding in Partitionable Networks

Roy Friedman, Idit Keidar, Dalia Malki, Ken Birman and Danny Dolev
TR95-1554
November 27, 1995

Motivated by Chandra and Toueg's work, we study decision protocols in a model that closely approximates real'' distributed systems. Our results show how the weakest failure detector and associated consensus algorithm can be adapted to a network in which omission failures can occur during periods when processes suspect one-another as faulty. For protocols in which a majority subset of the participants can reach decisions on behalf of the system as a whole, we also characterize a series of stages that necessarily arise during execution. Jointly, these findings establish a direct relationship between an extended version of the three-phase commit protocol, which makes progress even when a traditional three-phase commit would block, and the consensus protocol of Chandra and Toueg. Although we do not explore the linkage here, our results should also be applicable to other agreement protocols for systems of this sort, such as leader election and dynamic group membership.

## Strong and Weak Virtual Synchrony in Horus

Roy Friedman and Robbert van Renesse
August 24, 1995

A formal definition of {\em strong virtual synchrony}, capturing the semantics of virtual synchrony as implemented in Horus, is presented. This definition has the nice property that every message is delivered within the view in which it was sent. However, it is shown that in order to implement strong virtual synchrony, the application program has to block messages during view changes. An alternative definition, called {\em weak virtual synchrony}, which can be implemented without blocking messages, is then presented. This definition still guarantees that messages will be delivered within the view in which they were sent, only that it uses a slightly weaker notion of what the view in which a message was sent is. An implementation of weak virtual synchrony that does not block messages during view changes is developed, and it is shown how to use a system that provides weak virtual synchrony even when strong virtual synchrony is actually needed. To capture additional ordering requirements, the definition of {\em ordered virtual synchrony} is presented. Finally, it is discussed how to extend the definitions in order to cope with the fact that a process can become a member of more than one group.

## Packing Messages as a Tool for Boosting the Performance of Total Ordering Protocols

Roy Friedman and Robbert van Renesse
July 07, 1995

This paper compares the throughput and latency of four protocols that provide total ordering. Two of these protocols are measured with and without message packing. We used a technique that buffers application messages for a short period of time before sending them, so more messages are packed together. The main conclusion of this comparison is that message packing influences the performance of total ordering protocols under high load overwhelmingly more than any other optimization that was checked in this paper, both in terms of throughput and latency. This improved performance is attributed to the fact that packing messages reduces the header overhead for messages, the contention on the network, and the load on the receiving CPUs.

## Using Virtual Synchrony to Develop Efficient Fault Tolerant Distributed Shared Memories

Roy Friedman
March 31, 1995

This paper shows how to define consistency conditions for distributed shared memories in virtually synchronous environments. Such definitions allow to develop fault tolerant implementations of distributed shared memories, in which during normal execution, operations can be performed very efficiently, and only those operations which take place during a configuration change must be delayed. Three well known consistency conditions, namely, linearizability, sequential consistency, and causal memory, are redefined for virtually synchronous environments. It is then shown how to provide efficient fault tolerant implementations for these definitions.

## Protocol Composition in Horus

Robbert Van Renesse and Kenneth P. Birman
March 29, 1995

Horus is a communication architecture that treats a protocol as an abstract data type. Protocol layers can be stacked on top of each other in a variety of ways, at run-time. This paper starts out with describing the many classes of protocols that can be supported this way. Next, we describe the Horus object model that we designed for this technology, and the interface between the layers that makes it all work. We then present an example layer which implements a group membership protocol. Then, we look at a example stack of protocols, which provides fault-tolerant, totally ordered communication between a group of processes. We conclude with presenting some remaining challenges in our project.

## Horus: A Flexible Group Communications System

Robbert Van Renesse, Kenneth P. Birman, Bradford B. Glade, Katie Guo, Mark Hayden, Takako Hickey, Dalia Malki, Alex Vaysburd and Werner Vogels
March 23, 1995

The Horus system offers flexible group communication support for distributed applications. It is extensively layered and highly reconfigurable, allowing applications to only pay for services they use, and for groups with different communication needs to coexist in a single system. The approach encourages experimentation with new communication properties and incremental extension of the system, and enables us to support a variety of application-oriented interfaces.

## Achieving Critical Reliability With Unreliable Components and Unreliable Glue

Mark Hayden and Kenneth P. Birman
March 14, 1995

Even the most aggressive quality assurance procedures yield at best probabilistic confidence in the reliability of complex systems. Distributed systems, because of their large numbers of components, are enormously complex engineering artifacts, and hence may appear to be inherently unreliable -- despite the best efforts of researchers and developers. A cellular distributed systems architecture offers the hope of drastically improving the reliability of current technologies in settings where reliability is critical. The approach combines a stateful style of distributed computing within cells with a loosely coupled probabilistic inter-cell computing model based on a probabilistic broadcast primitive. We give an implementation of this primitive, called pbcast, and demonstrate how to use it to implement this methodology. Our approach is compatible with the use of popular distributed computing and reliability technologies, while offering considerable isolation against the spread of failures among cells.

## Preserving Privacy in a Network of Mobile Computers

David A. Cooper and Kenneth P. Birman
March 03, 1995

Even as wireless networks create the potential for access to information from mobile platforms, they pose a problem for privacy. In order to retrieve messages, users must periodically poll the network. The information that the user must give to the network could potentially be used to track that user. However, the movements of the user can also be used to hide the user's location if the protocols for sending and retrieving messages are carefully designed. We have developed a replicated memory service which allows users to read from memory without revealing which memory locations they are reading. Unlike previous protocols, our protocol is efficient in its use of computation and bandwidth. In this paper, we will show how this protocol can be used in conjunction with existing privacy preserving protocols to allow a user of a mobile computer to maintain privacy despite active attacks.

## Incorporating System Resource Information into Flow Control

Takako M. Hickey and Robbert Van Renesse
February 27, 1995

Upcall-based distributed systems have become widespread in recent years. While upcall-based systems provide some obvious advantages, experiences with these systems have exposed unanticipated problems of unpredictability and inefficiency. Incorporating system resources information into flow control is essential in solving these problems. Variants of window-based flow control suitable for distributed systems are investigated. Next, message packing, which improves network bandwidth usage efficiency, and, consequently, message throughput, is presented. Finally, a back pressure mechanism which controls admission of messages into the system by blocking applications at high load is presented. The combination of the window mechanism and the back pressure mechanism provides end-to-end management of system resources. The former manages network resources, while the latter manages operating system resources. The combination maintains good throughput even under high load.

## Design and Performance of Horus: A Lightweight Group Communications System

Robbert van Renesse, Takako M. Hickey, and Kenneth P. Birman
december 1994

The Horus project seeks to develop a communication system addressing the requirements of a wide variety of distributed applications. Horus implements the group communications model providing (among others) unreliable or reliable FIFO, causal, or total group multicasts. It is extensively layered and highly reconfigurable allowing applications to only pay for services they use. This architecture enables groups with different communication needs to coexist in a single system. The approach permits experimentation with new communication properties and incremental extension of the system, and enables us to support a variety of application-oriented interfaces. Our initial experiments show good performance.

## Support for Complex Multi-Media Applications using the Horus system.

Werner Vogels and Robbert van Renesse
December 1994.

A distributed multi-media application involves more than just protocols for the dissemination of video and audio data. As in any other distributed application, protocols are necessary that guarantee the consistency, fault-tolerance, and security of shared data objects. The Horus system offers a framework for buildin g complex distributed systems that involve any number of protocols, as well as a variety of protocols for the diffe rent aspects of a distributed application (including some protocols specific to multi-media applications). We believe that this integrated approach is superior to combining different toolkits, and illustrate this with a detailed example of an existing video-on-demand application.

## A Security Architecture for Fault-Tolerant Systems

Michael K. Reiter, Kenneth P. Birman and Robbert Van Renesse
June 1993

Process groups are a common abstraction for fault-tolerant computing in distributed systems. We present a security architecture that extends the process group into a security abstraction. Integral parts of this architecture are services that securely and fault-tolerantly support cryptographic key distribution using novel techniques. We detail the design and implementation of these services and the secure process group abstraction they support. We also give performance figures for some common group operations.

## Preserving Privacy in a Network of Mobile Computers

David A. Cooper and Kenneth P. Birman
October 26, 1994

Even as wireless networks create the potential for access to information from mobile platforms, they pose a problem for privacy. In order to retrieve messages, users must periodically poll the network. The information that the user must give to the network could potentially be used to track that user. However, the movements of the user can also be used to hide the user's location if the protocols for sending and retrieving messages are carefully designed. In this paper we will describe a set of protocols that we have developed to allow a user with a mobile computer to communicate without compromising privacy.

## Uniform Actions in Asynchronous Distributed Systems

Dalia Malki, Kenneth P. Birman, Aleta M. Ricciardi and Andre Schiper
September 08, 1994

We develop necessary conditions for the development of asynchronous distributed software that will perform {\em uniform} actions (events that if performed by any process, must be performed at all processes). The paper focuses on {\em dynamic uniformity}, which differs from the classical problems in that processes continually leave and join the ongoing computation. Here, we first treat a static version of the problem (lacking joins), and then extend the results so obtained to also include joins. Our results demonstrate that in contrast to Consensus, which cannot be solved in asynchronous systems with even a single faulty process, dynamic uniformity can be solved using a failure detection mechanism that makes bounded numbers of mistakes. Because dynamic uniformity arises in systems that maintain safety within a primary partition'' of a network, our paper provides a rigorous characterization of the framework upon which several existing distributed programming environments are based.

## Understanding Partitions and the No Partition'' Assumption

Aleta M. Ricciardi, Andre Schiper and Kenneth P. Birman
June 1993

The paper discusses partitions in asynchronous message-passing systems. In such systems slow processes and slow links can lead to virtual partitions that are indistinguishable from real ones. This raises the following question: what is a partition'' in an asynchronous system? To overcome the impossibility of detecting crashed processes in an asynchronous system, our system model incorporates a failure suspector to detect (possibly erroneously) process failures. Based on failure suspicions we give a definition of partitions that acccounts for real partitions as well as virtual ones. We show that under certain assumptions about the process behavior, any incorrect failure suspicion inevitably partitions the system. We then show how to interpret the absence of partition'' assumption.

## Virtually-Synchronous Communication Based on a Weak Failure Suspector

Andre Schiper and Aleta M. Ricciardi
April 1993

Failure detectors (or, more accurately, Failure Suspectors - FS) appear to be a fundamental service upon which to build fault-tolerant, distributed applications. This paper shows that a FS with very weak semantics (i.e. that delivers failure and recovery information in no specific order) suffices to implement virtually-synchronous communication (VSC) in an asynchronous system subject to process crash failures and network partitions. The VSC paradigm is particularly useful in asynchronous systems and greatly simplifies building fault-tolerant applications that mask failures by replicating processes. We suggest a three-component architecture to implement virtually-synchronous communication : 1) at the lowest level, the FS component; on top of it, 2a) a component that defines new views, and 2b) a component that reliably multicasts messages within a view. The issues covered in this paper also lead to a better understanding of the various membership service semantics proposed in recent literature.

## Process Membership in Asynchronous Environments

Aleta M. Ricciardi and Kenneth P. Birman
February 1993

The development of reliable distributed software is simplified by the ability to assume a fail-stop failure model. We discuss the emulation of such a model in an asynchronous distributed environment. The solution we propose, called Strong-GMP, can be supported through a highly efficient protocol, and has been implemented as part of a distributed systems software project at Cornell University. Here, we focus on the precise definition of the problem, the protocol, correctness proofs and an analysis of costs. Keywords: Asynchronous computation; Fault detection; Process membership; Fault tolerance; Process group.