Ensemble Reference Manual

Mark Hayden, Ohad Rodeh *
Copyright © 1997 Cornell University, 2000 Hebrew University

Abstract: Ensemble is a reliable group communication toolkit implemented in the Objective Caml programming language. Throughout, we attempt to follow a design that supports a simple compilation of the protocols to C. Two intermediate steps have recently been taken in this direction: 1) the construction of a native C interface for Ensemble (CE), and 2) an implementation in C of the core Ensemble system (available from www.northforknet.com).

1   Introduction

This document attempts to serve several goals. It is intended to be:
Documentation TODO list:

Part I
The Ensemble Architecture

2   Identifiers

Ensemble uses a variety of identifiers for a variety of different purposes. Here we summarize the important ones and describe what they are used for. Their type definitions can be found in the type directory, in the file corresponding to each name. Look in type/README for an up-to-date listing of these files. Most of the identifiers support a similar interface for a variety of operations. Several of the identifiers are opaque, in the sense that the interface hides the actual structure of the identifier. All identifiers have a string_of_id function defined, which gives a human-readable description of their contents.
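As an illustration of the identifier pattern described above, the following minimal OCaml sketch shows an opaque identifier module with a string_of_id function. The module name Eid and its fields are hypothetical stand-ins, not Ensemble's actual code: the signature hides the representation, exactly as the opaque identifiers do.

```ocaml
(* Hypothetical sketch of an opaque identifier module: the signature
   hides the representation, and string_of_id gives a human-readable
   description of the contents. *)
module Eid : sig
  type t                          (* opaque identifier *)
  val create : string -> int -> t (* host name and a local counter *)
  val string_of_id : t -> string
end = struct
  type t = { host : string; uniq : int }
  let create host uniq = { host; uniq }
  let string_of_id id = Printf.sprintf "{Eid:%s:%d}" id.host id.uniq
end

let () =
  let id = Eid.create "myhost" 42 in
  print_endline (Eid.string_of_id id)
```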
Changes from Horus

2.1   Endpoint Identifiers

Endpoint identifiers are unique names for communication endpoints. A single process can create any number of local endpoint identifiers, each of which is guaranteed to be unique (within some limits). A process can have multiple endpoints in a single group. An endpoint can be a member of multiple groups. However, the endpoints in a group must be distinct.

2.2   Group Identifiers

Group identifiers serve as unique names of communication groups. They do not contain addressing information. The exception to this rule is that groups communicating via Deering multicast choose a random multicast address by taking a hash of the group address. Processes can create any number of local group identifiers, each of which is guaranteed to be unique (within some limits).

2.3   View Identifiers

View identifiers are unique identifiers of group views. Whenever communication protocols proceed through a view change, the resulting view is given a new view identifier. These are made unique by pairing the coordinator of the group with a logical timestamp that is advanced whenever a view change happens. Although two partitions of a group may share the same time stamp, they will have different coordinators.
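The uniqueness argument above can be made concrete with a small model. The sketch below uses hypothetical types (not Ensemble's actual View module): a view identifier is a pair of a logical timestamp and the coordinator's endpoint, so two partitions may share a timestamp but, having different coordinators, still get distinct identifiers.

```ocaml
(* Hypothetical model of a view identifier: a logical timestamp
   paired with the coordinator's endpoint name.  Two partitions may
   share a timestamp but not a coordinator, so the pair is unique. *)
type view_id = { ltime : int; coord : string }

(* The timestamp is advanced on every view change. *)
let next_view_id prev coord = { ltime = prev.ltime + 1; coord }

let () =
  let v0 = { ltime = 0; coord = "a" } in
  let pa = next_view_id v0 "a" in   (* partition led by a *)
  let pb = next_view_id v0 "b" in   (* partition led by b *)
  (* Same timestamp, different coordinators: identifiers differ. *)
  assert (pa.ltime = pb.ltime && pa <> pb)
```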

2.4   Connection Identifiers

Connection identifiers are used to route messages to their precise destinations. They specify the exact destination endpoint or group, the view identifier, the protocol stack to deliver to, the type of protocol being used to send the message, as well as several other bits of information. Typically, endpoint or group identifiers are used to send messages to the correct processes, and connection identifiers are used to route messages to the exact destination within a process. Messages usually contain a connection identifier as a ``header'' but do not contain endpoint identifiers or group identifiers, except as subfields of a connection identifier.

2.5   Protocol Identifiers

There is a one-to-one relationship between the standard protocol stacks of Ensemble and protocol identifiers. Applications select the protocol to use by specifying the protocol id of that stack. Having identifiers for protocols is convenient because they can be passed around in messages and have equality comparisons made on them, whereas the actual protocol stacks cannot.

2.6   Mode identifiers

Each communication domain has a corresponding mode identifier used to specify that domain.

2.7   Stack Identifiers

Stack identifiers are used to distinguish between the various domains that a protocol stack may be receiving messages through. Each of the various kinds of ``channels'' that protocol stacks use has a separate identifier. Currently, these are:
[Primary :] This is the primary communication channel for a protocol stack. This is normally where most messages are received.
[Bypass :] Messages sent via the optimized bypass protocols use this id.
[Gossip :] Messages sent by group merge protocols use this id.
[Unreliable :] This stack id is reserved for unreliable stacks.

3   The Event Module

Events in Ensemble are used for intra-stack communication, as opposed to inter-stack communication, for which messages are used. Currently, the event module is the only Ensemble-specific module that all layers use. Events contain a well-known set of fields which all layers use for communicating between themselves using a common event protocol. Learning this protocol is one of the harder parts of understanding Ensemble. In this section we describe the operations supported for events; in section 4 we describe the meaning of the various event types and their fields. We repeatedly refer the reader to the source code of the event module, both type/event.mli and type/event.ml. This is done to ensure that the information in this documentation does not fall out of date due to small changes in the event module.

Note that a certain number of the operations invalidate events passed as arguments to the function. This means that no further operations should be performed on the event after the function call. The purpose of this limitation is to allow multiple implementations of the event module with different memory allocation mechanisms. The default implementation of events is purely functional, and these rules can be violated without causing problems. Other implementations of the event module require that events be manipulated according to these rules, and yet other implementations trace event modifications to check that the rules are not violated. What this means is that protocol designers do not need to be concerned with allocation and deallocation issues, except in the final stages of development.

Currently, a reference counting scheme is used for handling message bodies, which form the bulk of the memory used in Ensemble. Reference counting is done by hand, and events that reference Io-vectors must be freed using the free function (see below). The rest of the event is allocated on the ML heap and is therefore freed automatically by the ML garbage collector.

3.1   Fields

Events are ML records with fixed sets of fields. We refer to type/event.mli for their type definitions and fields.

3.1.1   Extension fields

Events have a special field called the extension field. Uncommon fields are included in up events as a linked list of extensions to this field. The list of valid extensions is defined in type/event.mli by the type definition fields.

3.1.2   Event Types

Events have a ``type'' field (called typ to avoid clashes with the type keyword) in them which can take values from a fixed set of enumerated constants. For the enumerations of the type fields for events, we refer to type/event.mli for the type definitions for typ.

3.1.3   Field Specifiers

Events have defined for them a variant record called field. These are called field specifiers. There is a one-to-one relation between the fields in up and down events and the variants in the fields specifiers. As will be seen shortly, lists of field specifiers are passed to event constructor and modifier functions to specify the fields in an event to be modified and their values. This allows changes to an event to be specified incrementally.

3.2   Constructors

Events are constructed with the create function.

  (* Constructor *)
val create	: debug -> typ -> field list -> t

Create takes 3 arguments: a debug name, the event type, and a list of field specifiers. The return value of the constructor function is a valid event.
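To illustrate the specifier-list style of construction, the sketch below mimics the create interface with simplified stand-in types; the real typ, field, and t definitions live in type/event.mli, and the constants used here are only a subset.

```ocaml
(* Stand-in sketch of the create interface: a debug name, an event
   type, and a list of field specifiers.  The types are simplified
   stand-ins for those in type/event.mli. *)
type typ = ECast | ESend | ETimer
type field = Peer of int | NoTotal

type t = { typ : typ; peer : int; no_total : bool }

let create _debug typ fields =
  (* Fold the specifier list over a default event record. *)
  let apply ev = function
    | Peer rank -> { ev with peer = rank }
    | NoTotal   -> { ev with no_total = true }
  in
  List.fold_left apply { typ; peer = -1; no_total = false } fields

let () =
  (* A broadcast event with only the uncommon fields spelled out. *)
  let ev = create "EXAMPLE" ECast [Peer 3; NoTotal] in
  assert (ev.typ = ECast && ev.peer = 3 && ev.no_total)
```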

3.3   Special Constructors

type/event.ml defines some special-case constructors, for either performance or ease-of-coding reasons. All of these constructors are defined using the create function, or could be.

3.4   Modifiers

Events are modified with the set function.

  (* Modifier *)
val set		: debug -> t -> field list -> t

set takes 3 arguments: a debug name, the event to modify, and a list of field specifiers. The return value of set is a new event with the same fields as the original event, except for the changes in the specifier list.

3.5   Copiers

Events are copied with the copy function.

  (* Copier *)
val copy	: debug -> t -> t

Copy takes two arguments: a debug name and the event to copy. The return value is a new event with its fields set to the same values as the original.

3.6   Destructors

Events are released with the free function.

  (* Destructor *)
val free	: debug -> t -> unit

Free takes two arguments: a debug name and the event to release. The return value is the unit value.
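Putting the four operations together, a typical usage pattern inside a layer might look like the following sketch. All names and types here are illustrative stand-ins; like Ensemble's default event implementation, this one is purely functional, and free is where a real implementation would release Io-vector reference counts.

```ocaml
(* Illustrative event lifecycle: create, set, copy, free.  A purely
   functional stand-in for the interface in type/event.mli. *)
type t = { name : string; rank : int }

let create debug rank = { name = debug; rank }
let set _debug ev rank = { ev with rank }  (* argument event invalidated *)
let copy _debug ev = { ev with name = ev.name }
let free _debug _ev = ()                   (* would release iovec refs *)

let () =
  let ev  = create "LAYER" 0 in
  let ev' = set "LAYER" ev 5 in  (* ev must not be used past this point *)
  let dup = copy "LAYER" ev' in
  assert (dup.rank = 5);
  free "LAYER" ev';
  free "LAYER" dup
```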

4   Event protocol: Intra-stack communication

Ensemble embodies two forms of communication. The first is communication between protocol stacks in a group, using messages sent via some communication transport. The second is intra-stack communication between protocol layers sharing a protocol stack (see fig 1), using Ensemble events (see page ?? for an overview of Ensemble events). One use of events is for passing information directly related to messages (e.g., broadcast messages are usually associated with ECast events). However, events are also used for notifying layers of group membership changes, telling other layers about suspected failed members, synchronizing protocol layers for view changes, passing acknowledgment and stability information, alarm requests and timeouts, and so on.

In order for a set of protocol layers to harmoniously implement a higher-level protocol, they have to agree on what these various events ``mean,'' and in general follow what is called here the Ensemble event protocol. The layering in Ensemble is advantageous because it allows complex protocols to be decomposed into a set of smaller, more understandable protocols. However, layering also introduces the event protocol, which complicates the system by adding intra-stack communication (the event protocol) to inter-stack communication (normal message communication).

Be aware that this information may become out of date. Although the ``spirit'' of the information presented here is unlikely to change in drastic ways, always consider the possibility that this information does not exactly match that in type/event.ml and type/event.mli. Please alert us when such inconsistencies are discovered so they may be corrected.



Figure 1: Events are used for intra-stack communication: layers can only communicate with other layers by modifying events; layers never read or modify other layers' message headers. Messages are used for inter-stack communication: only messages are sent between group members; events are never sent between members.


The documentation of the event protocol proceeds as follows.

4.1   Event Types




    (* These events should have messages associated with them. *)
  | ECast				(* Multicast message *)
  | ESend				(* Pt2pt message *)
  | ESubCast				(* Multi-destination message *)
  | ECastUnrel				(* Unreliable multicast message *)
  | ESendUnrel				(* Unreliable pt2pt message *)
  | EMergeRequest			(* Request a merge *)
  | EMergeGranted			(* Grant a merge request *)
  | EOrphan				(* Message was orphaned *)

    (* These types do not have messages. *)
  | EAccount				(* Output accounting information *)
(*| EAck			      *)(* Acknowledge message *)
  | EAsync				(* Asynchronous application event *)
  | EBlock				(* Block the group *)
  | EBlockOk				(* Acknowledge blocking of group *)
  | EDump				(* Dump your state (debugging) *)
  | EElect				(* I am now the coordinator *)
  | EExit				(* Disable this stack *)
  | EFail				(* Fail some members *)
  | EGossipExt				(* Gossip message *)
  | EGossipExtDir			(* Gossip message directed at particular address *)
  | EInit				(* First event delivered *)
  | ELeave				(* A member wants to leave *)
  | ELostMessage			(* Member doesn't have a message *)
  | EMergeDenied			(* Deny a merge request *)
  | EMergeFailed			(* Merge request failed *)
  | EMigrate				(* Change my location *)
  | EPresent                            (* Members present in this view *)
  | EPrompt				(* Prompt a new view *)
  | EProtocol				(* Request a protocol switch *)
  | ERekey				(* Request a rekeying of the group *)
  | ERekeyPrcl				(* The rekey protocol events *)
  | ERekeyPrcl2				(*                           *)
  | EStable				(* Deliver stability down *)
  | EStableReq				(* Request for stability information *)
  | ESuspect				(* Member is suspected to be faulty *)
  | ESystemError			(* Something serious has happened *)
  | ETimer				(* Request a timer *)
  | EView				(* Notify that a new view is ready *)
  | EXferDone				(* Notify that a state transfer is complete *)
  | ESyncInfo
      (* Ohad, additions *)
  | ESecureMsg				(* Private Secure messaging *)
  | EChannelList			(* passing a list of secure-channels *)
  | EFlowBlock				(* Blocking/unblocking the application for flow control*)
(* Signature/Verification with PGP *)
  | EAuth

  | ESecChannelList                     (* The channel list held by the SECCHAN layer *)
  | ERekeyCleanup
  | ERekeyCommit 



Figure 2: Event typ type definition. Taken from type/event.mli.


This section describes the different types of events. See fig 2 for the source code of enumerated types. The behavior of a layer depends not only on the event type and its fields, but also on the direction from which it arrives. For example, an ESend event travels in the sender stack from the application down, and at the receiver from the bottom, up to the application. The sender and receiver layers behave quite differently depending on whether the message is sent or received. In what follows, we sometimes specifically include the event direction. Detailed Descriptions:

4.2   Event fields

Here we describe all the fields that appear in the events. The type definitions appear in fig ?? and fig 3. Default values for the fields appear in fig ??.


type field =
      (* Common fields *)
  | Type        of typ            (* type of the message*)
  | Peer        of rank           (* rank of sender/destination *)
  | Iov	        of Iovecl.t       (* payload of message *)
  | ApplMsg                       (* was this message generated by an appl? *)

      (* Uncommon fields *)
  | Address     of Addr.set	  (* new address for a member *)
  | Failures    of bool Arrayf.t  (* failed members *)
  | Presence    of bool Arrayf.t  (* members present in the current view *)
  | Suspects    of bool Arrayf.t  (* suspected members *)
  | SuspectReason of string	  (* reasons for suspicion *)
  | Stability   of seqno Arrayf.t (* stability vector *)
  | NumCasts    of seqno Arrayf.t (* number of casts seen *)
  | Contact     of Endpt.full * View.id option (* contact for a merge *)

      (* HEAL gossip *)  
  | HealGos     of Proto.id * View.id * Endpt.full * View.t * Hsys.inet list
  | SwitchGos   of Proto.id * View.id * Time.t  (* SWITCH gossip *)
  | ExchangeGos	of string		(* EXCHANGE gossip *)

      (* INTER gossip *)
  | MergeGos    of (Endpt.full * View.id option) * seqno * typ * View.state
  | ViewState   of View.state	(* state of next view *)
  | ProtoId     of Proto.id	(* protocol id (only for down events) *)
  | Time        of Time.t	(* current time *)
  | Alarm       of Time.t	(* for alarm requests *)
  | ApplCasts   of seqno Arrayf.t
  | ApplSends   of seqno Arrayf.t
  | DbgName     of string

      (* Flags *)
  | NoTotal                     (* message is not totally ordered*)
  | ServerOnly	                (* deliver only at servers *)
  | ClientOnly	                (* deliver only at clients *)
  | NoVsync
  | ForceVsync
  | Fragment	                (* Iovec has been fragmented *)

      (* Debugging *)
  | History     of string       (* debugging history *)

      (* Private Secure Messaging *)
  | SecureMsg of Buf.t
  | ChannelList of (rank * Security.key) list
	
      (* interaction between Mflow, Pt2ptw, Pt2ptwp and the application *)
  | FlowBlock of rank option * bool

      (* Signature/Verification with Auth *)
  | AuthData of Addr.set * Auth.data

      (* Information passing between optimized rekey layers *)
  | Tree    of bool * Tree.z
  | TreeAct of Tree.sent
  | AgreedKey of Security.key

      (* The channel list held by the SECCHAN layer *)
  | SecChannelList of Trans.rank list
  | SecStat of int              (* PERF figures for SECCHAN layer *)
  | RekeyFlag of bool           (* Do a cleanup or not *)



Figure 3: Fields for events. Taken from type/event.mli


4.2.1   Event Fields

4.3   Event fields and the ``types'' for which they are defined

[TODO]

4.4   Event Chains

We describe here common event sequences, or chains, in Ensemble. Event chains are sequences of alternating up and down events that ping-pong up and down the protocol stack, bouncing between the two end-layers of the chain. The end layers are typically the top- and bottom-most layers in the stack (e.g., TOP and BOT). The most common exceptions to this are the message chains (Sends and Broadcasts), which can have any layer as their top layer.

Note that these sequences are just prototypical. Necessarily, there are variations in which layers see which parts of these sequences. For example, consider the Failure Chain in a virtual synchrony stack with the GMP layer. The Failure Chain begins at the coordinator with an ESuspect event initiated at any layer in the stack. The BOT layer bounces this up as an ESuspect event. The top-most layer usually responds with an EFail event. The EFail event passes down through all the layers until it gets to the GMP layer. The GMP layer at the coordinator both passes the EFail event to the layer below and passes down an ECast event (thereby beginning a Broadcast Chain). At the coordinator, the EFail event bounces off of the BOT layer as an EFail event and then passes up to the top of the protocol stack. At the other members, an ECast event will be received at the GMP layer. The message is marked as a ``Fail'' message, so the GMP layers generate and send down an EFail event (just like the one at the coordinator), and this is also bounced off the BOT layer as an EFail event.

The lesson here is that the different layers in the different members of the group all essentially saw the same Failure Chain, but the exact sequencing was different. For example, the layers above the GMP layer at the members other than the coordinator did not see an EFail event. [TODO: give diagram] [TODO: Leave Chain]

4.4.1   Timer Chain

Request for a timer, followed by an alarm timeout.
ETimer down: timeout requested, sent down to BOT.
ETimer up: alarm generated in BOT at or after requested time, and sent up.

4.4.2   Send Chain

Send a pt2pt message followed by stability detection.
ESend down: send a pt2pt message down.
ESend up: destinations receive the message
EStable: message eventually becomes stable, and stability information is bounced off BOT.

4.4.3   Broadcast Chain

Broadcast of a message followed by stability detection.
ECast down: broadcast a message
ECast up: other members receive the broadcast
EStable: broadcast eventually becomes stable, and stability information is bounced off BOT

4.4.4   Failure Chain

Suspicion and ``failure'' of group members.
ESuspect down: suspicion of failures generated at any layer
ESuspect up: notification of suspicion of failures
EFail down: coord fails suspects
EFail up: all members get failure notice

4.4.5   Block Chain

Blocking of a group prior to a membership change.
ESuspect/EMergeRequest up: reasons for coord blocking
EBlock down: coord starts blocking
EBlock up: all members get block notice
EBlockOk down: all members reply to block notice
EBlockOk up: coord gets block OK notice
EMergeRequest or EView down: coord begins Merge or View chain

4.4.6   View Chain

Installation of a new view, followed by garbage collection of the old view.
EView down: coord begins view chain (after failed merge or blocking)
EView up: all members get view notice
EExit down: protocol stacks are ready for garbage collection [todo]
EExit up: protocol stacks are garbage collected

4.4.7   Merge Chain (successful)

Partition A merges with partition B, followed by garbage collection of the old view. We focus on partition A and only give a subset of events in partition B.
EMergeRequest down: coord A begins merge chain (after blocking)
EMergeRequest up: coord B gets merge request
EMergeGranted down: coord B replies to merge request
EMergeGranted up: coord A gets merge OK notice
EView down: coord A installs new view for coord B
EView up: all members in group A get view notice
EExit down: protocol stacks are ready for garbage collection
EExit up: protocol stacks are garbage collected
[TODO: EExit above is currently ELeave]

4.4.8   Merge Chain (failed)

Failed merge, followed by installation of a view.
EMergeRequest down: coord begins merge chain (after blocking)
EMergeFailed or
EMergeDenied up: coord detects merge problem
EView down: coord begins view chain

5   Layer Execution Model

5.1   Callbacks

Layers are implemented as a set of callbacks that handle events passed to the layer by Ensemble from some other protocol layer. These callbacks can in turn call callbacks that Ensemble has provided the layer for passing events onto other layers. Logically, a layer is initialized with one callback for passing events to the layer above it, and another callback for passing events to the layer below it. After initialization, a layer returns two callbacks to the system: one for handling events from the layer below it and another for handling events from the layer above it. In practice, these ``logical callbacks'' are subdivided into several callbacks that handle the cases where different kinds of messages are attached to events.
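The logical callback structure can be sketched as follows. This is an illustrative shape only, not Ensemble's actual Layer signature: a layer is initialized with the two emit callbacks and returns its two receive handlers.

```ocaml
(* Illustrative shape of a layer: it is given callbacks for emitting
   events up and down, and returns callbacks for handling events
   arriving from the layers below and above.  'e is a stand-in
   event type. *)
type 'e handlers = {
  up   : 'e -> unit;   (* handle an event arriving from below *)
  down : 'e -> unit;   (* handle an event arriving from above *)
}

let make_layer ~(emit_up : 'e -> unit) ~(emit_down : 'e -> unit)
    : 'e handlers =
  (* A trivial pass-through layer: forward everything unchanged. *)
  { up   = (fun ev -> emit_up ev);
    down = (fun ev -> emit_down ev) }

let () =
  let got = ref [] in
  let layer =
    make_layer
      ~emit_up:(fun e -> got := ("up:" ^ e) :: !got)
      ~emit_down:(fun e -> got := ("dn:" ^ e) :: !got)
  in
  layer.up "ECast"; layer.down "ETimer";
  assert (List.rev !got = ["up:ECast"; "dn:ETimer"])
```

In practice, as the text notes, each of these two logical receive handlers is further split according to whether a message (and header) accompanies the event.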



Figure 4: Layers are executed as I/O automata, with pairs of FIFO event queues connecting adjacent layers.


5.2   Ordering Properties

The system infrastructure that handles scheduling of protocol layers and the passing of events between protocol layers provides the following guarantees:
[FIFO ordering] : The infrastructure guarantees that events passed between two layers are delivered in order. For instance, if layer A is stacked above layer B, then all events layer A passes to layer B are guaranteed to be delivered in FIFO order to layer B. In addition, events that layer B passes up to layer A are guaranteed to be delivered in FIFO order to layer A. Note that these ordering properties allow the scheduler some flexibility in scheduling because they only specify the ordering of events in a single channel between a pair of layers.
[no concurrency] : The system infrastructure that hands events to layers through the callbacks never invokes a layer's callbacks concurrently. It guarantees that at most one invocation of any callback is executing at a time and that the current callback returns before another callback is made to the protocol layer. See fig 4 for a diagram of layer automata. Note that although a single layer may not be executed concurrently, different layers may be executed concurrently by a scheduler.
The execution of a protocol stack can be visualized as a set of protocol layers executing with a pair of event queues between each pair of layers: one queue for events going up and another for events going down. The protocol layers are then automata that are repeatedly scheduled to take a pending event from one of the adjacent incoming queues, execute it, and then deposit zero or more events into the two adjacent outgoing queues (see fig 4).
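The queue-pair model can be sketched with the standard Queue module. This is a toy scheduler under the stated assumptions, not Ensemble's actual infrastructure, but it respects both guarantees: FIFO delivery within each channel, and each handler running to completion before another is invoked.

```ocaml
(* Toy model of the execution infrastructure: an adjacent pair of
   layers is connected by a pair of FIFO queues, and a scheduler
   runs one handler at a time. *)
let up_q : string Queue.t = Queue.create ()  (* events going up *)
let dn_q : string Queue.t = Queue.create ()  (* events going down *)

let delivered = ref []

(* One scheduling step: take a pending event from one channel and
   run the layer's handler to completion. *)
let step q tag =
  if not (Queue.is_empty q) then
    delivered := (tag ^ Queue.pop q) :: !delivered

let () =
  Queue.push "ECast" up_q; Queue.push "ESend" up_q;
  Queue.push "ETimer" dn_q;
  step up_q "up:"; step up_q "up:"; step dn_q "dn:";
  (* FIFO order is preserved within each channel. *)
  assert (List.rev !delivered = ["up:ECast"; "up:ESend"; "dn:ETimer"])
```

Note that the FIFO guarantee is per channel only, which is what leaves the scheduler free to interleave steps on different channels.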

6   Layer Anatomy: what are the pieces of a layer?

This is a description of the standard pieces of an Ensemble layer. It is meant to serve as a general introduction to the standard ``idioms'' that appear in layers. Because all layers follow the same general structure, we present a single documentation of that structure, so that comments in a layer describe what is particular to that layer rather than repeating the features each has in common with all the others. Comments on additional information that would be useful here would be appreciated.

6.1   Design Goals

A design goal of the protocol layers is to include as little Ensemble-specific infrastructure in the layers as possible. For instance, none of the layers embody notions of synchronization, message operations, or event scheduling. In fact, the only Ensemble-specific modules used by layers are the Event and View modules.

6.2   Notes

Some general notes on layers:

6.3   Values and Types

Listed below are the values and types commonly found in a layer, listed in the usual order of occurrence. For each object we give a description of its typical use and whether or not it is exported out of the layer.

7   Event Handlers: Standard

Logically, a protocol has two incoming event handlers (one each above and below) and two outgoing event handlers (one each above and below). In practice, because some events have messages and others do not, these handlers are split up into several extra handlers. The breakdown of the 4 logical handlers into 10 actual handlers is done for compatibility with the ML typechecker. Typechecking is used extensively to guarantee that layers receive messages of the same type they send. This is a very useful property because it prevents a large class of programming errors.

name    in/out  up/dn   above/below  message?  header?
upnm    out     up      above        no        no
up      out     up      above        yes       no
dnnm    out     dn      below        no        no
dnlm    out     dn      below        no        yes
dn      out     dn      below        yes       yes
upnm    in      up      below        no        no
uplm    in      up      below        no        yes
up      in      up      below        yes       yes
dnnm    in      dn      above        no        no
dn      in      dn      above        yes       no

Table 1: The 10 standard event handlers.





Figure 5: Diagram of the 10 standard event handlers. Note that the ABOVE layer has a similar interface above it as the BELOW layer. Likewise with the interface beneath the BELOW layer.


In the standard configuration, each layer has 10 handlers. A handler is uniquely specified by a set of characteristics: whether it is an incoming or outgoing handler, a handler for up events or down events, a handler for communication with the layer above or for the layer below, whether it has an associated message, and whether it has an associated header. See table 1 for an enumeration of the 10 handlers. Of the 10 handlers, 5 are outgoing and 5 are incoming; 5 are up event handlers and 5 are down event handlers; 4 are for event communication with the layer below and 6 are for event communication with the layer above. These are depicted in fig 5. The names of the handlers have two parts. The first specifies the sort of event the handler is called with (``up'' or ``dn''). The second specifies the sort of message that is associated with the event and may be either ``'' (nothing, the default case), ``lm'' (for local message), or ``nm'' (for no message), which correspond to:
[nothing:] Events with associated messages, where the message was created by a layer above this layer. This layer was not the first layer to push a header onto the message and will not be the last layer to pop its header off the message.
[``lm'':] Events with associated messages, where the message was created by this layer. This was the first layer to push a header onto the message and is the last layer to pop its header off of the message.
[``nm'':] Corresponds to events without associated messages. These handlers always take a single argument which is either an up event or a down event.

8   The Ensemble Security Architecture (by Ohad Rodeh)

This section describes the Ensemble security architecture. We believe that Ensemble completely supports the fortress security model. Only trusted, authorized members are allowed into the group. Once a member is allowed into a group, it is completely trusted. Ensemble is not secure against attacks from members that have been admitted into the group: any group member can break the protocols by sending bad messages.

The goal of our architecture is to secure group messages from tampering and eavesdropping. To this end, all group messages are signed and (possibly) encrypted. While it is possible to use public key cryptography for this task, we find this approach unacceptably expensive. Since all group members are mutually trusted, we share a symmetric encryption key and a MAC key among them. These keys are used to seal all group messages, making the seal/unseal operation very fast. As a shorthand, we shall refer to the key pair as the group key. Using a group key raises two challenges:
A rekeying mechanism:
allowing secure replacement of the current group key once it is deemed insecure, or if there is danger that it was leaked to the adversary. Dissemination of the new key should be performed without relying on the old (compromised) group key.
Secure key agreement in a group:
i.e., a protocol that creates secure agreement among group members on a mutual group key.
We focus on benign failures and assume that authenticated members will not be corrupted. Byzantine fault-tolerant systems suffer from poor performance since they use costly protocols and make extensive use of public key cryptography. We believe that our failure model is sufficient for the needs of most practical applications.

The user may specify a security policy for an application. The policy specifies for each address whether it is trusted or not. Each application maintains its own policy; it is up to Ensemble to enforce it and to allow only mutually trusted members into the same subgroup. A policy allows an application to specify the members that it trusts and to exclude untrusted members from its subgroup.
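A policy can be thought of as a per-application predicate over addresses. The sketch below is a hypothetical illustration of that idea only (Ensemble's real policies operate on Addr.set values, and the function names here are invented); it shows the mutual-trust check that gates subgroup membership.

```ocaml
(* Hypothetical illustration of a security policy: each application
   has a predicate deciding, per address, whether a member is
   trusted.  Only mutually trusted members share a subgroup. *)
type address = string

let policy ~trusted : address -> bool =
  fun addr -> List.mem addr trusted

(* p and q may share a subgroup only if each trusts the other. *)
let mutually_trusted p q trusts =
  (trusts p) q && (trusts q) p

let () =
  let trusts = function
    | "a" -> policy ~trusted:["a"; "b"]      (* a trusts only a, b *)
    | _   -> policy ~trusted:["a"; "b"; "c"]
  in
  assert (mutually_trusted "a" "b" trusts);
  assert (not (mutually_trusted "a" "c" trusts))
```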

8.1   Cryptographic Infrastructure

Our design supports the use of a variety of authentication and encryption mechanisms. Ensemble has been interfaced with the OpenSSL (see http://www.openssl.org/) cryptographic library, the PGP authentication engine, and the Kerberos centralized authentication system (this is out of date). By default, messages are signed using MD5, encrypted using RC4, and authentication is performed using PGP. Because these three functionalities are carried out independently, any combination of supported authentication, signature, and encryption systems can be used. A future goal is to allow multiple systems to be supported concurrently. Under such a system, processes would be able to compare the systems they have support for and select any system that both support.

8.2   Rekeying

Ensemble rekeying uses the notion of secure channels. A secure channel between endpoints p and q is essentially a symmetric encryption key kpq agreed upon between p and q. This key is known only to p and q and is different from the group key. Whenever confidential information needs to be passed between p and q, it is encrypted using kpq and sent using Ensemble reliable point-to-point messaging.

The basic rekeying protocol supported uses a binary tree structure. In order to rekey the group, a complete binary tree spanning the group is created. Member 0 is the father of 1 and 2, 1 is the father of 3 and 4, etc. The leader chooses a new key knew and sends it securely to 1 and 2; member 1 sends knew securely down to 3 and 4, etc. When a tree leaf receives a new key, it sends up a clear-text acknowledgment. When acknowledgments reach the leader (0), it prompts the group for a view change in which the new key will be used. knew is disseminated confidentially using secure channels. We cannot use the old key to protect knew since the old key is assumed to be compromised.

Secure channels are created upon demand by Ensemble; they are then cached for future use. Creating a secure channel is a costly operation, taking hundreds of milliseconds even on fast CPUs. It is performed in the background so as not to block the application.

Recently, we have added faster rekeying protocols to the system. A complete implementation of the dWGL algorithm has been added, in the form of several layers. There are two new algorithms, rekey_dt and rekey_diam. They are described in the reference manual.
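The dissemination tree described above follows the usual array embedding of a complete binary tree: member i forwards the new key to members 2i+1 and 2i+2 when they exist. A small sketch (the helper name children is hypothetical):

```ocaml
(* Children of member i in the complete binary tree used for key
   dissemination: member 0 is the father of 1 and 2, member 1 of
   3 and 4, and so on.  n is the group size. *)
let children i n =
  List.filter (fun c -> c < n) [ (2 * i) + 1; (2 * i) + 2 ]

let () =
  assert (children 0 7 = [1; 2]);
  assert (children 1 7 = [3; 4]);
  assert (children 3 7 = []);   (* a leaf: sends back an ack instead *)
  assert (children 2 6 = [5])
```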

8.3   A secure stack

The security architecture is composed of five layers:
[Exchange:] Secure key agreement. This layer is responsible for securely handing the group key to newly joining group components. Component leaders mutually authenticate and check authorization policies before handing the group key securely between them.
[Encrypt:] Chain-encryption of all user messages.
[Secchan:] Creates and manages a cache of secure channels.
[PerfRekey:] Handles common rekeying tasks. For example, after a new key has been disseminated to the group, acknowledgments must be collected from all group members.
[Rekey_dt:] Binary tree rekeying. Rekeying a group is very fast once secure channels have been set up; we logged an average rekey operation for a 20-member group at 100 milliseconds. Rekey_dt assumes that the Secchan and PerfRekey layers are in the stack.
The regular and secure Ensemble stacks are depicted in Table 2. The Top and Bottom layers cap the stack from both sides. The membership layers compute the current set of live and connected machines; the Appl_top layer interfaces with the application and provides reliable send and receive capabilities for point-to-point and multicast messages. The RFifo layers provide reliable per-source FIFO messaging. The Exchange and Rekey layers are related to the membership layers, since the group key is part of the view information. The Encrypt layer encrypts all user messages, hence it sits below the Appl_top layer.

  Layer       Regular   Security addition   Notes
  ---------   -------   -----------------   -----
  Top            x
  Exchange                     x
  Rekey_dt                     x
  PerfRekey                    x
  Secchan                      x
  Gmp            x
  Top_appl       x                          Interface to the application
  Encrypt                      x
  Rfifo          x
  Bottom         x


Table 2: The Ensemble stack. On the left is the default stack that includes an application interface, the membership algorithm and a reliable-fifo module. To the right is a secure stack with the Exchange, Encrypt, Rekey_dt, and Secchan layers in place.


8.4   Security events

There are three security events to note: The Vs_key field was added to the view state to allow for group keys. It holds the current group key.

8.5   Using Security

Ensemble has four security properties:
  1. Rekey: Add rekeying to the stack.
  2. OptRekey: Use the dWGL algorithm for rekeying.
  3. Auth: Authenticate all messages.
  4. Privacy: Encrypt all user messages.
An application wishing for strong security should choose all of the above properties in its stack and perform a Control Rekey action once every several hours. Note that there are two flavors of application rekeying: An example command line, for application appl, with PGP user name James_Joyce:

appl -add_prop Auth -add_prop Privacy -key 01234567012345670123456701234567 
     -pgp James_Joyce

In order to add authorization to the stack, thereby controlling which members are allowed to join a group, one must do:

  val policy_function : Addr.set -> bool
  val interface : Appl_intf.New.t

  let state = Layer.new_state interface in
  let state = Layer.set_exchange (Some policy_function) state in 
  Appl.config_new_full state (ls,vs)

Instead of simply:

  Appl.config_new interface state (ls,vs)

Authorization is not tied to the security architecture; regular stacks can also perform authorization. Control of joining members is delegated to the group leader, which checks its authorization list and allows or disallows the join. On every view change the authorization list is checked, and existing members that are no longer authorized are removed. In practice, if an application changes its authorization list dynamically, it must perform a Prompt and a Rekey whenever such a change occurs.

8.6   Checking that things work

To check that PGP has been installed correctly, that Ensemble can talk to it without fault, and that the cryptographic support is running correctly, one can use the armadillo demo program. To set up PGP, one must create principals and corresponding public and private keys; these are installed by PGP in its local key repository. The basic PGP key-generation command is:

zigzag ~/ensemble/demo> pgp -kg

To work with the armadillo demo, you'll need to create principals in the group o1, o2, .... Armadillo creates a set of endpoints and then runs a test between them. To this end, the program has a ``-n'' flag that specifies the number of endpoints to use. For example, the command line armadillo -n 2 ... tells armadillo to use a two-member configuration. These members will have principal names o1 and o2, respectively. To view the set of principals in the repository do:

zigzag ~/ensemble/demo> pgp -kv
pub   512/2F045569 1998/06/15 o2
pub   512/A2358EED 1998/06/15 o1
2 matching keys found.

To check that PGP runs correctly do:

zigzag ~/ensemble/demo> armadillo -prog pgp 
PGP works
check_background
got a ticket
background PGP works

If something is broken, the PGP execution trace can be viewed using:

zigzag ~/ensemble/demo> armadillo -prog pgp  -trace PGP 

If more information is required, use the flags -trace PGP1 -trace PGP2. The default version of PGP that Ensemble works with is 2.6. If, however, you'd like to use a different version, set the environment variable ENS_PGP_VERSION to the version number. Versions 5.0 and 6.5 are also supported. To check that OpenSSL is installed correctly, one can do:

zigzag ~/ensemble/demo> armadillo -prog perf

For a wider-scale test, use the exchange test. This test creates a set of endpoints with principal names o1, o2, ..., and merges them securely into one group. Each group merge requires that group leaders properly authenticate themselves using PGP. The test starts with each member in a component containing only itself, and ends when a single secure component has been created. Note that it will keep running until it reaches the timeout, which is set by default to 20 seconds. To invoke the test do:

zigzag ~/ensemble/demo> armadillo -prog exchange -n 2 -real_pgp

If something goes wrong, a trace of the authentication protocol is available through -trace EXCHANGE. The -real_pgp flag tells armadillo not to simulate PGP. Simulation is the default mode for armadillo, since we use it to test communication protocol correctness. To check that rekeying works do:

zigzag ~/ensemble/demo> armadillo -prog rekey  -n 5

To test security with two separate processes do the following:

zigzag ~/ensemble/demo> gossip &
zigzag ~/ensemble/demo> mtalk -key 11112222333344441111222233334444 
                  -add_prop Auth -pgp o1
zigzag ~/ensemble/demo> mtalk -key 01234567012345670123456701234567 
                 -add_prop Auth -pgp o2

The two mtalk processes should authenticate each other and merge. The three command line arguments specify:

8.7   Using security from HOT and EJava

The security options have been added to the HOT interface. For a demonstration program, see hot_sec_test.c in the hot subdirectory. The only steps one needs to take are: (1) set the program's principal name; (2) set the security bit. Both of these options are specified in the join-options structure. For example, in hot_sec_test.c:

static void join(
		 int i,
		 char **argv
)
{
  state *s ;
  s = (state *) hot_mem_Alloc(memory, sizeof(*s)) ;
  memset(s,0,sizeof(*s)) ;

  s->status = BOGUS;
  s->magic = HOT_TEST_MAGIC;

  ...

  strcpy(s->jops.transports, "UDP");
  strcpy(s->jops.group_name, "HOT_test");

  ...

  /* Set the principal name and the security bit.
   */
  sprintf(s->jops.princ, "Pgp(o%d)",i);
  s->jops.secure = 1;

  ...

  /* Join the group.
   */
  err = hot_ens_Join(&s->jops, &s->gctx);
  if (err != HOT_OK)
    hot_sys_Panic(hot_err_ErrString(err));
}


EJava is interfaced with HOT, so they share a similar interface. Note that the outboard mode, supported by both interfaces, is insecure: the messages passing on the TCP connection between the client and server are neither MACed nor encrypted. Therefore, outboard mode can be used securely only when client and server are situated on the same machine.

Part II
The Ensemble Protocols

9   Layers and Stacks

We document a subset of the Ensemble layers and stacks (compositions of layers) in this section. This documentation is intended to be largely independent of the implementation language. The layers are currently listed in the order, bottom-up, of their use in the VSYNC stack. Each layer (or stack) has these items in its documentation:

9.1   ANYLAYER

The name of the layer followed by a general description of its purpose.
Protocol
 

A description of the protocol implemented by the layer.
Parameters
 

Properties
 

Notes
 

Sources
 

The source files for the ML implementation of the layer.
Generated Events
 

A list of event types generated by the layer. In the future, this field will contain more information, such as what event types are examined by the layer (instead of being blindly passed on). Hopefully, this information will eventually be generated automatically.
Testing
 

9.2   CREDIT

This layer implements a credit based flow control.
Protocol
 

On initialization, the sender informs receivers how many credits it wants to keep in stock. A receiver sends credits whenever it finds that the sender is low on credits, either explicitly through a sender's request or implicitly through its local accounting. A credit is one-time-use only. The sender is allowed to send a message only if it has a credit available; if it does not, the message is buffered. Buffered messages are sent when new credits arrive. Credits are piggybacked onto data messages whenever there is an opportunity to do so, to save bandwidth.
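A minimal sender-side sketch of this scheme, with illustrative types of our own (not Ensemble's actual credit.ml):

```ocaml
(* Illustrative sender-side sketch of credit-based flow control.
   One credit is consumed per message; with no credit in hand,
   messages are queued and flushed when fresh credits arrive. *)

type 'a sender = {
  mutable credits : int;   (* one-time-use credits in hand *)
  buffer : 'a Queue.t;     (* messages awaiting credit *)
}

let make credits = { credits; buffer = Queue.create () }

(* Try to transmit [msg]; returns the messages actually put on the
   wire (empty if we are out of credit and had to buffer). *)
let send s msg =
  if s.credits > 0 then begin
    s.credits <- s.credits - 1;
    [ msg ]
  end else begin
    Queue.add msg s.buffer;
    []
  end

(* New credits arrived from a receiver: flush as many buffered
   messages as the credit allows. *)
let recv_credit s n =
  s.credits <- s.credits + n;
  let out = ref [] in
  while s.credits > 0 && not (Queue.is_empty s.buffer) do
    s.credits <- s.credits - 1;
    out := Queue.take s.buffer :: !out
  done;
  List.rev !out

let () =
  let s = make 1 in
  assert (send s "a" = ["a"]);      (* credit available: goes out *)
  assert (send s "b" = []);         (* out of credit: buffered *)
  assert (recv_credit s 2 = ["b"])  (* flushed on new credit *)
```

The real layer additionally piggybacks credit onto data messages; that is omitted here for brevity.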
Parameters
 

Notes
 

Sources
 

layers/credit.ml
Last updated: Fri Mar 29, 1996

9.3   RATE

This layer implements a sender rate based flow control. Multicast messages from each sender are sent at a rate not exceeding some prescribed value.
Protocol
 

All the messages to be sent are buffered initially. Buffered messages are sent on periodic timeouts that are set based on the sender's rate.
Parameters
 

Notes
 

Sources
 

layers/rate.ml
This layer and its documentation were written by Takako Hickey.

9.4   BOTTOM

Not surprisingly, the BOTTOM layer is the bottommost layer in an Ensemble protocol stack. It interacts directly with the communication transport by sending/receiving messages and scheduling/handling timeouts. The properties implemented are all local to the protocol stack in which the layer exists: i.e., a Dn(EFail) event causes failed members to be removed from the local view of the group, but no failure message to be sent out; it is assumed that some other layer actually informs the other members of the failure.
Protocol
 

None
Parameters
 

Properties
 

Sources
 

layers/bottom.ml
Generated Events
 

Up(EBlock)
Up(ECast)
Up(EExit)
Up(EFail)
Up(EStable)
Up(EMergeDenied)
Up(EMergeGranted)
Up(EMergeRequest)
Up(ESend)
Up(ESuspect)
Up(ETimer)
Up(EView)
Testing
 

9.5   CAUSAL

The CAUSAL layer implements causally ordered multicast. It assumes reliable, FIFO-ordered messaging from the layers below.
Protocol
 

The protocol has two versions: full and compressed vectors. First we explain the simple version, which uses full vectors; then we explain how these vectors are compressed.

Each outgoing message is appended with a causal vector. This vector records the last causally delivered message from each member in the group. Each received message is checked for deliverability: it may be delivered only if all messages it causally follows, according to its causal vector, have been delivered. If it is not yet deliverable, it is delayed in the layer until delivery is possible. A view change erases all delayed messages, since they can never become deliverable.

Causal vectors grow with the group size, so they must be compressed in order for this protocol to scale. The compression we use is derived from the Transis system. We demonstrate with an example: assume the membership includes three processes p, q, and r. Process p sends message m_{p,1}; q sends m_{q,1}, causally following m_{p,1}; and r sends m_{r,1}, causally following m_{q,1}. The causal vector for m_{r,1} is [1|1|1]. There is redundancy in this vector, since it is clear that m_{r,1} follows m_{r,0}. Furthermore, since m_{q,1} follows m_{p,1}, we may omit stating that m_{r,1} follows m_{p,1}. To conclude, it suffices to state that m_{r,1} follows m_{q,1}. Using such optimizations, causal vectors may be compressed considerably.
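With full vectors, the deliverability test is a pointwise comparison. The following sketch is ours; the numbering convention (a message's own entry holds its sequence number, counted from 1, with FIFO order guaranteed by the layers below) is an assumption made for illustration:

```ocaml
(* Sketch of the full-vector deliverability test.  [delivered.(i)]
   counts messages from member i delivered so far; [v] is the causal
   vector of a message from [sender], where [v.(i)] counts the
   messages from i that the message causally follows, and
   [v.(sender)] is the message's own sequence number (from 1).
   These conventions are ours, for illustration. *)

let deliverable delivered sender v =
  let ok = ref true in
  Array.iteri (fun i vi ->
    if i = sender then begin
      (* FIFO from below: must be the next message from the sender *)
      if delivered.(i) <> vi - 1 then ok := false
    end else if delivered.(i) < vi then ok := false
  ) v;
  !ok

let () =
  (* With p=0, q=1, r=2: m_{r,1} carrying [1|1|1] is deliverable
     once m_{p,1} and m_{q,1} have been delivered ... *)
  assert (deliverable [| 1; 1; 0 |] 2 [| 1; 1; 1 |]);
  (* ... but not while m_{q,1} is still missing. *)
  assert (not (deliverable [| 1; 0; 0 |] 2 [| 1; 1; 1 |]))
```

A message failing this test is delayed in the layer and retested as earlier messages are delivered.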
Sources
 

layers/causal.ml
Testing
 

This layer and its documentation were written by Ohad Rodeh.

9.6   ELECT

This layer implements a leader election protocol. It determines when a member should become the coordinator. Election is done by delivering a Dn(EElect) event at the new coordinator.
Protocol
 

When a member suspects all lower ranked members of being faulty, that member elects itself as coordinator.
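This condition can be sketched as a predicate over the set of suspected ranks; the helper below is our own illustration, not the layer's actual code:

```ocaml
(* A member of rank [rank] elects itself coordinator exactly when
   every lower-ranked member is suspected.  [suspects] is the list of
   suspected ranks.  Illustrative helper only. *)

let should_elect rank suspects =
  let rec all_lower r =
    r >= rank || (List.mem r suspects && all_lower (r + 1))
  in
  all_lower 0

let () =
  (* Rank 0 has no lower-ranked members, so it is always eligible. *)
  assert (should_elect 0 []);
  (* Rank 2 elects itself only once both 0 and 1 are suspected. *)
  assert (should_elect 2 [0; 1]);
  assert (not (should_elect 2 [0]))
```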
Parameters
 

Properties
 

Sources
 

layers/elect.ml
Generated Events
 

Dn(EElect)
Testing
 

9.7   ENCRYPT

This layer encrypts application data for privacy. Uses keys in the view state record. Authentication needs to be provided by the lower layers in the system. The protocol headers are not encrypted. This layer must reside above FIFO layers for sending and receiving because it uses encryption contexts whereby the encryption of a message is dependent on the previous messages from this member. These contexts are dropped at the end of a view. A smarter protocol would try to maintain them, as they improve the quality of the encryption.
Protocol
 

Does chained encryption on the message payload in the iov field of events. Each member keeps track of the encryption state for all incoming and outgoing point-to-point and multicast channels. Messages marked Unreliable are not encrypted (these should not be application messages).
Parameters
 

Properties
 

Sources
 

layers/encrypt.ml
Generated Events
 

None
Testing
 

9.8   HEAL

This protocol is used to merge partitions of a group.
Protocol
 

The coordinator occasionally broadcasts the existence of its partition via Dn(EGossipExt) events. These are delivered unreliably to the coordinators of other partitions. If a coordinator decides to merge partitions, it prompts a view change and inserts the name of the remote coordinator in the Up(EBlockOk) event. The INTER protocol takes over from there. Merge cycles are prevented by only allowing merges from smaller view ids to larger view ids.
Parameters
 

Properties
 

Sources
 

layers/heal.ml
Generated Events
 

Up(EPrompt)
Dn(EGossipExt)
Testing
 

9.9   INTER

This protocol handles view changes that involve more than one partition (see also INTRA).
Protocol
 

Group merges are the more complicated part of the group membership protocol. However, we constrain the problem so that: The merge protocol works as follows:
  1. The merging coordinator blocks its group.
  2. The merging coordinator sends a merge request to the remote group's coordinator.
  3. The remote coordinator blocks its group.
  4. The remote coordinator installs a new view (with the mergers in it) and sends the view to the merging coordinator (through a merge-granted message).
  5. The merging coordinator installs the view in its group.
If the merging coordinator times out on the remote coordinator, it immediately installs a new view in its partition (without the other members even finding out about the merge attempt).
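The handshake can be modeled with a small message type. The type and constructor names below are ours; the real layer works in terms of EMerge/EMergeGranted/EMergeDenied events:

```ocaml
(* Illustrative model of the two-coordinator merge handshake.
   Endpoints are modeled as strings; the real layer uses Ensemble
   endpoint identifiers and events. *)

type merge_msg =
  | MergeRequest of string list   (* endpoints wishing to merge *)
  | MergeGranted of string list   (* the new combined view *)
  | MergeDenied

(* Remote coordinator's side: on a merge request, block the local
   group and grant a new view containing both memberships. *)
let handle_request local_view = function
  | MergeRequest mergers -> MergeGranted (local_view @ mergers)
  | _ -> MergeDenied

let () =
  assert (handle_request ["a"; "b"] (MergeRequest ["c"])
          = MergeGranted ["a"; "b"; "c"])
```

Only the two coordinators exchange these messages; the other members learn of the merge when the new view is installed.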
Parameters
 

Properties
 

Sources
 

layers/inter.ml
Generated Events
 

Dn(EMerge)
Dn(EMergeDenied)
Dn(ESuspect)
Testing
 

9.10   INTRA

This layer manages group membership within a view (see also the INTER layer). There are three related tasks here:
Protocol
 

This is a relatively simple group membership protocol. We have done our best to resist the temptation to ``optimize'' special cases under which the group is ``unnecessarily'' partitioned. We also constrain the conditions under which operations such as merges can occur. The implementation does not ``touch'' any data messages: it only handles group membership changes. Furthermore, this protocol does not use any timeouts. Views and failures are forwarded via broadcast to the rest of the members. Other members accept the view/failure if they are consistent with their current representation of the group's state. Otherwise, the view/failure message is dropped and the sender is suspected of being problematic.
Parameters
 

Properties
 

Sources
 

layers/intra.ml
Generated Events
 

Dn(ECast)
Dn(EFail)
Dn(ESuspect)
Dn(EView)
Testing
 

9.11   LEAVE

This protocol has two tasks. (1) When a member really wants to leave a group, the LEAVE protocol tells the other members to suspect this member. (2) The leave protocol garbage collects old protocol stacks by initiating a Dn(ELeave) after getting an Up(EView) and then getting an Up(EStable) where everything is marked as being stable.
Protocol
 

Both protocols are simple. For leaving the group, a member broadcasts a Leave message, which causes the other members to deliver a Dn(ESuspect) event. Note that the other members will get the Leave message only after receiving all prior broadcast messages. The leaving member should probably stick around, however, until these messages have stabilized. Garbage collection is done by waiting until all broadcast messages are stable before delivering a local Dn(ELeave) event.
Parameters
 

Properties
 

Sources
 

layers/leave.ml
Generated Events
 

Dn(ELeave)
Testing
 

9.12   MERGE

This protocol provides reliable retransmissions of merge messages and failure detection of remote coordinators when merging.
Protocol
 

A simple retransmission protocol. A hash table is used to detect duplicate merge requests, which are dropped.
Parameters
 

Properties
 

Notes
 

Sources
 

layers/merge.ml
Generated Events
 

Up(ESuspect)
Dn(EMerge)
Dn(ETimer)
Testing
 

9.13   MFLOW

This layer implements window-based flow control for multicast messages. Multicast messages from each sender are transmitted only if the amount of send credit left is greater than zero. The protocol attempts to avoid situations where all receivers send credit at the same time, so that a sender is not flooded with credit messages.
Protocol
 

Whenever the amount of send credit drops to zero, messages are buffered without being sent. On receipt of acknowledgement credit, the amount of send credit is recalculated and buffered messages are sent based on the new credit.
Parameters
 

Properties
 

Notes
 

Sources
 

layers/mflow.ml
Testing
 

This layer and its documentation were written with Zhen Xiao.

9.14   MNAK

The MNAK (Multicast NAK) layer implements a reliable, agreed, FIFO-ordered broadcast protocol. Broadcast messages from each sender are delivered in FIFO-order at their destinations. Messages from live members are delivered reliably and messages from failed members are retransmitted by the coordinator of the group. When all failed members are marked as such, the protocol guarantees that eventually all live members will have delivered the same set of messages.
Protocol
 

Uses a negative acknowledgment (NAK) protocol: when messages are detected to be out of order (or the NumCast field in an Up(EStable) event reveals missing messages), a NAK is sent. The NAK is sent in one of three ways, chosen in the following order:
  1. Pt2pt to the sender, if the sender is not failed.
  2. Pt2pt to the coordinator, if the receiver is not the coordinator.
  3. Broadcast to the rest of the group if the receiver is the coordinator.
All broadcast messages are buffered until stable.
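The three-way choice of NAK destination can be sketched as follows; the nak_dest type and helper are ours, for illustration only:

```ocaml
(* Where a retransmission request is sent, per the rules above.
   Ranks are integers; the constructors are illustrative, not
   Ensemble's. *)

type nak_dest =
  | Pt2pt of int   (* unicast to this rank *)
  | Bcast          (* broadcast to the rest of the group *)

let nak_destination ~sender ~sender_failed ~me ~coord =
  if not sender_failed then Pt2pt sender   (* 1. ask the sender *)
  else if me <> coord then Pt2pt coord     (* 2. ask the coordinator *)
  else Bcast                               (* 3. we are the coordinator *)

let () =
  assert (nak_destination ~sender:3 ~sender_failed:false ~me:1 ~coord:0
          = Pt2pt 3);
  assert (nak_destination ~sender:3 ~sender_failed:true ~me:1 ~coord:0
          = Pt2pt 0);
  assert (nak_destination ~sender:3 ~sender_failed:true ~me:0 ~coord:0
          = Bcast)
```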
Parameters
 

Properties
 

Sources
 

layers/mnak.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

9.15   OPTREKEY

This layer is part of the dWGL suite. Together with RealKeys, it implements the dWGL rekeying algorithm. The specific task performed by OptRekey is computing the new group keygraph.

Figure 6: The effect of a leave on the key-graph of a group G of eight members. (a) The initial keygraph. (b) The tree after member p1 leaves. (c) The merged tree.


Briefly, a keygraph is a graph in which the group members form the leaves and the inner nodes are shared subkeys. A member knows all the keys on the route from itself to the root. The top key is the group key, known by all members. For example, Figure 6(a) depicts a group G of eight members {p1 ... p8} and their subkeys. When a member leaves the group, all the keys known to it must be discarded; this splits the keygraph into a set of subtrees. Figure 6(b) shows G after member p1 has left. To re-merge the group keygraph, the subtrees must be merged, as shown in Figure 6(c). A subleader is the leader of a subtree. In our example, member p2 is the leader of {p2}, p3 is the leader of {p3,p4}, and p5 is the leader of {p5,p6,p7,p8}.
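The "route to the root" can be made concrete with a heap-style layout of a complete binary keygraph; this layout and the helper below are our own illustration, not Ensemble's representation:

```ocaml
(* Illustrative keygraph layout for n = 2^d members.  Inner nodes
   (the keys) are laid out heap-style: node k has children 2k+1 and
   2k+2; nodes 0 .. n-2 are keys, and member i sits at leaf (n-1)+i.
   The keys member i knows are the inner nodes on its path to the
   root; node 0 is the group key, known by everyone. *)

let keys_known n i =
  let rec up node acc =
    if node = 0 then 0 :: acc
    else up ((node - 1) / 2) (node :: acc)
  in
  (* keep only inner nodes: leaves (indices >= n-1) carry no key *)
  List.filter (fun k -> k < n - 1) (up ((n - 1) + i) [])

let () =
  (* In an 8-member group each member knows log2(8) = 3 keys ... *)
  assert (List.length (keys_known 8 0) = 3);
  (* ... and every member knows the root key (the group key). *)
  assert (List.mem 0 (keys_known 8 7))
```

When a member leaves, every key on its path must be discarded, which is exactly what fragments the graph into subtrees.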
Protocol
 

This layer is activated upon a Rekey action. The leader receives an ERekeyPrcl event and starts the OptRekey protocol. Typically, a Rekey follows a join or a leave, so the group keygraph is initially fragmented; this layer's task is to re-merge it. The protocol is as follows:
  1. The leader multicasts Start.
  2. Subleaders send their keygraphs to the leader.
  3. The leader computes an optimal new keygraph.
  4. The leader multicasts the new keygraph.
  5. Members receive the keygraph and send it up using a ERekeyPrcl event to the RealKeys layer.
An optimal keygraph is complex to compute, so an auxiliary module is used for this task. Note that OptRekey is designed so that only subleaders participate; in the normal case, where a single member joins or leaves, this involves log2(n) members. It is possible for a Rekey to be initiated even though the membership hasn't changed. This case is handled specially, since it can be executed with almost no communication.
Properties
 

Sources
 

layers/optrekey.ml
layers/util/tree.ml,mli
layers/type/tdefs.ml,mli
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

9.16   PERFREKEY

This layer is responsible for common management tasks related to group rekeying. Above PerfRekey sits a rekeying layer; at the time of writing there are four options: Rekey, RealKeys+OptRekey, Rekey_dt, and Rekey_diam. The Rekey layer implements a very simple rekeying protocol; the RealKeys and OptRekey layers together implement the dWGL protocol; Rekey_dt implements a dynamic-tree-based protocol; and Rekey_diam uses a diamond-like graph.
Protocol
 

The layer comes into effect when a Rekey operation is initiated by the user. The operation is bounced by the Bottom layer as a Rekey event and received at PerfRekey. From this point, the following protocol is used:
  1. The Rekey action is diverted to the leader.
  2. The leader initiates the rekey sequence by passing the request up to Rekey/OptRekey/Rekey_dt/Rekey_diam.
  3. Once rekeying is done, the members pass a RekeyPrcl event with the new group key back down.
  4. PerfRekey logs the new group key. A tree spanning the group is computed, through which acks will propagate; the leaves send Acks up the tree.
  5. When Acks from all the children are received at the leader, it prompts the group for a view change.
In the upcoming view, the new key will be installed. Another rekeying flavor includes a Cleanup stage. Every couple of hours, the set of cached secure channels and other keying material should be removed; this prevents an adversary from using cryptanalysis to break the set of symmetric keys in use by the system. To this end, PerfRekey supports an optional cleanup stage prior to the actual rekeying. This sub-protocol works as follows:
  1. The leader multicasts a Cleanup message.
  2. All members remove all their cached key-material from all security layers. A ERekeyCleanup event is sent down to Secchan, bounced up to Rekey/OptRekey+RealKeys/.., and bounced back down to PerfRekey.
  3. All members send CleanupOk to the leader through the Ack-tree.
  4. When the leader receives CleanupOk from all the members, it starts the Rekey protocol itself.
By default, cleanup is performed every 24 hours; this is a settable parameter that the application can decide upon. Rekeying may fail due to member failure or due to a merge that occurs during the execution. In this case, the new key is discarded and the old key is kept. PerfRekey supports persistent rekeying: when the 24-hour timeout expires, a rekey will ensue no matter how many failures occur. The Top layer checks that all members in a view are trusted; any untrusted member is removed from the group through a Suspicion event. Trust is established using the Exchange layer and the user's access control policy.
Properties
 

Parameters
 

Sources
 

layers/perfrekey.ml
Generated Events
 

EPrompt
ERekeyPrcl
Dn(ECast)
Dn(ESend)
Testing
 

9.17   PRIMARY

Detect primary partition in a group. Usually a primary partition has the majority of members or holds some important resources.
Protocol
 

Upon an Up(EInit) event, a member sends a message to the coordinator, claiming that it is in the current view. When a view has a majority of the members, its coordinator prompts a view change to make itself the primary partition, if it is not already. When a new view is ready, it decides whether it is primary and marks it as such.
Parameters
 

Properties
 

Sources
 

layers/primary.ml
Generated Events
 

Dn(EPrompt)
Dn(ESend)
Testing
 

This layer and its documentation were written with Zhen Xiao.

9.18   PT2PT

This layer implements reliable point-to-point message delivery. [TODO: finish this documentation]
Parameters
 

Testing
 

9.19   PT2PTW

This layer implements window-based flow control for point to point messages. Point-to-point messages from each sender are transmitted only if the window is not yet full.
Protocol
 

Whenever the amount of send credit drops to zero, messages are buffered without being sent. On receipt of acknowledgement credit, the amount of send credit is recalculated and buffered messages are sent based on the new credit. Acknowledgements are sent whenever a specified threshold is passed.
Parameters
 

Properties
 

Notes
 

Sources
 

layers/pt2ptw.ml
Testing
 

Last updated: March 21, 1997

9.20   PT2PTWP

This layer implements an adaptive window-based flow control protocol for point-to-point communication between the group members. In this protocol, the receiver's buffer space is shared among all group members: the receiver's window is divided among the senders according to the bandwidth of the data being received from each sender. This way of sharing attempts to minimize the number of ack messages, i.e., to increase message efficiency.
Protocol
 

In the following, the term acknowledgement is used in the sense of flow control protocols, not that of reliable communication protocols. This protocol uses credit to measure the available buffer space at the receiver's side. Each sender maintains a window per destination, which is used to bound the unacknowledged data a process can send point-to-point to that destination. For each message it sends, the process deducts a certain amount of credit, based on the size of the message. Messages are transmitted only if the sender has enough credit for them; otherwise, messages are buffered at the sender.

A receiver keeps track of the amount of unacknowledged data it has received from each sender. Whenever it decides to acknowledge a sender, it sends a message containing a new amount of credit for that sender. On receipt of an acknowledgement message, the sender recalculates the amount of credit for this receiver, and buffered messages are sent based on the new credit.

The receiver measures the bandwidth of the data being received from each sender. It starts with zero bandwidth and adjusts it periodically with timeout pt2ptwp_sweep. On receipt of a point-to-point message, the receiver checks whether the sender has passed the threshold of its window, i.e., whether the amount of data in point-to-point messages received from this sender since the last ack was sent to it exceeds a certain ratio, pt2ptwp_ack_thresh, of the sender's window. If so, an ack with some credit has to be sent to the sender. To adjust processes' windows according to their bandwidth, the receiver attempts to steal some credit from an appropriate process and add it to the sender's window: it looks for a process with maximal window/bandwidth ratio, decreases its window by a certain amount of credit, and increases the window of the sender appropriately. Then the receiver sends the sender an ack with the new amount of credit. When the process from which the credit was stolen passes the threshold of its new, smaller window, the receiver sends an ack to it.
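The credit-stealing choice, picking the process with the maximal window/bandwidth ratio, can be sketched as follows (an illustrative helper of our own, not the layer's actual code):

```ocaml
(* Pick the rank whose window/bandwidth ratio is largest; credit is
   stolen from it and granted to a busier sender.  A zero-bandwidth
   process is the most attractive victim, so its ratio is treated as
   infinite.  Illustrative only; real window sizes and bandwidth
   measurements live in pt2ptwp.ml. *)

let steal_victim windows bandwidths =
  let ratio i =
    if bandwidths.(i) = 0.0 then infinity
    else float_of_int windows.(i) /. bandwidths.(i)
  in
  let best = ref 0 in
  Array.iteri (fun i _ -> if ratio i > ratio !best then best := i) windows;
  !best

let () =
  (* Equal windows: the slowest sender (lowest bandwidth) loses credit. *)
  assert (steal_victim [| 100; 100; 100 |] [| 1.0; 10.0; 5.0 |] = 0);
  (* An idle (zero-bandwidth) process is always the first victim. *)
  assert (steal_victim [| 10; 10 |] [| 0.0; 5.0 |] = 0)
```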
Parameters
 

Properties
 

Notes
 

Sources
 

layers/pt2ptwp.ml
Testing
 

9.21   REALKEYS

This layer is part of the dWGL suite. Together with OptRekey, it implements the dWGL protocol. This layer's task is to actually carry out the instructions passed to it from OptRekey: generate and securely pass all group subkeys, and finally the group key.
Protocol
 

When a Rekey operation is performed, a complex set of layers and protocols is set in motion. Eventually, each group member receives a new keygraph and a set of instructions describing how to merge its partial keytree with the rest of the group keytrees to achieve a unified group tree. The head of the keytree is the group key. The instructions are carried out in several stages by the subleaders:
  1. Choose new keys and send them securely to peer subleaders using secure channels.
  2. Receive new keys through secure channels. Disseminate these keys by encrypting them with the top subtree key and sending them point-to-point to the leader.
  3. When the leader has received all second-stage messages, it bundles them into a single multicast and sends it to the group.
  4. A member p that receives the multicast extracts the set of keys it should know. Member p then creates an ERekeyPrcl event with the new group key attached; the event is sent down to PerfRekey, notifying it that the protocol is complete.
Properties
 

Sources
 

layers/realkeys.ml
layers/type/tdefs.ml,mli
Generated Events
 

ESecureMsg
Dn(ECast)
Dn(ESend)
Testing
 

9.22   REKEY

This layer switches the group key upon request. There may be several reasons for switching the key: This layer also relies on the Secchan layer to create secure channels when required. A secure channel is essentially a way to pass confidential information between two endpoints. The Secchan layer creates secure channels on demand and caches them for future use. This allows the new group key to be disseminated efficiently and confidentially through the tree.
Protocol
 

When a member's layer gets an ERekeyPrcl event, it sends a message to the coordinator to start the rekeying process. The coordinator generates a new key and sends it to its children using secure channels; the children pass it down the tree. Once a member receives the new key, it passes it down to PerfRekey using an ERekeyPrcl event. The PerfRekey layer is responsible for collecting acknowledgments from the members and performing a view change with the new key once dissemination is complete.
Parameters
 

Properties
 

Sources
 

layers/rekey.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

This layer was originally written by Mark Hayden with Zhen Xiao. Ohad Rodeh later rewrote the security layers and related infrastructure.

9.23   REKEY_DT

This is the default rekeying layer. The basic data structure used is a tree of secure channels. This tree changes on every view change, hence the name: Dynamic Tree REKEY.

The basic problem in obtaining efficient rekeying is the high cost of constructing secure channels. A secure channel is established with a two-way handshake using a Diffie-Hellman exchange. At the time of writing, a Pentium III 500MHz can perform one side of a Diffie-Hellman exchange (using the OpenSSL cryptographic library) in 40 milliseconds; this is a heavyweight operation.

To discuss the set of channels in a group, we view it as a graph where the nodes are group members and the edges are the secure channels connecting them. The strategy employed by REKEY_DT is to use a tree graph. When a rekey request is made by a user in some view V, the leader multicasts a tree structure that reuses, as much as possible, the existing set of edges. For example, if the view is composed of several previous components, the leader attempts to merge the existing key-trees together. If a single member joins, it is placed as close to the root as possible, for better tree balancing. If a member leaves, the tree may, in the worst case, split into three pieces; the leader fuses them together using (at most) two new secure channels. The leader then chooses a new key and passes it to its children. The key is passed recursively down the tree until it reaches the leaves, and the leaf nodes send acknowledgments back to the leader.

This protocol has very good performance. It is even possible that a rekey will not require any new secure channels, for example after a member leave where the departed node was a tree leaf.
Protocol
 

When a member layer gets an ERekeyPrcl event, it sends a message to the coordinator to start the rekeying process. The coordinator checks whether the view is composed of a single tree component. If not, it multicasts a Start message. All members that are tree roots send their tree structures to the leader. The leader merges the trees together and multicasts the group tree. It then chooses a new key and sends it down the tree. Once a member receives the new key, it passes it down to PerfRekey using an ERekeyPrcl event. The PerfRekey layer is responsible for collecting acknowledgments from the members and performing a view change with the new key once dissemination is complete.
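The recursive key-dissemination step can be sketched as follows. This is a minimal illustration, not the layer's actual code: the tree type and the function name are invented for the example, and each edge stands for an existing secure channel.

```ocaml
(* Illustrative tree of secure channels: each node is a member rank,
   each parent-child edge an established secure channel. *)
type tree = Node of int * tree list

(* Returns the (parent, child) channels the new key travels over,
   in top-down order, as the key is passed recursively to the leaves. *)
let rec disseminate (Node (rank, children)) =
  List.concat_map
    (fun (Node (child_rank, _) as child) ->
       (rank, child_rank) :: disseminate child)
    children

let () =
  (* leader 0 with children 1 and 2; member 2 has child 3 *)
  let t = Node (0, [ Node (1, []); Node (2, [ Node (3, []) ]) ]) in
  assert (disseminate t = [ (0, 1); (0, 2); (2, 3) ])
```

No new Diffie-Hellman exchanges occur along these edges: the key is encrypted under each channel's existing symmetric key.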
Sources
 

layers/rekey_dt.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

9.24   REKEY_DIAM

This layer is closely related to REKEY_DT. It employs the same concept of a graph whose nodes are group members and whose edges are secure channels connecting them. REKEY_DIAM attempts to improve the efficiency of a rekey after a member leave. The point is to support ACL changes efficiently: if the application decides to change its ACL and remove a member, the group key must be switched as quickly as possible, because as long as the previous group key is in place, the untrusted member can eavesdrop on group messaging. The key to a low-latency rekey protocol is the elimination of costly Diffie-Hellman exchanges on its critical path. A simple possibility is to arrange the members in a circle. If a member is removed, the circle is still one-connected, and confidential information can still pass through it. After the initial rekey, a reconstruction phase is initiated, during which a new circle connecting all surviving members is constructed. The problem with the circle structure is that it has O(n) diameter. Since the diameter determines the latency of the protocol, we require a structure with logarithmic diameter. We use a diamond graph; see examples in Figure 7.



Figure 7: Examples for diamonds


The protocol handles merges, partitions, and diamond-graph balancing. It guarantees very low latency for the case of a member leave; we clocked it at four milliseconds in 20-member groups.
Sources
 

layers/rekey_diam.ml

9.25   SECCHAN

This layer is responsible for sending and receiving private messages to/from group members. Privacy is guaranteed through the creation and maintenance of secure channels. A secure channel is, essentially, a symmetric key (unrelated to the group key) agreed upon between two members. This key is used to encrypt any confidential message sent between them. We allow layers above Secchan to send/receive confidential information using SecureMsg events. When a SecureMsg(dst,data) event arrives at Secchan, a secure channel to member dst is created (if one does not already exist). Then, the data is encrypted using the secure channel key and reliably sent to dst. This layer relies on an authentication engine - this is provided in system independent form by the Auth module. Currently, PGP is used for authentication. New random shared keys are generated by the Security module. The Security module also provides hashing and symmetric encryption functions. Currently RC4 is used for encryption and MD5 is used for hashing.
Protocol
 

A secure channel between members p and q is created using the following basic protocol:
  1. Member p chooses a new random symmetric key kpq. It creates a ticket to q that includes kpq using the Auth module ticket facility. Essentially, Auth encrypts kpq with q's public key and signs it using p's private key. Member p then sends the ticket to q.
  2. Member q authenticates and decrypts the message, and sends an acknowledgment (Ack) back to p.
This two-phase protocol is used to prevent the occurrence of a double channel. By this we mean the case where p and q open secure channels to each other at the same time. We augment the Ack phase; q discards p's ticket if:
  1. q has already started opening a channel to p
  2. q has a larger4 name than p.
Secchan also keeps the number of open channels, per member, below the secchan_cache_size configuration parameter. Regardless, a channel is closed if its lifetime exceeds 8 hours (the settable secchan_ttl parameter). A two-phase protocol is used to close a channel. If members p and q share a channel that p created, then p sends a CloseChan message to q, and q responds with a CloseChanOk. It often happens that many secure channels are created simultaneously group-wide, for example during the first rekey of a group. If we tore down all these channels exactly 8 hours from their inception, the group would experience an explosion of management traffic. To prevent this, we stagger channel teardown times: upon creation, a channel's maximal lifetime is set to 8 hours + I seconds, where I is a random integer in the range [0 .. secchan_rand]. secchan_rand is set by default to 200 seconds, which we view as sufficient.
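The two rules above can be sketched briefly. This is an illustration of the logic only, assuming invented function names; the actual layer applies polymorphic comparison to full endpoint names, as footnote 4 notes.

```ocaml
(* Double-channel tie-break from the Ack phase: q discards p's ticket
   exactly when q has itself started opening a channel to p and q's
   name is larger (polymorphic comparison). *)
let discard_ticket ~q_opening_to_p ~q_name ~p_name =
  q_opening_to_p && q_name > p_name

(* Staggered teardown: maximal lifetime = secchan_ttl plus a random
   stagger of up to secchan_rand seconds.  Values mirror the defaults
   described in the text. *)
let secchan_ttl = 8. *. 3600.   (* seconds *)
let secchan_rand = 200

let channel_lifetime () =
  secchan_ttl +. float_of_int (Random.int (secchan_rand + 1))

let () =
  assert (discard_ticket ~q_opening_to_p:true ~q_name:"q" ~p_name:"p");
  assert (not (discard_ticket ~q_opening_to_p:false ~q_name:"q" ~p_name:"p"));
  let l = channel_lifetime () in
  assert (l >= secchan_ttl && l <= secchan_ttl +. float_of_int secchan_rand)
```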
Properties
 

Parameters
 

Sources
 

layers/secchan.ml
layers/msecchan.ml
Generated Events
 

EChannelList
ESecureMsg
Dn(ECast)
Dn(ESend)
Testing
 

9.26   SEQUENCER

This layer implements a sequencer based protocol for total ordering.
Protocol
 

One member of the group serves as the sequencer. Any member that wishes to send messages sends them point-to-point to the sequencer. The sequencer delivers each message locally and casts it to the rest of the group. Other members deliver a message as soon as they receive the corresponding cast from the sequencer. If a view change occurs, messages are tagged as unordered and are sent as such. When the Up(EView) event arrives, indicating that the group has successfully been flushed, these messages are delivered in a deterministic order everywhere (according to the ranks of their senders, breaking ties using FIFO).
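The ordering rule can be illustrated with a minimal sketch of the happy path only (no view changes); the types and names here are invented for the example and are not the layer's code.

```ocaml
(* A message as it reaches the sequencer, point-to-point. *)
type msg = { sender : int; payload : string }

(* The sequencer assigns global sequence numbers in arrival order and
   casts them; every member delivers in this numbered order, so all
   members see the same total order regardless of who sent what. *)
let sequencer_order arrived =
  List.mapi (fun seqno m -> (seqno, m)) arrived

let () =
  let arrived =
    [ { sender = 2; payload = "a" }; { sender = 0; payload = "b" } ] in
  let ordered = sequencer_order arrived in
  assert (List.map fst ordered = [0; 1]);
  (* arrival order at the sequencer, not sender rank, decides order *)
  assert ((snd (List.nth ordered 0)).sender = 2)
```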
Parameters
 

Properties
 

Sources
 

layers/sequencer.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

This layer and its documentation were written by Roy Friedman.

9.27   SLANDER

This protocol is used to share suspicions between members of a partition. This way, if one member suspects another member of being faulty, the coordinator is informed so that the faulty member is removed, even if the coordinator does not detect the failure itself. This ensures that partitions will occur even in the case of asymmetric network failures. Without this protocol, the faulty member would be removed only when the coordinator itself notices the failure.
Protocol
 

The protocol works by broadcasting slander messages to other members whenever it receives a new Suspect event. On the receipt of such a message, DnSuspect events are generated.
Parameters
 

Properties
 

Sources
 

layers/slander.ml
Generated Events
 

Dn(ESuspect)
Testing
 

This layer and its documentation were written by Zhen Xiao.

9.28   STABLE

This layer tracks the stability of broadcast messages and does failure detection. It keeps track of and gossips about an acknowledgement matrix, from which it occasionally computes the number of messages from each member that are stable and delivers this information in a Dn(EStable) event to the layer below (which will be bounced back up by a layer such as the BOTTOM layer).
Protocol
 

The stability protocol consists of each member keeping track of its view of an acknowledgement matrix. In this matrix, each entry (A,B) corresponds to the number of member B's messages that member A has acknowledged (the diagonal entries (A,A) contain the number of broadcast messages sent by member A). The minimum of column A (disregarding entries for failed members) is the number of broadcast messages from A that are stable. The vector of these minimums is called the stability vector. The maximum of column A (disregarding entries of failed members) is the number of broadcast messages member A has sent that are held by at least one live member. The vector of these maximums is called the NumCast vector [there has got to be a better name]. Occasionally, each member gossips its row to the other members in the group. Occasionally, the protocol layer recomputes the stability and NumCast vectors and delivers them in a Dn(EStable) event. To prevent a message storm when members gossip their stability vectors, each member adds an initial time-delta to its timer. The deltas are spread between zero and stable_spacing based on member rank. For example, if there are 10 members and stable_spacing is set to 1 second, then the deltas for members zero through nine are: 0.0, 0.1, ..., 0.9.
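The two vector computations can be sketched directly from the definitions above. This is an illustration assuming the matrix is a plain int array array with a parallel "failed" flag array; the names are not the layer's actual ones.

```ocaml
(* acks.(a).(b) = number of member b's broadcasts acknowledged by a;
   rows of failed members are skipped, per the text. *)
let column_fold f init acks failed col =
  let n = Array.length acks in
  let acc = ref init in
  for row = 0 to n - 1 do
    if not failed.(row) then acc := f !acc acks.(row).(col)
  done;
  !acc

(* Stability vector: per-column minimum over live members. *)
let stability acks failed =
  Array.init (Array.length acks) (column_fold min max_int acks failed)

(* NumCast vector: per-column maximum over live members. *)
let numcast acks failed =
  Array.init (Array.length acks) (column_fold max 0 acks failed)

let () =
  let acks = [| [| 3; 2; 5 |]; [| 2; 2; 4 |]; [| 3; 1; 5 |] |] in
  let failed = [| false; true; false |] in   (* member 1 is failed *)
  assert (stability acks failed = [| 3; 1; 5 |]);
  assert (numcast acks failed = [| 3; 2; 5 |])
```

The stability vector says how many of each member's broadcasts everyone live has acknowledged and can thus be garbage-collected; the NumCast vector says how many are held by at least one live member.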
Parameters
 

Properties
 

Notes
 

Sources
 

layers/stable.ml
Generated Events
 

Up(EStable)
Dn(ECast)
Dn(ETimer)
Testing
 

9.29   SUSPECT

This layer regularly pings other members to check for suspected failures. Suspected failures are announced in a Dn(ESuspect) event to the layer below (which will be bounced back up by a layer such as the BOTTOM layer).
Protocol
 

Simple pinging protocol using a sweep interval. On each sweep, Ping messages are broadcast unreliably to the entire group. Also, the number of sweep rounds since the last Ping was received from each other member is checked; if it exceeds the max_idle threshold, a Dn(ESuspect) event is generated. To prevent a message storm when members' sweep timers expire, each member adds an initial time-delta to its sweep timer. The deltas are spread between zero and suspect_spacing based on member rank. For example, if there are 10 members and suspect_spacing is set to 1 second, then the deltas for members zero through nine are: 0.0, 0.1, ..., 0.9.
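The sweep bookkeeping and the stagger computation can be sketched as follows, assuming invented names; the real layer keeps this state in its event handlers.

```ocaml
(* Initial time-delta: rank's share of the spacing interval, matching
   the 0.0 .. 0.9 example for 10 members and 1 second spacing. *)
let stagger_delta ~rank ~nmembers ~spacing =
  spacing *. float_of_int rank /. float_of_int nmembers

(* One sweep: idle.(r) counts sweep rounds since the last Ping from
   member r.  Bump the counters and return the ranks to suspect. *)
let sweep ~max_idle idle my_rank =
  let suspects = ref [] in
  Array.iteri
    (fun rank rounds ->
       if rank <> my_rank then begin
         idle.(rank) <- rounds + 1;
         if idle.(rank) > max_idle then suspects := rank :: !suspects
       end)
    idle;
  List.rev !suspects

let () =
  assert (stagger_delta ~rank:9 ~nmembers:10 ~spacing:1.0 = 0.9);
  let idle = [| 0; 3; 0 |] in
  (* member 1 has been silent for 3 rounds; this sweep pushes it over *)
  assert (sweep ~max_idle:3 idle 0 = [1])
```

In the layer itself, each returned rank would be announced in a Dn(ESuspect) event, and a received Ping resets that member's idle counter to zero.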
Parameters
 

Properties
 

Notes
 

Sources
 

layers/suspect.ml
Generated Events
 

Dn(ESuspect)
Dn(ECast)
Dn(ETimer)
Testing
 

9.30   SYNC

This layer implements a protocol for blocking a group during view changes. One member initiates the SYNC protocol by delivering a Dn(EBlock) event from above. Other members will receive an Up(EBlock) event. After replying with a Dn(EBlockOk), the layers above the SYNC layer should not broadcast any further messages. Eventually, after all members have responded to the Up(EBlock) and all broadcast messages are stable, the member that delivered the Dn(EBlock) event will receive an Up(EBlockOk) event.
Protocol
 

This protocol is very inefficient and needs to be reimplemented at some point. The Block request is broadcast by the coordinator. All members respond with another broadcast. When the coordinator gets all replies, it delivers up an Up(EBlockOk) event.
Parameters
 

Properties
 

Sources
 

layers/sync.ml
Generated Events
 

Up(EBlockOk)
Dn(EBlock)
Dn(ECast)
Testing
 

9.31   TOTEM

This layer implements the rotating token protocol for total ordering. (This is a variation on the protocol developed as part of the Totem project.)
Protocol
 

The protocol here is fairly simple: As soon as the stack becomes valid, the lowest ranked member starts rotating a token in the group. In order to send a message, a process must wait for the token. When the token arrives, all buffered messages are broadcast, and the token is passed to the next member. The token must be passed on even if there are no buffered messages. If a view change occurs, messages are tagged as unordered and are sent as such. When the Up(EView) event arrives, indicating that the group has successfully been flushed, these messages are delivered in a deterministic order everywhere (according to the ranks of their senders, breaking ties using FIFO).
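The token-passing step can be sketched with an invented helper; this models only the rotation rule, not the token's loss recovery or the view-change path.

```ocaml
(* One token visit: the holder flushes its buffered messages, then the
   token moves to the next member in rank order, wrapping around.
   The token is passed on even when the holder sent nothing. *)
let rotate_once buffers nmembers token =
  let sent = buffers.(token) in
  buffers.(token) <- [];
  (sent, (token + 1) mod nmembers)

let () =
  let buffers = [| [ "m0" ]; []; [ "m2a"; "m2b" ] |] in
  let sent, token = rotate_once buffers 3 0 in
  assert (sent = [ "m0" ] && token = 1);
  let sent, token = rotate_once buffers 3 token in
  assert (sent = [] && token = 2)   (* passed on with no messages *)
```

Because every broadcast happens while holding the unique token, the order of broadcasts is total by construction.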
Parameters
 

Properties
 

Sources
 

layers/totem.ml
Generated Events
 

Dn(ECast)
Testing
 

This layer and its documentation were written by Roy Friedman.

9.32   WINDOW

This layer implements window-based flow control based on stability information. Multicast messages from each sender are sent only if the number of unacknowledged messages from the sender is smaller than the window.
Protocol
 

Whenever the number of unstable messages goes above the window, messages are buffered without being sent. On receipt of a stability update, the number of unstable messages is recalculated and buffered messages are sent as the window allows.
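The drain-on-stability-update step can be sketched as follows, assuming an invented helper that splits the buffer by the remaining window room.

```ocaml
(* Given the recalculated unstable count, send as many buffered
   messages as the window permits; the rest stay buffered. *)
let drain ~window ~unstable buffered =
  let room = max 0 (window - unstable) in
  let rec split n = function
    | x :: rest when n > 0 ->
        let to_send, keep = split (n - 1) rest in
        (x :: to_send, keep)
    | rest -> ([], rest)
  in
  split room buffered

let () =
  (* window 3, one message still unstable: room for two more *)
  assert (drain ~window:3 ~unstable:1 [ "a"; "b"; "c" ]
          = ([ "a"; "b" ], [ "c" ]));
  (* window full: everything stays buffered *)
  assert (drain ~window:3 ~unstable:3 [ "a" ] = ([], [ "a" ]))
```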
Parameters
 

Properties
 

Notes
 

Sources
 

layers/window.ml
This layer and its documentation were written by Takako Hickey.

9.33   XFER

This protocol facilitates application based state-transfer. The view structure contains a boolean field xfer_view conveying whether the current view is one where state-transfer is taking place (xfer_view = true) or whether it is a regular view (xfer_view = false).
Protocol
 

It is assumed that an application initiates state-transfer after a view change occurs. In the initial view, xfer_view = true. In a fault free run, each application sends pt-2-pt and multicast messages, according to its state-transfer protocol. Once the application-protocol is complete, an XferDone action is sent to Ensemble. This action is caught by the Xfer layer, where each member sends a pt-2-pt message XferMsg to the leader. When the leader collects XferMsg from all members, the state-transfer is complete, and a new view is installed with the xfer_view field set to false. When faults occur, and members fail during the state-transfer protocol, new views are installed with xfer_view set to true. This informs applications that state-transfer was not completed, and they can restart the protocol.
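The leader's completion check in the fault-free run is simple to state; the sketch below is illustrative (the flag array and function name are invented).

```ocaml
(* received.(r) is true once an XferMsg has arrived from member r;
   the state transfer completes when every member has reported. *)
let xfer_complete received = Array.for_all (fun x -> x) received

let () =
  assert (not (xfer_complete [| true; false; true |]));
  assert (xfer_complete [| true; true; true |])
```

When this becomes true, the leader installs a new view with xfer_view set to false; any failure during the transfer instead produces a new view with xfer_view still true, so applications restart the transfer.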
Notes
 

Parameters
 

Properties
 

Sources
 

layers/xfer.ml

9.34   ZBCAST

The ZBCAST layer implements a gossip-style probabilistically reliable multicast protocol. Unlike most other protocols in Ensemble, this protocol admits a small, but non-zero probability of message loss: a message might be garbage collected even though some operational member in the group has not received it yet. We found that doing so can offer dramatic improvements in the performance and scalability of the protocol.
Protocol
 

This protocol is composed of two sub-protocols structured roughly as in the Internet MUSE protocol. The first protocol is an unreliable multicast protocol which makes a best-effort attempt to efficiently deliver each message to its destinations. The second protocol is a 2-phase anti-entropy protocol that operates in a series of unsynchronized rounds. During each round, the first phase detects message loss; the second phase corrects such losses and runs only if needed.
Parameters
 

Properties
 

Sources
 

layers/zbcast.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Up(ELostMessage)
Testing
 

This layer and its documentation were written by Zhen Xiao. It is based on the PBCAST protocol implemented by Mark Hayden. This documentation is based on the Bimodal Multicast paper.

9.35   VSYNC

Virtual synchrony is decomposed into a set of independent protocol layers, listed in Table 3. The layers in this stack are described in the layer section.

name purpose
LEAVE reliable group leave
INTER inter-group view management
INTRA intra-group view management
ELECT leader election
MERGE reliable group merge
SYNC view change synchronization
PT2PT FIFO, reliable pt2pt
SUSPECT failure suspicions
STABLE broadcast stability
MNAK FIFO, agreed broadcast
BOTTOM bare-bones communication


Table 3: Virtual synchrony protocol stack


[TODO: here describe the overall protocol created by composing all the protocol layers]
Parameters
 

Protocol
 

[TODO: composition of protocols below]
Properties
 

Notes
 

Testing
 

A   Appendix: ML Does Not Allow Segmentation Faults

Normally, Ensemble should never experience segmentation faults. When they occur, there are only a few possible causes. We list these below along with fixes. Please inform us if you detect other sources of ``unsafety'' in Ensemble.

B   Ensemble Membership Service TCP Interface

[This is intended as an appendix to the Maestro paper (Maestro: A Group Structuring Tool For Applications With Multiple Quality of Service Requirements). It describes the exact TCP messaging interface to the group membership service described in that paper.] The description here is of the nuts-and-bolts TCP interface to the maestro membership service described in the Ensemble tutorial. Ensemble also supports a direct interface to this service in ML. Developers using ML should probably use this interface instead. See appl/maestro/*.mli for the source code for the interface to this service.

B.1   Locating the service

The membership service uses the environment variable ENS_GROUPD_PORT to select a TCP port number to use. Client processes connect to this port in the normal fashion for TCP services. Client processes can join any number of groups over a single connection to a server, so they normally only connect once to the servers. If you run groupd on all the hosts from which your clients will be using the service, then processes can connect to the local port on their host. However, clients are not limited to using local servers, and can connect to any membership server on their system. If the TCP connection breaks, the membership service will fail the member from all groups that it joined. However, a client can reconnect to the same server and rejoin the groups it was in. If a client's membership server crashes, the client can reconnect to a different server.

B.2   Communicating with the service

Communication with the service is done through specially formatted messages. We describe the message types and their format here.
[messages:] Messages in both directions are formatted as follows. Both directions of the TCP streams are broken into variable-length packets. A packet has a header of 8 bytes, of which the first 4 bytes are an unsigned integer in network byte order (NBO) giving the length of the message body (not including the header). The next 4 bytes must be zero (this is done for internal reasons, which we shall not go into here). The next message follows immediately after the body.
[integers:] Integers are unsigned and are formatted as 4 bytes in NBO.
[strings:] Strings have a somewhat more complex format. The first 4 bytes are an integer length (unsigned, NBO). The body of the string immediately follows the length.
[endpoint and group identifiers:] These types have the same format as strings. For non-Ensemble applications, the contents can contain whatever the transport service you are using requires. Ensemble only tests the contents of endpoint and group identifiers for equality with other endpoints and groups.
[lists:] Lists have two parts. The first is an integer giving the number of elements in the list. Immediately following that are the elements of the list, one after the other with no padding between them. It is assumed that the application knows the formats of the items in the list in order to break them up.
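The framing and string encoding above can be sketched with the OCaml standard library alone; the helper names are invented for the example.

```ocaml
(* Append a 4-byte unsigned integer in network byte order. *)
let put_u32 buf n =
  List.iter
    (fun shift -> Buffer.add_char buf (Char.chr ((n lsr shift) land 0xff)))
    [ 24; 16; 8; 0 ]

(* A string is a 4-byte NBO length followed by its bytes. *)
let put_string buf s =
  put_u32 buf (String.length s);
  Buffer.add_string buf s

(* Frame a message body with the 8-byte packet header:
   4-byte body length, then 4 zero bytes. *)
let frame body =
  let buf = Buffer.create (String.length body + 8) in
  put_u32 buf (String.length body);
  put_u32 buf 0;                  (* second header word must be zero *)
  Buffer.add_string buf body;
  Buffer.contents buf

let () =
  let body = Buffer.create 16 in
  put_string body "grp";          (* a group identifier is string-formatted *)
  let pkt = frame (Buffer.contents body) in
  assert (String.length pkt = 8 + 4 + 3);
  assert (Char.code pkt.[3] = 7)  (* body length = 4-byte length + "grp" *)
```

A list would be written the same way: put_u32 for the element count, then each element's encoding back to back.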



Figure 8: Client state machine diagram of the client-server membership protocol.


The actual messages sent between the client and the servers are composed of integers and strings. The first field of a message is an integer tag value from which the format of the remainder of the message can be determined.

C   Bimodal Multicast (by Ken Birman, Mark Hayden, and Zhen Xiao)

There are many methods for making a multicast protocol reliable. The majority of protocols in Ensemble aim to provide virtually synchronous properties. However, these properties come at a price: the possibility of unstable or unpredictable performance under stress, and limited scalability. This is unacceptable to some applications, where system stability and scalability are viewed as inextricable from other aspects of reliability. This section describes a bimodal multicast protocol that not only has much better scalability properties but also provides predictable reliability even under highly perturbed conditions. This work is described in the Bimodal Multicast paper (ncstrl.cornell/TR98-1683) by Ken Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu and Yaron Minsky (this documentation is based on that paper). The original version of the protocol was implemented by Mark Hayden. It was reimplemented by Zhen Xiao with many new optimizations and is described in the ZBCAST layer. In the remainder of the section, we will refer to our new protocol as Zbcast.

C.1   Protocol description

The Zbcast protocol consists of two stages:

C.2   Usage

To use the Zbcast protocol, specify the ``Zbcast'' property on the command line as follows (using the perf demo as an example):

  perf -prog 1-n -add_prop Zbcast -groupd

This assumes that IP-multicast is available in the underlying network. Remember to set the related environment variables:

ENS_DEERING_PORT=38350
ENS_MODES=Deering:UDP

Otherwise the Gcast layer needs to be linked into the stack:

  perf -prog 1-n -add_prop Zbcast -add_prop Gcast -groupd

Note that in both cases we need groupd to track group membership information. This is the current state of the implementation and is not something intrinsic to the protocol. If sufficient need arises, we may remove this restriction. Message losses are reported to the application via an Up(ELostMessage) event. The application can either ignore those losses (e.g. multimedia applications) or leave the process groups and then rejoin them, triggering state transfer.
*
Thanks to Takako Hickey, Roy Friedman, Robbert van Renesse, Zhen Xiao, and Ohad Rodeh for descriptions of their contributions.
1
MAC, Message Authentication Code. This is typically a keyed hash function.
2
symmetric encryption/MAC is roughly 1000 times faster than equivalent public key operations.
3
An Ensemble address is comprised of a set of identifiers, for example an IP address and a PGP principal name. Generally, an address includes an identifier for each communication medium the endpoint is using {UDP,TCP,MPI,ATM,..}.
4
Polymorphic comparison is used here.

This document was translated from LATEX by HEVEA.