
9   Layers and Stacks

We document a subset of the Ensemble layers and stacks (compositions of layers) in this section. This documentation is intended to be largely independent of the implementation language. The layers are currently listed bottom-up, in the order of their use in the VSYNC stack. Each layer (or stack) has these items in its documentation:

9.1   ANYLAYER

The name of the layer followed by a general description of its purpose.
Protocol
 

A description of the protocol implemented by the layer.
Parameters
 

Properties
 

Notes
 

Sources
 

The source files for the ML implementation of the layer.
Generated Events
 

A list of event types generated by the layer. In the future, this field will contain more information, such as what event types are examined by the layer (instead of being blindly passed on). Hopefully, this information will eventually be generated automatically.
Testing
 

9.2   CREDIT

This layer implements credit-based flow control.
Protocol
 

On initialization, the sender informs the receivers how many credits it wants to keep in stock. A receiver sends credits whenever it finds that the sender is low on credits, either explicitly through a sender's request or implicitly through its local accounting. Credits are one-time use only. A sender may send a message only if it has a credit available; if it does not, the message is buffered, and buffered messages are sent when new credits arrive. To save bandwidth, credits are piggybacked on data messages whenever there is an opportunity to do so.
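The sender side of this scheme fits in a few lines of OCaml. This is a minimal sketch with hypothetical names (the state record and transmit callback are invented for illustration); it is not the actual code in layers/credit.ml.

    (* Credit-based sending: each send consumes one one-time credit;
       messages without credit wait in a buffer until credits arrive. *)
    type state = {
      mutable credits : int;        (* one-time-use send credits *)
      buffered : string Queue.t;    (* messages awaiting credit *)
    }

    let send st transmit msg =
      if st.credits > 0 then begin
        st.credits <- st.credits - 1;
        transmit msg
      end else
        Queue.push msg st.buffered

    (* New credits arrive (piggybacked or explicit): drain the buffer. *)
    let receive_credits st transmit n =
      st.credits <- st.credits + n;
      while st.credits > 0 && not (Queue.is_empty st.buffered) do
        st.credits <- st.credits - 1;
        transmit (Queue.pop st.buffered)
      done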
Parameters
 

Notes
 

Sources
 

layers/credit.ml

9.3   RATE

This layer implements sender-rate-based flow control: multicast messages from each sender are sent at a rate not exceeding some prescribed value.
Protocol
 

All the messages to be sent are buffered initially. Buffered messages are sent on periodic timeouts that are set based on the sender's rate.
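For illustration, the timeout logic can be sketched in OCaml as follows; the rate parameter (messages per second) and queue are hypothetical names, not the state of layers/rate.ml.

    (* Rate-based release: with a prescribed rate, one buffered
       message is released every 1/rate seconds. *)
    let interval rate = 1.0 /. rate

    (* On each timeout, send one buffered message (if any) and return
       the time at which the next timeout should fire. *)
    let on_timeout now rate buffered transmit =
      if not (Queue.is_empty buffered) then transmit (Queue.pop buffered);
      now +. interval rate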
Parameters
 

Notes
 

Sources
 

layers/rate.ml
This layer and its documentation were written by Takako Hickey.

9.4   BOTTOM

Not surprisingly, the BOTTOM layer is the bottommost layer in an Ensemble protocol stack. It interacts directly with the communication transport by sending/receiving messages and scheduling/handling timeouts. The properties implemented are all local to the protocol stack in which the layer exists: i.e., a Dn(EFail) event causes failed members to be removed from the local view of the group, but no failure message is sent out; it is assumed that some other layer actually informs the other members of the failure.
Protocol
 

None
Parameters
 

Properties
 

Sources
 

layers/bottom.ml
Generated Events
 

Up(EBlock)
Up(ECast)
Up(EExit)
Up(EFail)
Up(EStable)
Up(EMergeDenied)
Up(EMergeGranted)
Up(EMergeRequest)
Up(ESend)
Up(ESuspect)
Up(ETimer)
Up(EView)
Testing
 

9.5   CAUSAL

The CAUSAL layer implements causally ordered multicast. It assumes reliable, FIFO-ordered messaging from the layers below.
Protocol
 

The protocol has two versions: full and compressed vectors. First we explain the simple version, which uses full vectors; then we explain how these vectors are compressed.

Each outgoing message carries a causal vector. This vector records the last causally delivered message from each member in the group. Each received message is checked for deliverability: it may be delivered only if all messages which it causally follows, according to its causal vector, have been delivered. If it is not yet deliverable, it is delayed in the layer until delivery is possible. A view change erases all delayed messages, since they can never become deliverable.

Causal vectors grow with the group size, so they must be compressed in order for this protocol to scale. The compression we use is derived from the Transis system. We demonstrate with an example: assume the membership includes three processes p, q, and r. Process p sends message m_{p,1}, q sends m_{q,1} causally following m_{p,1}, and r sends m_{r,1} causally following m_{q,1}. The causal vector for m_{r,1} is [1|1|1]. There is redundancy in this vector: it is clear that m_{r,1} follows m_{r,0}, and since m_{q,1} follows m_{p,1}, we may omit stating that m_{r,1} follows m_{p,1}. To conclude, it suffices to state that m_{r,1} follows m_{q,1}. Using such optimizations, causal vectors may be compressed considerably.
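The full-vector deliverability test can be sketched in OCaml as follows; the array names are hypothetical, the convention that a message's own entry names its sequence number is an assumption, and the compressed representation is not shown.

    (* delivered.(i) counts the messages from member i delivered so
       far; v is the causal vector on a message from 'sender'. The
       message may be delivered only when everything it causally
       follows has already been delivered locally. *)
    let deliverable delivered sender v =
      let ok = ref true in
      Array.iteri
        (fun i vi ->
           (* the sender's own entry counts this message itself *)
           let need = if i = sender then vi - 1 else vi in
           if delivered.(i) < need then ok := false)
        v;
      !ok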
Sources
 

layers/causal.ml
Testing
 

This layer and its documentation were written by Ohad Rodeh.

9.6   ELECT

This layer implements a leader election protocol: it determines when a member should become the coordinator. Election is done by delivering a Dn(EElect) event at the new coordinator.
Protocol
 

When a member suspects all lower ranked members of being faulty, that member elects itself as coordinator.
Parameters
 

Properties
 

Sources
 

layers/elect.ml
Generated Events
 

Dn(EElect)
Testing
 

9.7   ENCRYPT

This layer encrypts application data for privacy, using keys from the view state record. Authentication must be provided by lower layers in the system; the protocol headers are not encrypted. This layer must reside above FIFO layers (for both sending and receiving) because it uses encryption contexts, whereby the encryption of a message depends on the previous messages from the same member. These contexts are dropped at the end of a view. A smarter protocol would try to maintain them, as they improve the quality of the encryption.
Protocol
 

Does chained encryption on the message payload in the iov field of events. Each member keeps track of the encryption state for all incoming and outgoing point-to-point and multicast channels. Messages marked Unreliable are not encrypted (these should not be application messages).
Parameters
 

Properties
 

Sources
 

layers/encrypt.ml
Generated Events
 

None
Testing
 

9.8   HEAL

This protocol is used to merge partitions of a group.
Protocol
 

The coordinator occasionally broadcasts the existence of its partition via Dn(EGossipExt) events. These are delivered unreliably to the coordinators of other partitions. If a coordinator decides to merge partitions, it prompts a view change and inserts the name of the remote coordinator in the Up(EBlockOk) event. The INTER protocol takes over from there. Merge cycles are prevented by only allowing merges from smaller view IDs to larger view IDs.
Parameters
 

Properties
 

Sources
 

layers/heal.ml
Generated Events
 

Up(EPrompt)
Dn(EGossipExt)
Testing
 

9.9   INTER

This protocol handles view changes that involve more than one partition (see also INTRA).
Protocol
 

Group merges are the more complicated part of the group membership protocol, so we constrain the conditions under which they can occur. The merge protocol works as follows:
  1. The merging coordinator blocks its group.
  2. The merging coordinator sends a merge request to the remote group's coordinator.
  3. The remote coordinator blocks its group.
  4. The remote coordinator installs a new view (with the mergers in it) and sends the view to the merging coordinator (through a merge-granted message).
  5. The merging coordinator installs the view in its group.
If the merging coordinator times out waiting for the remote coordinator, it immediately installs a new view in its partition (without the other members even finding out about the merge attempt).
Parameters
 

Properties
 

Sources
 

layers/inter.ml
Generated Events
 

Dn(EMerge)
Dn(EMergeDenied)
Dn(ESuspect)
Testing
 

9.10   INTRA

This layer manages group membership within a view (see also the INTER layer). It carries out three related tasks, described in the protocol section below.
Protocol
 

This is a relatively simple group membership protocol. We have done our best to resist the temptation to "optimize" special cases under which the group is "unnecessarily" partitioned. We also constrain the conditions under which operations such as merges can occur. The implementation does not "touch" any data messages: it only handles group membership changes. Furthermore, this protocol does not use any timeouts. Views and failures are forwarded via broadcast to the rest of the members. Other members accept the view/failure if it is consistent with their current representation of the group's state. Otherwise, the view/failure message is dropped and the sender is suspected of being problematic.
Parameters
 

Properties
 

Sources
 

layers/intra.ml
Generated Events
 

Dn(ECast)
Dn(EFail)
Dn(ESuspect)
Dn(EView)
Testing
 

9.11   LEAVE

This protocol has two tasks. (1) When a member really wants to leave a group, the LEAVE protocol tells the other members to suspect this member. (2) The leave protocol garbage collects old protocol stacks by initiating a Dn(ELeave) after getting an Up(EView) and then getting an Up(EStable) where everything is marked as being stable.
Protocol
 

Both protocols are simple. To leave the group, a member broadcasts a Leave message, which causes the other members to deliver a Dn(ESuspect) event. Note that the other members will get the Leave message only after receiving all of the prior broadcast messages; the leaving member should probably stick around, however, until these messages have stabilized. Garbage collection is done by waiting until all broadcast messages are stable before delivering a local Dn(ELeave) event.
Parameters
 

Properties
 

Sources
 

layers/leave.ml
Generated Events
 

Dn(ELeave)
Testing
 

9.12   MERGE

This protocol provides reliable retransmissions of merge messages and failure detection of remote coordinators when merging.
Protocol
 

A simple retransmission protocol is used. A hash table detects duplicate merge requests, which are dropped.
Parameters
 

Properties
 

Notes
 

Sources
 

layers/merge.ml
Generated Events
 

Up(ESuspect)
Dn(EMerge)
Dn(ETimer)
Testing
 

9.13   MFLOW

This layer implements window-based flow control for multicast messages. Multicast messages from each sender are transmitted only if the sender has send credit remaining. The protocol attempts to avoid situations where all receivers send credit at the same time, so that a sender is not flooded with credit messages.
Protocol
 

Whenever the amount of send credit drops to zero, messages are buffered without being sent. On receipt of acknowledgement credit, the amount of send credit is recalculated and buffered messages are sent based on the new credit.
Parameters
 

Properties
 

Notes
 

Sources
 

layers/mflow.ml
Testing
 

This layer and its documentation were written with Zhen Xiao.

9.14   MNAK

The MNAK (Multicast NAK) layer implements a reliable, agreed, FIFO-ordered broadcast protocol. Broadcast messages from each sender are delivered in FIFO-order at their destinations. Messages from live members are delivered reliably and messages from failed members are retransmitted by the coordinator of the group. When all failed members are marked as such, the protocol guarantees that eventually all live members will have delivered the same set of messages.
Protocol
 

Uses a negative acknowledgement (NAK) protocol: when messages are detected to be out of order (or the NumCast field in an Up(EStable) event indicates missing messages), a NAK is sent. The NAK is sent in one of three ways, chosen in the following order:
  1. Pt2pt to the sender, if the sender is not failed.
  2. Pt2pt to the coordinator, if the receiver is not the coordinator.
  3. Broadcast to the rest of the group if the receiver is the coordinator.
All broadcast messages are buffered until stable.
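The three-way choice can be made concrete with a short OCaml sketch; failed, my_rank, and coord are assumed pieces of layer state, not the actual fields of layers/mnak.ml.

    (* Pick the NAK destination according to the three cases above. *)
    type nak_dest = ToSender | ToCoord | ToGroup

    let nak_destination ~sender ~my_rank ~coord ~failed =
      if not failed.(sender) then ToSender   (* 1: sender still alive *)
      else if my_rank <> coord then ToCoord  (* 2: ask the coordinator *)
      else ToGroup                           (* 3: coordinator rebroadcasts *)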
Parameters
 

Properties
 

Sources
 

layers/mnak.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

9.15   OPTREKEY

This layer is part of the dWGL suite. Together with RealKeys, it implements the dWGL rekeying algorithm. The specific task performed by OptRekey is computing the new group keygraph.

Figure 6: The effect of a leave on the keygraph of a group G of eight members: (a) the initial keygraph; (b) the tree after member p1 leaves; (c) the merged tree. [figure omitted]


Briefly, a keygraph is a graph in which the group members form the leaves and the inner nodes are shared subkeys. A member knows all the keys on the path from itself to the root. The top key is the group key, which is known by all members. For example, Figure 6(a) depicts a group G of eight members {p1 ... p8} and their subkeys. When a member leaves the group, all the keys known to it must be discarded; this splits the keygraph into a set of subtrees. Figure 6(b) shows G after member p1 has left. To re-merge the group keygraph, the subtrees must be merged, as shown in Figure 6(c). A subleader is the leader of a subtree. In our example, member p2 is the leader of {p2}, p3 is the leader of {p3,p4}, and p5 is the leader of {p5,p6,p7,p8}.
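A keygraph of this kind can be modeled with a small OCaml type; this is an illustrative sketch, not the representation in layers/util/tree.ml.

    type key = string
    type keytree =
      | Member of int                  (* leaf: a group member's rank *)
      | Node of key * keytree list     (* inner node: a shared subkey *)

    (* The keys a member knows are those on the path from the root
       down to its leaf; None means the member is not in this subtree. *)
    let rec keys_known m = function
      | Member r -> if r = m then Some [] else None
      | Node (k, children) ->
          let rec scan = function
            | [] -> None
            | c :: rest ->
                (match keys_known m c with
                 | Some ks -> Some (k :: ks)
                 | None -> scan rest)
          in
          scan children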
Protocol
 

This layer is activated upon a Rekey action. The leader receives an ERekeyPrcl event and starts the OptRekey protocol. Typically, a Rekey follows a join or a leave, so the group keygraph is initially fragmented; this layer's task is to re-merge it. The protocol employed is as follows:
  1. The leader multicasts Start.
  2. Subleaders send their keygraphs to the leader.
  3. The leader computes an optimal new keygraph.
  4. The leader multicasts the new keygraph.
  5. Members receive the keygraph and send it up using a ERekeyPrcl event to the RealKeys layer.
Since an optimal keygraph is complex to compute, an auxiliary module is used for this task. Note that OptRekey is designed so that only subleaders participate; in the normal case, where a single member joins or leaves, this includes log_2(n) members. It is possible that a Rekey is initiated even though the membership has not changed. This case is handled specially, since it can be executed with nearly no communication.
Properties
 

Sources
 

layers/optrekey.ml
layers/util/tree.ml,mli
layers/type/tdefs.ml,mli
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

9.16   PERFREKEY

This layer is responsible for common management tasks related to group rekeying. A rekeying layer is situated above PerfRekey; at the time of writing there are four options: Rekey, RealKeys+OptRekey, Rekey_dt, and Rekey_diam. The Rekey layer implements a very simple rekeying protocol; RealKeys and OptRekey together implement the dWGL protocol; Rekey_dt implements a dynamic-tree-based protocol; and Rekey_diam uses a diamond-like graph.
Protocol
 

The layer comes into effect when a Rekey operation is initiated by the user. It is bounced by the Bottom layer as a Rekey event and received at PerfRekey. From this point, the following protocol is used:
  1. The Rekey action is diverted to the leader.
  2. The leader initiates the rekey sequence by passing the request up to Rekey/OptRekey/Rekey_dt/Rekey_diam.
  3. Once rekeying is done, the members pass a RekeyPrcl event with the new group key back down.
  4. PerfRekey logs the new group key. A tree spanning the group is computed, through which acks will propagate; the leaves send Acks up the tree.
  5. When Acks from all the children are received at the leader, it prompts the group for a view change.
In the upcoming view, the new key will be installed. Another rekeying flavor includes a Cleanup stage: every couple of hours, the set of cached secure channels and other keying material should be removed. This prevents an adversary from using cryptanalysis to break the set of symmetric keys in use by the system. To this end, PerfRekey supports an optional cleanup stage prior to the actual rekeying. This sub-protocol works as follows:
  1. The leader multicasts a Cleanup message.
  2. All members remove all their cached key material from all security layers. An ERekeyCleanup event is sent down to Secchan, bounced up to Rekey/OptRekey+RealKeys/.., and bounced back down to PerfRekey.
  3. All members send CleanupOk to the leader through the Ack-tree.
  4. When the leader receives CleanupOk from all the members, it starts the Rekey protocol itself.
By default, cleanup is performed every 24 hours; this is a settable parameter that the application can decide upon. Rekeying may fail due to member failure or due to a merge that occurs during the execution. In this case, the new key is discarded and the old key is kept. PerfRekey supports persistent rekeying: when the 24-hour timeout is over, a rekey will ensue no matter how many failures occur. The Top layer checks that all members in a view are trusted. Any untrusted member is removed from the group through a Suspicion event. Trust is established using the Exchange layer and the user's access control policy.
Properties
 

Parameters
 

Sources
 

layers/perfrekey.ml
Generated Events
 

EPrompt
ERekeyPrcl
Dn(ECast)
Dn(ESend)
Testing
 

9.17   PRIMARY

This layer detects the primary partition of a group. Usually a primary partition has a majority of the members or holds some important resources.
Protocol
 

Upon an Up(EInit) event, a member sends a message to the coordinator, claiming that it is in the current view. When a view has a majority of the members, its coordinator prompts a view change to make itself the primary partition, if it is not already. When a new view is ready, the coordinator decides whether the view is primary and marks it as such.
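The majority test itself is a one-liner. In this sketch, nmembers is the size of the current view and total is the full membership it is measured against; both names are hypothetical.

    (* A view is a candidate primary when it holds more than half of
       the membership. *)
    let in_majority ~nmembers ~total = 2 * nmembers > total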
Parameters
 

Properties
 

Sources
 

layers/primary.ml
Generated Events
 

Dn(EPrompt)
Dn(ESend)
Testing
 

This layer and its documentation were written with Zhen Xiao.

9.18   PT2PT

This layer implements reliable point-to-point message delivery. [TODO: finish this documentation]
Parameters
 

Testing
 

9.19   PT2PTW

This layer implements window-based flow control for point-to-point messages. Point-to-point messages from each sender are transmitted only if the window is not yet full.
Protocol
 

Whenever the amount of send credit drops to zero, messages are buffered without being sent. On receipt of acknowledgement credit, the amount of send credit is recalculated and buffered messages are sent based on the new credit. Acknowledgements are sent whenever a specified threshold is passed.
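The receiver's threshold test can be sketched as follows; all names are hypothetical rather than the actual state of layers/pt2ptw.ml.

    (* Grant fresh credit once the data received since the last ack
       passes a threshold fraction of the window; returns the new
       unacknowledged count. *)
    let maybe_ack ~unacked ~window ~thresh send_ack =
      if float_of_int unacked >= thresh *. float_of_int window then begin
        send_ack unacked;   (* the ack returns credit to the sender *)
        0
      end else
        unacked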
Parameters
 

Properties
 

Notes
 

Sources
 

layers/pt2ptw.ml
Testing
 


9.20   PT2PTWP

This layer implements an adaptive window-based flow-control protocol for point-to-point communication between group members. In this protocol, the receiver's buffer space is shared among all group members: the receiver's window is divided among the senders according to the bandwidth of the data being received from each sender. Sharing the window this way attempts to minimize the number of ack messages, i.e., to increase message efficiency.
Protocol
 

In the following, the term acknowledgement is used in the sense of flow-control protocols, not that of reliable communication protocols.

This protocol uses credit to measure the available buffer space on the receiver's side. Each sender maintains a window for each destination, which bounds the unacknowledged data a process can send point-to-point to that destination. For each message it sends, the process deducts a certain amount of credit based on the size of the message. Messages are transmitted only if the sender has enough credit for them; otherwise, messages are buffered at the sender.

A receiver keeps track of the amount of unacknowledged data it has received from each sender. Whenever it decides to acknowledge a sender, it sends a message containing a new amount of credit for that sender. On receipt of an acknowledgement message, the sender recalculates the amount of credit for this receiver, and buffered messages are sent based on the new credit.

The receiver measures the bandwidth of the data being received from each sender. It starts with zero bandwidth and adjusts it periodically with timeout pt2ptwp_sweep. On receipt of a point-to-point message, the receiver checks whether the sender has passed the threshold of its window, i.e., whether the amount of data in point-to-point messages received from this sender since the last ack was sent to it has exceeded a certain ratio, pt2ptwp_ack_thresh, of the sender's window. If so, an ack with some credit has to be sent to the sender.

To adjust processes' windows according to their bandwidth, the receiver attempts to steal some credit from an appropriate process and add it to the sender's window. The receiver looks for a process with a maximal window/bandwidth ratio, decreases its window by a certain amount of credit, and increases the window of the sender appropriately. Then the receiver sends the sender an ack with the new amount of credit. When the process from which the credit was stolen passes the threshold of its new, smaller window, the receiver sends an ack to it.
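The credit-stealing step can be sketched in OCaml as follows; the arrays are indexed by member rank and all names are hypothetical.

    (* Move 'amount' credit from the member with the largest
       window/bandwidth ratio to the sender being acknowledged. *)
    let steal_credit windows bandwidth sender amount =
      let victim = ref (-1) and best = ref neg_infinity in
      Array.iteri
        (fun i w ->
           if i <> sender then begin
             (* guard against division by a zero bandwidth estimate *)
             let ratio = float_of_int w /. max bandwidth.(i) 1e-9 in
             if ratio > !best then begin best := ratio; victim := i end
           end)
        windows;
      if !victim >= 0 && windows.(!victim) > amount then begin
        windows.(!victim) <- windows.(!victim) - amount;
        windows.(sender)  <- windows.(sender)  + amount
      end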
Parameters
 

Properties
 

Notes
 

Sources
 

layers/pt2ptwp.ml
Testing
 

9.21   REALKEYS

This layer is part of the dWGL suite. Together with OptRekey, it implements the dWGL protocol. This layer's task is to actually carry out the instructions passed to it from OptRekey: generate and securely pass all group subkeys, and finally the group key.
Protocol
 

When a Rekey operation is performed, a complex set of layers and protocols is set into motion. Eventually, each group member receives a new keygraph and a set of instructions describing how to merge its partial keytree with the rest of the group's keytrees to achieve a unified group tree. The head of the keytree is the group key. The instructions are implemented in several stages by the subleaders:
  1. Choose new keys, and send them securely to peer subleaders using secure channels.
  2. Get new keys through secure channels. Disseminate these keys by encrypting them with the top subtree key and sending them pt-2-pt to the leader.
  3. When the leader gets all second-stage messages, it bundles them into a single multicast and sends it to the group.
  4. A member p that receives the multicast extracts the set of keys it should know. Member p creates an ERekeyPrcl event with the new group key attached; the event is sent down to PerfRekey, notifying it that the protocol is complete.
Properties
 

Sources
 

layers/realkeys.ml
layers/type/tdefs.ml,mli
Generated Events
 

ESecureMsg
Dn(ECast)
Dn(ESend)
Testing
 

9.22   REKEY

This layer switches the group key upon request; there may be several reasons for switching the key. The layer relies on the Secchan layer to create secure channels when required. A secure channel is essentially a way to pass confidential information between two endpoints. The Secchan layer creates secure channels on demand and caches them for future use. This allows the new group key to be disseminated efficiently and confidentially through the tree.
Protocol
 

When a member's layer gets an ERekeyPrcl event, it sends a message to the coordinator to start the rekeying process. The coordinator generates a new key and sends it to its children using secure channels; the children pass it down the tree. Once a member receives the new key, it passes it down to PerfRekey using an ERekeyPrcl event. The PerfRekey layer is responsible for collecting acknowledgments from the members and performing a view change with the new key once dissemination is complete.
Parameters
 

Properties
 

Sources
 

layers/rekey.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

This layer was originally written by Mark Hayden with Zhen Xiao. Ohad Rodeh later rewrote the security layers and related infrastructure.

9.23   REKEY_DT

This is the default rekeying layer. The basic data structure used is a tree of secure channels. This tree changes on every view change, hence the name of the layer: Dynamic Tree REKEY.

The basic problem in obtaining efficient rekeying is the high cost of constructing secure channels. A secure channel is established through a two-way handshake based on a Diffie-Hellman exchange. At the time of writing, a Pentium III 500MHz can perform one side of a Diffie-Hellman exchange (using the OpenSSL cryptographic library) in 40 milliseconds; this is a heavyweight operation.

To discuss the set of channels in a group, we view it as a graph where the nodes are group members and the edges are secure channels connecting them. The strategy employed by REKEY_DT is to use a tree graph. When a rekey request is made by a user in some view V, the leader multicasts a tree structure that uses, as much as possible, the existing set of edges. For example, if the view is composed of several previous components, the leader attempts to merge the existing key-trees together. If a single member joins, it is placed as close to the root as possible, for better tree balancing. If a member leaves, the tree may, in the worst case, split into three pieces; the leader fuses them together using (at most) two new secure channels.

The leader chooses a new key and passes it to its children. The key is passed recursively down the tree until it reaches the leaves, and the leaf nodes send acknowledgments back to the leader. This protocol has very good performance. It is even possible that a rekey will not require any new secure channels, for example in the case of a member leave where the node was a tree leaf.
Protocol
 

When a member's layer gets an ERekeyPrcl event, it sends a message to the coordinator to start the rekeying process. The coordinator checks whether the view is composed of a single tree component; if not, it multicasts a Start message. All members that are tree roots send their tree structure to the leader. The leader merges the trees together and multicasts the group tree. It then chooses a new key and sends it down the tree. Once a member receives the new key, it passes it down to PerfRekey using an ERekeyPrcl event. The PerfRekey layer is responsible for collecting acknowledgments from the members and performing a view change with the new key once dissemination is complete.
Sources
 

layers/rekey_dt.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

9.24   REKEY_DIAM

This layer is closely related to REKEY_DT. It employs the same concept of a graph where the nodes are group members and the edges are secure channels connecting them. REKEY_DIAM attempts to improve the efficiency of a rekey after a member leave. The point is to support ACL changes efficiently: if the application decides to change its ACL and remove a member, then the group key must be switched as quickly as possible, since as long as the previous group key is in place, the untrusted member can eavesdrop on group messaging.

The key to a low-latency rekey protocol is the elimination of costly Diffie-Hellman exchanges on its critical path. A simple possibility is to arrange the members in a circle. If a member is removed, the circle is still connected, and confidential information can still pass through it. After the initial rekey, a reconstruction phase is initiated; during that phase, a new circle connecting all surviving members is constructed. The problem with the circle structure is that it has O(n) diameter. Since the diameter determines the latency of the protocol, we require a structure with logarithmic diameter. We use a diamond graph; see the examples in Figure 7.



Figure 7: Examples of diamond graphs [figure omitted]


The protocol handles merges, partitions, and diamond-graph balancing. It guarantees very low latency for the case of a member leave; we clocked it at four milliseconds in 20-member groups.
Sources
 

layers/rekey_diam.ml

9.25   SECCHAN

This layer is responsible for sending and receiving private messages to/from group members. Privacy is guaranteed through the creation and maintenance of secure channels. A secure channel is, essentially, a symmetric key (unrelated to the group key) agreed upon between two members; this key is used to encrypt any confidential message sent between them. We allow layers above Secchan to send/receive confidential information using SecureMsg events. When a SecureMsg(dst,data) event arrives at Secchan, a secure channel to member dst is created (if one does not already exist), and the data is encrypted using the secure channel key and reliably sent to dst. This layer relies on an authentication engine, which is provided in system-independent form by the Auth module; currently, PGP is used for authentication. New random shared keys are generated by the Security module, which also provides hashing and symmetric encryption functions. Currently RC4 is used for encryption and MD5 for hashing.
Protocol
 

A secure channel between members p and q is created using the following basic protocol:
  1. Member p chooses a new random symmetric key k_pq. It creates a ticket for q that includes k_pq, using the Auth module's ticket facility. Essentially, Auth encrypts k_pq with q's public key and signs it using p's private key. Member p then sends the ticket to q.
  2. Member q authenticates and decrypts the message, and sends an acknowledgment (Ack) back to p.
This two-phase protocol is used to prevent the occurrence of a double channel, by which we mean the case where p and q open secure channels to each other at the same time. We augment the Ack phase: q discards p's ticket if:
  1. q has already started opening a channel to p, and
  2. q has a larger name than p.
Secchan also keeps the number of open channels per member below the secchan_cache_size configuration parameter. Regardless, a channel is closed if its lifetime exceeds 8 hours (the settable secchan_ttl parameter). A two-phase protocol is used to close a channel: if members p and q share a channel that p created, then p sends a CloseChan message to q, and q responds with a CloseChanOk message to p. It typically happens that many secure channels are created simultaneously group-wide, for example in the first Rekey of a group. If we tore down all these channels exactly 8 hours from their inception, the group would experience an explosion of management information. To prevent this, we stagger channel tear-down times: upon creation, a channel's maximal lifetime is set to 8 hours + I seconds, where I is a random integer in the range [0 .. secchan_rand]. secchan_rand is set by default to 200 seconds, which we view as enough.
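The staggered lifetime computation amounts to the following sketch, using the secchan_ttl and secchan_rand parameters named above (the function name is hypothetical, with the TTL in seconds).

    (* Base TTL plus a random offset in [0 .. secchan_rand] seconds,
       so channels created together are not torn down together. *)
    let channel_lifetime ~secchan_ttl ~secchan_rand =
      secchan_ttl +. float_of_int (Random.int (secchan_rand + 1))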
Properties
 

Parameters
 

Sources
 

layers/secchan.ml
layers/msecchan.ml
Generated Events
 

EChannelList
ESecureMsg
Dn(ECast)
Dn(ESend)
Testing
 

9.26   SEQUENCER

This layer implements a sequencer based protocol for total ordering.
Protocol
 

One member of the group serves as the sequencer. Any member that wishes to send messages sends them point-to-point to the sequencer. The sequencer delivers each message locally and casts it to the rest of the group; the other members deliver the message as soon as they receive the cast from the sequencer. If a view change occurs, messages are tagged as unordered and are sent as such. When the Up(EView) event arrives, indicating that the group has successfully been flushed, these messages are delivered in a deterministic order everywhere (according to the ranks of their senders, breaking ties using FIFO order).
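The deterministic order applied at a view change can be sketched as a sort; the (rank, seqno, message) tuple layout is hypothetical.

    (* Deliver unordered messages by sender rank, breaking ties by
       FIFO sequence number. *)
    let deliver_unordered msgs deliver =
      let cmp (rank1, seq1, _) (rank2, seq2, _) =
        if rank1 <> rank2 then compare rank1 rank2 else compare seq1 seq2
      in
      List.iter (fun (_, _, msg) -> deliver msg) (List.sort cmp msgs)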
Parameters
 

Properties
 

Sources
 

layers/sequencer.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Testing
 

This layer and its documentation were written by Roy Friedman.

9.27   SLANDER

This protocol is used to share suspicions between the members of a partition. This way, if one member suspects another member of being faulty, the coordinator is informed, so that the faulty member is removed even if the coordinator does not detect the failure itself. This ensures that partitions will occur even in the case of asymmetric network failures. Without this protocol, the member would be removed only once the coordinator itself noticed the fault.
Protocol
 

The protocol works by broadcasting slander messages to the other members whenever a member receives a new Suspect event. On receipt of such a message, Dn(ESuspect) events are generated.
Parameters
 

Properties
 

Sources
 

layers/slander.ml
Generated Events
 

Dn(ESuspect)
Testing
 

This layer and its documentation were written by Zhen Xiao.

9.28   STABLE

This layer tracks the stability of broadcast messages and does failure detection. It keeps track of, and gossips about, an acknowledgement matrix, from which it occasionally computes the number of messages from each member that are stable. It delivers this information in a Dn(EStable) event to the layer below (which will be bounced back up by a layer such as the BOTTOM layer).
Protocol
 

The stability protocol consists of each member keeping track of its view of an acknowledgement matrix. In this matrix, each entry (A,B) corresponds to the number of member B's messages that member A has acknowledged (the diagonal entries (A,A) contain the number of broadcast messages sent by member A). The minimum of column A (disregarding entries for failed members) is the number of broadcast messages from A that are stable; the vector of these minimums is called the stability vector. The maximum of column A (disregarding entries of failed members) is the number of broadcast messages member A has sent that are held by at least one live member; the vector of these maximums is called the NumCast vector [there has got to be a better name].

Occasionally, each member gossips its row to the other members in the group. Occasionally, the protocol layer recomputes the stability and NumCast vectors and delivers them in a Dn(EStable) event.

To prevent a message storm when members gossip their stability vectors, each member adds an initial time-delta to its timer. The deltas are spread between zero and stable_spacing based on member rank. For example, if there are 10 members and stable_spacing is set to 1 second, then the deltas for members zero through nine are: 0.0, 0.1, ..., 0.9.
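The two vector computations follow directly from the matrix definition; in this sketch the names are hypothetical, and rows of failed members are skipped as described above.

    (* acks.(a).(b) = number of b's messages acknowledged by a.
       Column minima give the stability vector; column maxima give
       the NumCast vector. *)
    let stable_vectors acks failed n =
      let mins = Array.make n max_int and maxs = Array.make n 0 in
      for a = 0 to n - 1 do
        if not failed.(a) then
          for b = 0 to n - 1 do
            mins.(b) <- min mins.(b) acks.(a).(b);
            maxs.(b) <- max maxs.(b) acks.(a).(b)
          done
      done;
      (mins, maxs)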
Parameters
 

Properties
 

Notes
 

Sources
 

layers/stable.ml
Generated Events
 

Up(EStable)
Dn(ECast)
Dn(ETimer)
Testing
 

9.29   SUSPECT

This layer regularly pings the other members to check for suspected failures. Suspected failures are announced in a Dn(ESuspect) event to the layer below (which will be bounced back up by a layer such as the BOTTOM layer).
Protocol
 

A simple pinging protocol that uses a sweep interval. On each sweep, Ping messages are broadcast unreliably to the entire group. Also, the number of sweep rounds since the last Ping was received from each other member is checked, and if it exceeds the max_idle threshold, a Dn(ESuspect) event is generated.

To prevent a message storm when members' sweep timers expire, each member adds an initial time-delta to its sweep timer. The deltas are spread between zero and suspect_spacing based on member rank. For example, if there are 10 members and suspect_spacing is set to 1 second, then the deltas for members zero through nine are: 0.0, 0.1, ..., 0.9.
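The stagger computation matches the example above; the function name is hypothetical.

    (* Member 'rank' of 'nmembers' delays its first sweep by
       (rank / nmembers) * spacing; for 10 members and a 1-second
       spacing this yields deltas 0.0, 0.1, ..., 0.9. *)
    let initial_delta ~rank ~nmembers ~spacing =
      float_of_int rank /. float_of_int nmembers *. spacing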
Parameters
 

Properties
 

Notes
 

Sources
 

layers/suspect.ml
Generated Events
 

Dn(ESuspect)
Dn(ECast)
Dn(ETimer)
Testing
 

9.30   SYNC

This layer implements a protocol for blocking a group during view changes. One member initiates the SYNC protocol by delivering a Dn(EBlock) event from above; the other members will receive an Up(EBlock) event. After replying with a Dn(EBlockOk), the layers above the SYNC layer should not broadcast any further messages. Eventually, after all members have responded to the Up(EBlock) and all broadcast messages are stable, the member that delivered the Dn(EBlock) event will receive an Up(EBlockOk) event.
Protocol
 

This protocol is very inefficient and needs to be reimplemented at some point. The Block request is broadcast by the coordinator, and all members respond with another broadcast. When the coordinator gets all the replies, it delivers an Up(EBlockOk) event.
Parameters
 

Properties
 

Sources
 

layers/sync.ml
Generated Events
 

Up(EBlockOk)
Dn(EBlock)
Dn(ECast)
Testing
 

9.31   TOTEM

This layer implements the rotating token protocol for total ordering. (This is a variation on the protocol developed as part of the Totem project.)
Protocol
 

The protocol here is fairly simple: as soon as the stack becomes valid, the lowest-ranked member starts rotating a token around the group. In order to send a message, a process must wait for the token. When the token arrives, all buffered messages are broadcast and the token is passed to the next member; the token must be passed on even if there are no buffered messages. If a view change occurs, messages are tagged as unordered and are sent as such. When the Up(EView) event arrives, indicating that the group has successfully been flushed, these messages are delivered in a deterministic order everywhere (according to the ranks of their senders, breaking ties using FIFO order).
Parameters
 

Properties
 

Sources
 

layers/totem.ml
Generated Events
 

Dn(ECast)
Testing
 

This layer and its documentation were written by Roy Friedman.

9.32   WINDOW

This layer implements window-based flow control based on stability information. Multicast messages from each sender are sent only if the number of unacknowledged messages from the sender is smaller than the window.
Protocol
 

Whenever the number of unstable messages goes above the window, messages are buffered without being sent. On receipt of a stability update, the number of unstable messages is recalculated and buffered messages are sent as allowed by the window.
Parameters
 

Properties
 

Notes
 

Sources
 

layers/window.ml
This layer and its documentation were written by Takako Hickey.

9.33   XFER

This protocol facilitates application-based state transfer. The view structure contains a boolean field xfer_view conveying whether the current view is one where state transfer is taking place (xfer_view = true) or a regular view (xfer_view = false).
Protocol
 

It is assumed that an application initiates state transfer after a view change occurs. In the initial view, xfer_view = true. In a fault-free run, each application sends pt-2-pt and multicast messages according to its state-transfer protocol. Once the application protocol is complete, an XferDone action is sent to Ensemble. This action is caught by the Xfer layer, and each member sends a pt-2-pt XferMsg message to the leader. When the leader has collected an XferMsg from every member, the state transfer is complete, and a new view is installed with the xfer_view field set to false. When faults occur and members fail during the state-transfer protocol, new views are installed with xfer_view set to true. This informs applications that the state transfer was not completed, and they can restart the protocol.
Notes
 

Parameters
 

Properties
 

Sources
 

layers/xfer.ml

9.34   ZBCAST

The ZBCAST layer implements a gossip-style, probabilistically reliable multicast protocol. Unlike most other protocols in Ensemble, this protocol admits a small but non-zero probability of message loss: a message might be garbage-collected even though some operational member of the group has not yet received it. We found that accepting this can offer dramatic improvements in the performance and scalability of the protocol.
Protocol
 

This protocol is composed of two sub-protocols structured roughly as in the Internet MUSE protocol. The first protocol is an unreliable multicast protocol which makes a best-effort attempt to efficiently deliver each message to its destinations. The second protocol is a 2-phase anti-entropy protocol that operates in a series of unsynchronized rounds. During each round, the first phase detects message loss; the second phase corrects such losses and runs only if needed.
Parameters
 

Properties
 

Sources
 

layers/zbcast.ml
Generated Events
 

Dn(ECast)
Dn(ESend)
Up(ELostMessage)
Testing
 

This layer and its documentation were written by Zhen Xiao. It is based on the PBCAST protocol implemented by Mark Hayden. This documentation is based on the Bimodal Multicast paper.

9.35   VSYNC

Virtual synchrony is decomposed into a set of independent protocol layers, listed in Table 3. The layers in this stack are described earlier in this section.

  name      purpose
  LEAVE     reliable group leave
  INTER     inter-group view management
  INTRA     intra-group view management
  ELECT     leader election
  MERGE     reliable group merge
  SYNC      view change synchronization
  PT2PT     FIFO, reliable pt2pt
  SUSPECT   failure suspicions
  STABLE    broadcast stability
  MNAK      FIFO, agreed broadcast
  BOTTOM    bare-bones communication


Table 3: Virtual synchrony protocol stack


[TODO: here describe the overall protocol created by composing all the protocol layers]
Parameters
 

Protocol
 

[TODO: composition of protocols below]
Properties
 

Notes
 

Testing
 

