A description of the protocol implemented by the layer.
On initialization, the sender informs the receivers how many credits it wants to keep in stock. A receiver sends credits whenever it finds that the sender is low on them, either explicitly through a sender's request or implicitly through its local accounting. A credit is one-time use only. The sender is allowed to send a message only if it has a credit available. If the sender has no credit, the message is buffered; buffered messages are sent when new credits arrive. Credits are piggybacked on data messages whenever there is an opportunity to do so, to save bandwidth.
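The buffering rule can be illustrated with a short OCaml sketch; the names and representation below are hypothetical, not taken from credit.ml. A message goes out only while a credit is available, and newly arrived credits drain the buffer.

    (* Hypothetical sketch of credit-based flow control: a sender may
       transmit only while it holds one-time-use credits; otherwise
       messages are queued and flushed when new credits arrive. *)
    type 'a sender = {
      mutable credits : int;        (* one-time-use send credits *)
      buffer : 'a Queue.t;          (* messages awaiting credit  *)
      transmit : 'a -> unit;        (* lower-layer send          *)
    }

    let create transmit initial_credits =
      { credits = initial_credits; buffer = Queue.create (); transmit }

    (* Send immediately if a credit is available, otherwise buffer. *)
    let send s msg =
      if s.credits > 0 then begin
        s.credits <- s.credits - 1;
        s.transmit msg
      end else
        Queue.push msg s.buffer

    (* Called when the receiver grants more credits: flush the buffer
       as far as the new credits allow. *)
    let recv_credits s n =
      s.credits <- s.credits + n;
      while s.credits > 0 && not (Queue.is_empty s.buffer) do
        s.credits <- s.credits - 1;
        s.transmit (Queue.pop s.buffer)
      done

    let () =
      let s = create (fun m -> print_endline ("sent " ^ m)) 1 in
      send s "a"; send s "b";       (* "a" goes out, "b" is buffered *)
      recv_credits s 1              (* the new credit flushes "b"    *)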
Last updated: Fri Mar 29, 1996
layers/credit.ml
All the messages to be sent are buffered initially. Buffered messages are sent on periodic timeouts that are set based on the sender's rate.
This layer and its documentation were written by Takako Hickey.
layers/rate.ml
None
layers/bottom.ml
Up(EBlock) Up(ECast) Up(EExit) Up(EFail) Up(EStable) Up(EMergeDenied) Up(EMergeGranted) Up(EMergeRequest) Up(ESend) Up(ESuspect) Up(ETimer) Up(EView)
The protocol has two versions: full and compressed vectors. First, we explain the simple version which uses full vectors. Then, we explain how these vectors are compressed.
Each outgoing message is appended with a causal vector. This vector records, for each member of the group, the last message from that member that had been causally delivered when the message was sent. Each received message is checked for deliverability: it may be delivered only if all messages which it causally follows, according to its causal vector, have been delivered. If it is not yet deliverable, it is delayed in the layer until delivery is possible. A view change erases all delayed messages, since they can never become deliverable.
Causal vectors grow with the group size, so they must be compressed in order for this protocol to scale. The compression we use is derived from the Transis system. We demonstrate with an example: assume the membership includes three processes p, q, and r. Process p sends message mp,1; q sends mq,1, causally following mp,1; and r sends mr,1, causally following mq,1. The causal vector for mr,1 is [1|1|1]. There is redundancy in this vector: it is clear that mr,1 follows mr,0, and since mq,1 follows mp,1 we may omit stating that mr,1 follows mp,1. It therefore suffices to state that mr,1 follows mq,1. Using such optimizations, causal vectors may be compressed considerably.
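As an illustration, here is a minimal OCaml sketch of the full-vector deliverability test, using the convention from the example above that a message's own entry in its causal vector is its sequence number. The names and representation are hypothetical, not the layer's actual code.

    (* delivered.(i) counts how many of member i's messages have been
       causally delivered locally. *)
    type msg = {
      sender : int;          (* rank of the sender *)
      causal : int array;    (* causal.(sender) is the message's own seqno;
                                causal.(i), i <> sender, is the last message
                                from i delivered at the sender *)
    }

    let deliverable delivered m =
      delivered.(m.sender) = m.causal.(m.sender) - 1
      && (let ok = ref true in
          Array.iteri
            (fun i c -> if i <> m.sender && delivered.(i) < c then ok := false)
            m.causal;
          !ok)

    let deliver delivered m =
      delivered.(m.sender) <- m.causal.(m.sender)

    let () =
      (* Members p=0, q=1, r=2.  mr,1 carries the causal vector [1|1|1]. *)
      let delivered = [| 0; 0; 0 |] in
      let mr1 = { sender = 2; causal = [| 1; 1; 1 |] } in
      Printf.printf "before: %b\n" (deliverable delivered mr1);   (* false *)
      deliver delivered { sender = 0; causal = [| 1; 0; 0 |] };   (* mp,1 *)
      deliver delivered { sender = 1; causal = [| 1; 1; 0 |] };   (* mq,1 *)
      Printf.printf "after:  %b\n" (deliverable delivered mr1)    (* true *)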
layers/causal.ml
When a member suspects all lower ranked members of being faulty, that member elects itself as coordinator.
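A minimal OCaml sketch of this rule, assuming members are identified by rank and suspicions are kept in a boolean array (a hypothetical representation, not elect.ml's actual state):

    (* A member elects itself coordinator once every lower-ranked member
       is suspected of being faulty. *)
    let should_elect ~my_rank ~suspected =
      let rec all_lower r = r >= my_rank || (suspected.(r) && all_lower (r + 1)) in
      all_lower 0

    let () =
      let suspected = [| true; true; false; false |] in
      Printf.printf "rank 2 elects itself: %b\n" (should_elect ~my_rank:2 ~suspected);
      Printf.printf "rank 3 elects itself: %b\n" (should_elect ~my_rank:3 ~suspected)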
layers/elect.ml
Dn(EElect)
Does chained encryption on the message payload in the iov field of events. Each member keeps track of the encryption state for all incoming and outgoing point-to-point and multicast channels. Messages marked Unreliable are not encrypted (these should not be application messages).
layers/encrypt.ml
None
The coordinator occasionally broadcasts the existence of this partition via Dn(EGossipExt) events. These are delivered unreliably to coordinators of other partitions. If a coordinator decides to merge partitions, then it prompts a view change and inserts the name of the remote coordinator in the Up(EBlockOk) event. The INTER protocol takes over from there. Merge cycles are prevented by only allowing merges to be made from smaller view ids to larger view ids.
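The cycle-prevention rule amounts to comparing view ids. A small OCaml sketch, assuming a view id is a (logical time, coordinator name) pair ordered lexicographically (a hypothetical representation, not heal.ml's actual types):

    type view_id = { ltime : int; coord : string }

    let compare_vid a b =
      match compare a.ltime b.ltime with
      | 0 -> compare a.coord b.coord
      | c -> c

    (* A coordinator only initiates a merge toward a strictly larger view id. *)
    let may_merge ~local ~remote = compare_vid local remote < 0

    let () =
      let a = { ltime = 3; coord = "p" } and b = { ltime = 5; coord = "q" } in
      Printf.printf "a may merge into b: %b\n" (may_merge ~local:a ~remote:b);
      Printf.printf "b may merge into a: %b\n" (may_merge ~local:b ~remote:a)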
layers/heal.ml
Up(EPrompt) Dn(EGossipExt)
Group merges are the more complicated part of the group membership protocol. However, we constrain the problem so that:
- Groups cannot be both merging and accepting mergers at the same time. This eliminates the potential for cycles in the ``merge-graph.''
- A view (i.e. view_id) can only attempt to merge once, and only if no failures have occurred. Each merge attempt is therefore uniquely identified by the view_id of the merging group. Note also that requiring no failures to have occurred for a merge to happen prevents a member from being failed in one view and then reappearing in the next view. There has to be an intermediate view without the failed member: this is a desirable property.
If the merging coordinator times out on the remote coordinator, it immediately installs a new view in its partition (without the other members even finding out about the merge attempt). The merge protocol works as follows:
- The merging coordinator blocks its group.
- The merging coordinator sends a merge request to the remote group's coordinator.
- The remote coordinator blocks its group.
- The remote coordinator installs a new view (with the mergers in it) and sends the view to the merging coordinator (through a merge-granted message).
- The merging coordinator installs the view in its group.
layers/inter.ml
Dn(EMerge) Dn(EMergeDenied) Dn(ESuspect)
This is a relatively simple group membership protocol. We have done our best to resist the temptation to ``optimize'' special cases under which the group is ``unnecessarily'' partitioned. We also constrain the conditions under which operations such as merges can occur. The implementation does not ``touch'' any data messages: it only handles group membership changes. Furthermore, this protocol does not use any timeouts.
Views and failures are forwarded via broadcast to the rest of the members. Other members accept the view/failure if they are consistent with their current representation of the group's state. Otherwise, the view/failure message is dropped and the sender is suspected of being problematic.
layers/intra.ml
Dn(ECast) Dn(EFail) Dn(ESuspect) Dn(EView)
Both protocols are simple.
For leaving the group, a member broadcasts a Leave message to the group, which causes the other members to deliver a Dn(ESuspect) event. Note that the other members will get the Leave message only after receiving all the prior broadcast messages. This member should probably stick around, however, until these messages have stabilized.
Garbage collection is done by waiting until all broadcast messages are stable before delivering a local Dn(ELeave) event.
layers/leave.ml
Dn(ELeave)
Simple retransmission protocol. A hash table is used to detect duplicate merge requests, which are dropped.
layers/merge.ml
Up(ESuspect) Dn(EMerge) Dn(ETimer)
Whenever the amount of send credit drops to zero, messages are buffered without being sent. On receipt of an acknowledgement with credit, the amount of send credit is recalculated and buffered messages are sent based on the new credit.
layers/mflow.ml
Uses a negative acknowledgment (NAK) protocol: when messages are detected to be out of order (or the NumCast field in an Up(EStable) event indicates missing messages), a NAK is sent. All broadcast messages are buffered until they are stable. The NAK is sent in one of three ways, chosen in the following order (see the sketch after this list):
- Pt2pt to the sender, if the sender is not failed.
- Pt2pt to the coordinator, if the receiver is not the coordinator.
- Broadcast to the rest of the group if the receiver is the coordinator.
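The following OCaml sketch shows the destination choice only, under the hypothetical assumption that members are identified by rank and failures are recorded in a boolean array; it is not the layer's actual code.

    type nak_dest = ToSender of int | ToCoord of int | ToAll

    (* Choose where to send the retransmission request, in the order
       listed above: the original sender if still live, otherwise the
       coordinator, otherwise (when we are the coordinator) everyone. *)
    let nak_destination ~sender ~failed ~my_rank ~coord_rank =
      if not failed.(sender) then ToSender sender
      else if my_rank <> coord_rank then ToCoord coord_rank
      else ToAll

    let () =
      let failed = [| false; true; false |] in
      match nak_destination ~sender:1 ~failed ~my_rank:2 ~coord_rank:0 with
      | ToSender r -> Printf.printf "pt2pt NAK to sender %d\n" r
      | ToCoord r  -> Printf.printf "pt2pt NAK to coordinator %d\n" r
      | ToAll      -> print_endline "broadcast NAK to the group"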
layers/mnak.ml
Dn(ECast) Dn(ESend)
Briefly, a keygraph is a graph where all group members form the leaves, and the inner nodes are shared subkeys. A member knows all the keys on the route from itself to the root. The top key is the group key, and it is known by all members. For example, Figure 6(a) depicts a group G of eight members {p1 ... p8} and their subkeys. When a member leaves the group, all the keys known to it must be discarded. This splits the group into a set of subtrees. Figure 6(b) shows G after member p1 has left. In order to re-merge the group keygraph, the subtrees should be merged. This can be seen in Figure 6(c). A subleader is the leader of a subtree. In our example, member p2 is the leader of {p2}, p3 is the leader of {p3,p4}, and p5 is the leader of {p5,p6,p7,p8}.
Figure 6: The effect of leave on the key-graph of a group G of eight members. (a) The initial keygraph. (b) The tree after member p1 leaves. (c) The merged tree.
This layer is activated upon a Rekey action. The leader receives an ERekeyPrcl event and starts the OptRekey protocol. Typically, a Rekey follows a join or a leave, so the group keygraph is initially fragmented; this layer's task is to remerge it.
An optimal keygraph is complex to compute, so an auxiliary module is used for this task. Note that OptRekey is designed so that only subleaders participate; in the normal case, where a single member joins or leaves, this will include log2(n) members. The protocol employed is as follows:
- The leader multicasts Start.
- Subleaders send their keygraphs to the leader.
- The leader computes an optimal new keygraph.
- The leader multicasts the new keygraph.
- Members receive the keygraph and send it up using an ERekeyPrcl event to the RealKeys layer.
It is possible that a Rekey will be initiated even though the membership hasn't changed. This case is handled specially, since it can be executed with nearly no communication.
layers/optrekey.ml layers/util/tree.ml,mli layers/type/tdefs.ml,mli
Dn(ECast) Dn(ESend)
The layer comes into effect when a Rekey operation is initiated by the user. It is bounced by the Bottom layer as a Rekey event and received at PerfRekey. From this point, the following protocol is used; the new key is installed in the upcoming view:
- The Rekey action is diverted to the leader.
- The leader initiates the rekey sequence by passing the request up to Rekey/OptRekey/Rekey_dt/Rekey_diam.
- Once rekeying is done, the members pass a RekeyPrcl event with the new group key back down.
- PerfRekey logs the new group key. A tree spanning the group is computed, through which acks will propagate. The leaves send Acks up the tree.
- When Acks from all the children are received at the leader, it prompts the group for a view change.
Another rekeying flavor includes a Cleanup stage. Every few hours, the set of cached secure channels and other keying material should be removed. This prevents an adversary from using cryptanalysis to break the set of symmetric keys in use by the system. To this end, PerfRekey supports an optional cleanup stage prior to the actual rekeying. By default, cleanup is performed every 24 hours; this is a settable parameter that the application can decide upon. The cleanup sub-protocol works as follows:
- The leader multicasts a Cleanup message.
- All members remove all their cached key material from all security layers. An ERekeyCleanup event is sent down to Secchan, bounced up to Rekey/OptRekey+RealKeys/.., and bounced back down to PerfRekey.
- All members send CleanupOk to the leader through the Ack-tree.
- When the leader receives CleanupOk from all the members, it starts the Rekey protocol itself.
Rekeying may fail due to member failure or due to a merge that occurs during the execution. In this case, the new key is discarded and the old key is kept. PerfRekey supports persistent rekeying: when the 24-hour timeout expires, a rekey will ensue no matter how many failures occur.
The Top layer checks that all members in a view are trusted. Any untrusted member is removed from the group through a Suspicion event. Trust is established using the Exchange layer and the user's access control policy.
layers/perfrekey.ml
EPrompt ERekeyPrcl Dn(ECast) Dn(ESend)
Upon an Up(EInit) event, a member sends a message to the coordinator, claiming that it is in the current view. When a view contains a majority of the members, its coordinator prompts a view change to make itself the primary partition, if it is not already. When a new view is ready, the coordinator decides whether it is primary and marks it accordingly.
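The majority test itself is simple arithmetic. A sketch, assuming the size of the full membership is known (hypothetical names):

    (* The coordinator may declare its partition primary only when the
       view holds a strict majority of the full membership. *)
    let has_majority ~view_size ~total_members = 2 * view_size > total_members

    let () =
      Printf.printf "3 of 5 can be primary: %b\n" (has_majority ~view_size:3 ~total_members:5);
      Printf.printf "2 of 5 can be primary: %b\n" (has_majority ~view_size:2 ~total_members:5)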
layers/primary.ml
Dn(EPrompt) Dn(ESend)
Whenever the amount of send credit drops to zero, messages are buffered without being sent. On receipt of an acknowledgement with credit, the amount of send credit is recalculated and buffered messages are sent based on the new credit. Acknowledgements are sent whenever a specified threshold is passed.
layers/pt2ptw.ml
In the following, the term acknowledgement is used with the meaning of flow control protocols and not that of reliable communication protocols.
This protocol uses credits to measure the available buffer space at the receiver's side. Each sender maintains a window per destination, which is used to bound the unacknowledged data a process can send point-to-point to the given destination. For each message it sends, the process deducts a certain amount of credit based on the size of the message. Messages are transmitted only if the sender has enough credit for them. Otherwise, messages are buffered at the sender.
A receiver keeps track of the amount of unacknowledged data it has received from each sender. Whenever it decides to acknowledge a sender, it sends a message containing a new amount of credit for that sender. On receipt of an acknowledgement message, the sender recalculates the amount of credit for this receiver, and buffered messages are sent based on the new credit.
The receiver measures the bandwidth of the data being received from each sender. It starts with zero bandwidth, and adjusts it periodically with timeout pt2ptwp_sweep.
On receipt of a point-to-point message, the receiver checks whether the sender has passed the threshold of its window, i.e., whether the amount of data in point-to-point messages received from this sender since the last ack was sent to it exceeds a certain ratio, pt2ptwp_ack_thresh, of the sender's window. If so, an ack with some credit has to be sent to the sender. In order to adjust processes' windows according to their bandwidth, the receiver attempts to steal some credit from an appropriate process and add it to the sender's window. The receiver looks for a process with the maximal window/bandwidth ratio, decreases its window by a certain amount of credit, and increases the window of the sender accordingly. Then the receiver sends the sender an ack with the new amount of credit. When the process from which the credit was stolen passes the threshold of its new, smaller window, the receiver sends an ack to it.
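A minimal OCaml sketch of the credit-stealing step, with hypothetical names and constants; the real layer is driven by the pt2ptwp_ack_thresh and sweep parameters mentioned above.

    type peer = {
      mutable window    : int;   (* credit currently granted to this peer *)
      mutable bandwidth : float; (* measured receive bandwidth            *)
      mutable unacked   : int;   (* data received since the last ack      *)
    }

    let ack_thresh = 0.5         (* fraction of the window (hypothetical)      *)
    let steal_amount = 1024      (* credit moved per adjustment (hypothetical) *)

    let needs_ack p = float_of_int p.unacked > ack_thresh *. float_of_int p.window

    (* Pick the peer (other than [sender]) with the largest
       window/bandwidth ratio; guard against idle peers. *)
    let victim peers sender =
      let ratio p = float_of_int p.window /. max p.bandwidth 1.0 in
      let best = ref None in
      Array.iteri (fun i p ->
        if i <> sender then
          match !best with
          | Some (_, r) when r >= ratio p -> ()
          | _ -> best := Some (i, ratio p)) peers;
      !best

    let on_receive peers sender size =
      let s = peers.(sender) in
      s.unacked <- s.unacked + size;
      if needs_ack s then begin
        (match victim peers sender with
         | Some (v, _) ->
             peers.(v).window <- peers.(v).window - steal_amount;
             s.window <- s.window + steal_amount
         | None -> ());
        s.unacked <- 0           (* an ack carrying the new credit goes out here *)
      end

    let () =
      let mk w b = { window = w; bandwidth = b; unacked = 0 } in
      let peers = [| mk 8192 10.0; mk 8192 2000.0; mk 8192 500.0 |] in
      on_receive peers 1 5000;   (* peer 1 passes the threshold; credit is
                                    stolen from peer 0, the slowest sender *)
      Printf.printf "windows: %d %d %d\n"
        peers.(0).window peers.(1).window peers.(2).window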
layers/pt2ptwp.ml
When a Rekey operation is performed, a complex set of layers and protocols is set into motion. Eventually, each group member receives a new keygraph and a set of instructions describing how to merge its partial keytree with the rest of the group keytrees to achieve a unified group tree. The head of the keytree is the group key.
The instructions are implemented in several stages by the subleaders:
- Choose new keys, and send them securely to peer subleaders using secure channels.
- Get new keys through secure channels. Disseminate these keys by encrypting them with the top subtree key, and sending pt-2-pt to the leader.
- When the leader gets all 2nd-stage messages, it bundles them into a single multicast and sends it to the group.
- A member p that receives the multicast extracts the set of keys it should know. Member p creates an ERekeyPrcl event with the new group key attached. The event is sent down to PerfRekey, notifying it that the protocol is complete.
layers/realkeys.ml layers/type/tdefs.ml,mli
ESecureMsg Dn(ECast) Dn(ESend)
When a member layer gets an ERekeyPrcl event, it sends a message to the coordinator to start the rekeying process. The coordinator generates a new key and sends it to its children using secure channels. The children pass it down the tree. Once a member receives the new key, it passes it down to PerfRekey using an ERekeyPrcl event.
The PerfRekey layer is responsible for collecting acknowledgments from the members and performing a view change with the new key once dissemination is complete.
layers/rekey.ml
Dn(ECast) Dn(ESend)
When a member layer gets an ERekeyPrcl event, it sends a message to the coordinator to start the rekeying process. The coordinator checks whether the view is composed of a single tree component. If not, it multicasts a Start message. All members that are tree roots send their tree structures to the leader. The leader merges the trees together and multicasts the group tree. It then chooses a new key and sends it down the tree.
Once a member receives the new key, it passes it down to PerfRekey using an ERekeyPrcl event.
The PerfRekey layer is responsible for collecting acknowledgments from the members and performing a view change with the new key once dissemination is complete.
layers/rekey_dt.ml
Dn(ECast) Dn(ESend)
The protocol handles merges, partitions, and diamond-graph balancing. It guarantees very low latency for the case of member leave; we clocked it at four milliseconds in 20-member groups.
Figure 7: Examples of diamonds
layers/rekey_diam.ml
A secure channel between members p and q is created using the following basic protocol:
- Member p chooses a new random symmetric key kpq. It creates a ticket for q that includes kpq using the Auth module ticket facility. Essentially, Auth encrypts kpq with q's public key and signs it using p's private key. Member p then sends the ticket to q.
- Member q authenticates and decrypts the message, and sends an acknowledgment (Ack) back to p.
This two-phase protocol is used to prevent the occurrence of a double channel, by which we mean the case where p and q open secure channels to each other at the same time. We augment the Ack phase: q discards p's ticket if:
- q has already started opening a channel to p
- q has a larger name than p.
Secchan also keeps the number of open channels per member below the secchan_cache_size configuration parameter. Regardless, a channel is closed if its lifetime exceeds 8 hours (the settable secchan_ttl parameter). A two-phase protocol is used to close a channel. If members p and q share a channel, and assuming p created it, then p sends a CloseChan message to q. Member q responds by sending a CloseChanOk to p.
It typically happens that many secure channels are created simultaneously group-wide, for example in the first Rekey of a group. If we tear down all these channels exactly 8 hours from their inception, the group will experience an explosion of management information. To prevent this, we stagger channel tear-down times. Upon creation, a channel's maximal lifetime is set to 8 hours + I seconds, where I is a random integer in the range [0..secchan_rand]. secchan_rand is set by default to 200 seconds, which we view as sufficient.
layers/secchan.ml layers/msecchan.ml
EChannelList ESecureMsg Dn(ECast) Dn(ESend)
One member of the group serves as the sequencer. Any member that wishes to send messages sends them point-to-point to the sequencer. The sequencer then delivers each message locally and casts it to the rest of the group. Other members deliver a message as soon as they receive the corresponding cast from the sequencer.
If a view change occurs, messages are tagged as unordered and are sent as such. When the Up(EView) event arrives, indicating that the group has successfully been flushed, these messages are delivered in a deterministic order everywhere (according to the ranks of their senders, breaking ties using FIFO).
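The normal-case message flow can be sketched in OCaml as follows, with hypothetical names; the actual layer expresses these steps as Ensemble events rather than the plain values used here.

    type 'a action =
      | Deliver of 'a          (* hand the message to the application *)
      | Pt2pt of int * 'a      (* point-to-point to the given rank    *)
      | Cast of 'a             (* broadcast to the group              *)

    (* An application send at rank [my_rank]. *)
    let on_send ~my_rank ~seq_rank msg =
      if my_rank = seq_rank then [ Deliver msg; Cast msg ]
      else [ Pt2pt (seq_rank, msg) ]

    (* A forwarded message arriving point-to-point at the sequencer. *)
    let on_forward msg = [ Deliver msg; Cast msg ]

    (* The sequencer's cast arriving at a non-sequencer member. *)
    let on_cast msg = [ Deliver msg ]

    let () =
      let show = function
        | Deliver m    -> "deliver " ^ m
        | Pt2pt (r, m) -> Printf.sprintf "pt2pt to %d: %s" r m
        | Cast m       -> "cast " ^ m
      in
      let run acts = List.iter (fun a -> print_endline (show a)) acts in
      run (on_send ~my_rank:2 ~seq_rank:0 "hello");  (* member 2 forwards *)
      run (on_forward "hello");                      (* sequencer orders  *)
      run (on_cast "hello")                          (* member 2 delivers *)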
layers/sequencer.ml
Dn(ECast) Dn(ESend)
The protocol works by broadcasting slander messages to other members whenever it receives a new Suspect event. On receipt of such a message, Dn(ESuspect) events are generated.
layers/slander.ml
Dn(ESuspect)
The stability protocol consists of each member keeping track of its view of an acknowledgement matrix. In this matrix, each entry (A,B) corresponds to the number of member B's messages that member A has acknowledged (the diagonal entries (A,A) contain the number of broadcast messages sent by member A). The minimum of column A (disregarding entries for failed members) is the number of broadcast messages from A that are stable. The vector of these minimums is called the stability vector. The maximum of column A (disregarding entries of failed members) is the number of broadcast messages member A has sent that are held by at least one live member. The vector of the maximums is called the NumCast vector [there has got to be a better name]. Occasionally, each member gossips its row to the other members in the group. Occasionally, the protocol layer recomputes the stability and NumCast vectors and delivers them up in an Up(EStable) event.
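A sketch of the column-minimum/column-maximum computation, using a plain integer matrix and a boolean failure array (a hypothetical representation of the acknowledgement matrix):

    (* acks.(a).(b) = number of b's broadcasts acknowledged by member a;
       acks.(a).(a) = number of broadcasts a itself has sent.  Column
       minima over live members give the stability vector, column maxima
       give the NumCast vector. *)
    let stability_vectors acks failed =
      let n = Array.length acks in
      let stable = Array.make n max_int and numcast = Array.make n 0 in
      for col = 0 to n - 1 do
        for row = 0 to n - 1 do
          if not failed.(row) then begin
            stable.(col) <- min stable.(col) acks.(row).(col);
            numcast.(col) <- max numcast.(col) acks.(row).(col)
          end
        done
      done;
      (stable, numcast)

    let () =
      let acks = [| [| 5; 2; 3 |];
                    [| 4; 2; 3 |];
                    [| 5; 1; 3 |] |] in
      let failed = [| false; false; false |] in
      let stable, numcast = stability_vectors acks failed in
      Array.iter (Printf.printf "%d ") stable;  print_newline ();  (* 4 1 3 *)
      Array.iter (Printf.printf "%d ") numcast; print_newline ()   (* 5 2 3 *)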
layers/stable.ml
Up(EStable) Dn(ECast) Dn(ETimer)
Simple pinging protocol. Uses a sweep interval. On each sweep, Ping messages are broadcast unreliably to the entire group. Also, the number of sweep rounds since the last Ping was received from each other member is checked, and if it exceeds the max_idle threshold then a Dn(ESuspect) event is generated.
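A minimal OCaml sketch of the sweep and idle-count bookkeeping, with hypothetical names; in the real layer the resulting suspicions are carried in a Dn(ESuspect) event.

    let max_idle = 3             (* sweeps allowed without hearing a Ping *)

    (* idle.(r) = sweeps since we last heard a Ping from rank r. *)
    let on_sweep ~my_rank idle =
      Array.iteri (fun r _ -> if r <> my_rank then idle.(r) <- idle.(r) + 1) idle;
      let suspects = ref [] in
      Array.iteri
        (fun r n -> if r <> my_rank && n > max_idle then suspects := r :: !suspects)
        idle;
      List.rev !suspects          (* ranks to suspect on this sweep *)

    let on_ping idle ~from = idle.(from) <- 0

    let () =
      let idle = [| 0; 0; 0 |] in
      for _ = 1 to 4 do ignore (on_sweep ~my_rank:0 idle) done;
      on_ping idle ~from:1;       (* member 1 is heard from; member 2 is not *)
      let suspects = on_sweep ~my_rank:0 idle in
      List.iter (Printf.printf "suspect rank %d\n") suspects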
layers/suspect.ml
Dn(ESuspect) Dn(ECast) Dn(ETimer)
This protocol is very inefficient and needs to be reimplemented at some point. The Block request is broadcast by the coordinator. All members respond with another broadcast. When the coordinator gets all replies, it delivers up an Up(EBlockOk) event.
layers/sync.ml
Up(EBlockOk) Dn(EBlock) Dn(ECast)
The protocol here is fairly simple: As soon as the stack becomes valid, the lowest ranked member starts rotating a token in the group. In order to send a message, a process must wait for the token. When the token arrives, all buffered messages are broadcast, and the token is passed to the next member. The token must be passed on even if there are no buffered messages.
If a view change occurs, messages are tagged as unordered and are sent as such. When the Up(EView) event arrives, indicating that the group has successfully been flushed, these messages are delivered in a deterministic order everywhere (according to the ranks of their senders, breaking ties using FIFO).
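The token rule for the normal case can be sketched as follows (hypothetical names; the real layer expresses the broadcasts and the token transfer as Ensemble events):

    type 'a member = {
      rank : int;
      buffer : 'a Queue.t;              (* messages waiting for the token *)
      cast : 'a -> unit;                (* broadcast to the group         *)
      send_token : int -> unit;         (* pass the token to a rank       *)
    }

    (* On receiving the token: flush all buffered broadcasts, then pass
       the token on even if nothing was sent. *)
    let on_token m ~nmembers =
      while not (Queue.is_empty m.buffer) do
        m.cast (Queue.pop m.buffer)
      done;
      m.send_token ((m.rank + 1) mod nmembers)

    let () =
      let m = { rank = 1;
                buffer = Queue.create ();
                cast = (fun s -> print_endline ("cast " ^ s));
                send_token = Printf.printf "token passed to rank %d\n" } in
      Queue.push "hello" m.buffer;
      on_token m ~nmembers:3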
layers/totem.ml
Dn(ECast)
Whenever the number of unstable messages goes above the window, messages are buffered without being sent. On receipt of a stability update, the number of unstable messages is recalculated and buffered messages are sent as allowed by the window.
This layer and its documentation were written by Takako Hickey.
layers/window.ml
It is assumed that an application initiates state transfer after a view change occurs. In the initial view, xfer_view = true. In a fault-free run, each application sends pt-2-pt and multicast messages according to its state-transfer protocol. Once the application protocol is complete, an XferDone action is sent to Ensemble. This action is caught by the Xfer layer, where each member sends a pt-2-pt message XferMsg to the leader. When the leader collects XferMsg from all members, the state transfer is complete, and a new view is installed with the xfer_view field set to false. When faults occur and members fail during the state-transfer protocol, new views are installed with xfer_view set to true. This informs applications that state transfer was not completed, and they can restart the protocol.
layers/xfer.ml
This protocol is composed of two sub-protocols structured roughly as in the Internet MUSE protocol. The first protocol is an unreliable multicast protocol which makes a best-effort attempt to efficiently deliver each message to its destinations. The second protocol is a 2-phase anti-entropy protocol that operates in a series of unsynchronized rounds. During each round, the first phase detects message loss; the second phase corrects such losses and runs only if needed.
layers/zbcast.ml
Dn(ECast) Dn(ESend) Up(ELostMessage)
[TODO: here describe the overall protocol created by composing all the protocol layers]
name     purpose
LEAVE    reliable group leave
INTER    inter-group view management
INTRA    intra-group view management
ELECT    leader election
MERGE    reliable group merge
SYNC     view change synchronization
PT2PT    FIFO, reliable pt2pt
SUSPECT  failure suspicions
STABLE   broadcast stability
MNAK     FIFO, agreed broadcast
BOTTOM   bare-bones communication
Table 3: Virtual synchrony protocol stack
[TODO: composition of protocols below]