A group is an addressing abstraction used to refer to a collection of group members. A group member is a communications endpoint which can originate and deliver messages. Processes (or more precisely, communications endpoints owned by processes) can join or be added to a group, leave a group, or be dropped from a group because of failure. These operations cause the membership of the group to change.
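The membership operations above can be sketched as a small model. This is not the Horus interface: the class `Group` and its methods `join`, `leave`, and `drop` are hypothetical names chosen for illustration, and endpoints are assumed to be plain string identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical model of a group as a set of endpoint identifiers."""
    name: str
    members: set = field(default_factory=set)
    history: list = field(default_factory=list)  # log of membership changes

    def _record(self, op, endpoint):
        # Store the operation together with a snapshot taken after the change.
        self.history.append((op, endpoint, frozenset(self.members)))

    def join(self, endpoint):
        self.members.add(endpoint)
        self._record("join", endpoint)

    def leave(self, endpoint):
        self.members.discard(endpoint)
        self._record("leave", endpoint)

    def drop(self, endpoint):
        # Same effect on membership as leave, but triggered by a failure.
        self.members.discard(endpoint)
        self._record("drop", endpoint)


g = Group("demo")
g.join("a")
g.join("b")
g.drop("a")
```

Each entry in `history` records one membership change, which is the raw material from which a sequence of views could be derived.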
The view of the group is a snapshot of the group membership at a specified point in the execution of a process. As execution proceeds, a member will see group membership changes as a succession of views. Views are reported to the group members concurrently and asynchronously. Thus, at any instant in real time, the group members could have different views of the group. The Horus protocols attempt to deliver the same sequence of views to each member and, if successful, guarantee that each member sees the same set of messages between views. Horus can thus implement a variety of process-group execution models, including the virtual synchrony model first introduced by the Isis toolkit. This model was also adopted by [2] and [16].
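To illustrate how members can hold different views at one instant while still receiving the same sequence, the sketch below models a view as an immutable snapshot. The names `View` and `same_sequence_so_far` are hypothetical, and the sketch assumes both members were present from the first view onward.

```python
from typing import FrozenSet, List, NamedTuple


class View(NamedTuple):
    """Immutable snapshot of group membership (hypothetical representation)."""
    view_id: int
    members: FrozenSet[str]


def same_sequence_so_far(seen_a: List[View], seen_b: List[View]) -> bool:
    """True if one member's list of views is a prefix of the other's."""
    n = min(len(seen_a), len(seen_b))
    return seen_a[:n] == seen_b[:n]


v1 = View(1, frozenset({"a", "b", "c"}))
v2 = View(2, frozenset({"a", "b"}))

seen_by_a = [v1, v2]  # a has already been told about the second view
seen_by_b = [v1]      # b has not yet been told; its current view differs
```

Here `a` and `b` currently hold different views, yet their histories agree as far as both extend, which is the property the protocols aim for.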
Horus can be configured to allow progress during transient failures and network partitions, using a variation of protocols proposed by Transis [1]. When a network partition failure occurs, a single group may split into multiple subgroups: one primary subgroup and one or more non-primary subgroups. Group members in different subgroups will then observe different sequences of views. When the partition is repaired, a non-primary subgroup can heal itself by merging with the primary subgroup. This contrasts with Isis, which only allows the primary partition to continue execution.
The primary partition is usually the majority partition, and is typically defined at the machine level rather than at the member or process level. ``Primaryness'' is detected for sets of machines, and all the groups in a given partition inherit the primaryness attribute of that partition. Horus tracks primaryness and reports the value to members through a primary bit associated with the group view. If an application is programmed to shut down whenever a group view is delivered with the primary bit clear, the behavior is the same as that of Isis.
On the other hand, if an application wishes to tolerate partition failures, it can continue execution in groups for which the bit is cleared. Horus continuously seeks out and attempts to merge partitions. When a partition merge occurs, there are three possible cases:
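A minimal sketch of the primary bit follows. It assumes the common case named above, in which the primary partition is the one holding a strict majority of machines; the text says only that this is usual, so the rule is an assumption. The functions `is_primary` and `on_view` and the `tolerate_partitions` flag are hypothetical and are not part of Horus.

```python
def is_primary(partition_machines, all_machines):
    """Assumed rule: primary means holding a strict majority of machines."""
    return 2 * len(partition_machines) > len(all_machines)


def on_view(members, primary, tolerate_partitions=False):
    """Return the action a hypothetical application takes on a new view."""
    if primary or tolerate_partitions:
        return "continue"
    # Primary bit clear and no partition tolerance: behave as Isis would.
    return "shutdown"


all_machines = {"m1", "m2", "m3", "m4", "m5"}
side_a = {"m1", "m2", "m3"}  # majority side of a partition
side_b = {"m4", "m5"}        # minority side

bit_a = is_primary(side_a, all_machines)
bit_b = is_primary(side_b, all_machines)
```

Because primaryness is computed over machines, every group running in `side_b` would see the same cleared bit, regardless of how many of its own members are there.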
Communication to a group is by group multicast. In the absence of failures, a group multicast is delivered to all group members in the view of the sender. When failures occur, a modification of this rule applies: if a message is delivered to one reachable member, it will be delivered to all reachable members. Specifically, when a process has problems communicating with another process in its view, Horus will attempt to install a new view excluding this member. Horus synchronizes with the other reachable members in the view so that all these members install the same new view. Horus guarantees that if two processes were in one view and agree on installing a new second view, then those two processes will deliver exactly the same set of messages. This is a type of message atomicity called virtually synchronous group addressing in the Isis model. When Horus is configured to support network partitioning, the execution model that results is the extended virtual synchrony model [8][7].
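The delivery guarantee can be stated as a check over per-process event logs. The sketch below is an illustration under stated assumptions, not a Horus protocol: each log is assumed to be a list of `("view", id)` and `("msg", payload)` events, and the function names are hypothetical. It verifies that all processes that installed both a view and its successor delivered the same set of messages in the earlier view.

```python
def delivered_in_view(log, view_id):
    """Messages delivered after installing view_id and before the next view.

    Returns None if the process never installed view_id.
    """
    msgs, current = None, None
    for kind, value in log:
        if kind == "view":
            current = value
            if value == view_id:
                msgs = set()
        elif kind == "msg" and current == view_id:
            msgs.add(value)
    return msgs


def virtually_synchronous(logs, view_id, next_view_id):
    """Check that survivors of view_id into next_view_id agree on messages."""
    survivors = [
        log for log in logs.values()
        if ("view", view_id) in log and ("view", next_view_id) in log
    ]
    sets = [delivered_in_view(log, view_id) for log in survivors]
    return all(s == sets[0] for s in sets)


good = {
    "p": [("view", 1), ("msg", "m1"), ("msg", "m2"), ("view", 2)],
    "q": [("view", 1), ("msg", "m2"), ("msg", "m1"), ("view", 2)],
    "r": [("view", 1), ("msg", "m1")],  # r failed before installing view 2
}
bad = {
    "p": [("view", 1), ("msg", "m1"), ("msg", "m2"), ("view", 2)],
    "q": [("view", 1), ("msg", "m1"), ("view", 2)],  # q missed m2
}
```

The check compares sets rather than sequences, so it says nothing about delivery order, and the failed process `r` is excluded because it never installed the second view.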