The Process Group Approach to Reliable Distributed Computing

Notes by Li Li (3/5/98)


Idea

The idea is simple: use process replication to achieve reliability. Getting it right, however, involves many technical and theoretical issues.

Issues involved in Process Group Approach

Theoretical

The group membership (GM) service handles process operations such as joining and leaving a group, and ensures that processes agree on the current membership of the group. In theory, GM cannot be solved in asynchronous systems with even a single crash failure. In practice, under reasonable assumptions about the system, GM can be solved with the weakest failure detector. The properties of this failure detector are: Weak completeness: every process that crashes is eventually suspected by some correct process. Eventual weak accuracy: there is a time after which some correct process is never suspected by any correct process.
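The two properties listed above can be illustrated with a timeout-based detector. The following is a minimal sketch, not the detector used by ISIS: the class name HeartbeatDetector, the timeout-doubling rule, and the caller-supplied clock value `now` are all assumptions made for illustration. A peer is suspected when its heartbeat is overdue (completeness), and each false suspicion enlarges that peer's timeout, so a slow but correct peer stops being suspected once message delays settle (eventual accuracy).

```python
class HeartbeatDetector:
    """Timeout-based failure detector sketch (hypothetical, for illustration)."""

    def __init__(self, peers, timeout=2.0):
        self.timeout = {p: timeout for p in peers}   # per-peer timeout
        self.last_seen = {p: 0.0 for p in peers}     # time of last heartbeat
        self.suspected = set()

    def heartbeat(self, peer, now):
        """Record a heartbeat; a heartbeat from a suspect revokes the suspicion."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            # False suspicion: trust the peer again and be more patient next time.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        """Suspect every peer whose heartbeat is overdue; return the suspects."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


fd = HeartbeatDetector(["p", "q"])
fd.heartbeat("p", now=1.0)
fd.heartbeat("q", now=1.0)
fd.heartbeat("p", now=3.0)
first = fd.check(now=4.0)    # q has been silent for longer than its timeout
fd.heartbeat("q", now=4.5)   # q was only slow, so the suspicion is revoked
second = fd.check(now=5.0)   # nobody is overdue any more
```

In this run q is suspected once and then trusted again with a longer timeout. A real crash would differ only in that no later heartbeat ever arrives, so the suspicion stays.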

Can protect against Heisenbugs

More asynchronous, hence better performance

Cannot be applied directly in a practical setting

Maintaining close synchrony is expensive

Allow code to be developed assuming a simplified, closely synchronous execution model

Support group state and state transfer

Asynchronous, pipelined communication

Treatment of communication, process group membership changes, and failures through a single, event-oriented execution model

Failure handling through a consistently presented system membership list integrated with the communication subsystem

Only allows progress in the primary partition (process group toolkits such as TRANSIS use a partitionable model, so progress can be made in each partition, but this is suited only to specific applications)

Risks incorrectly classifying an operational site or process as faulty
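The primary-partition rule and the misclassification risk noted above can be made concrete with a majority test. This is a sketch under an assumption the notes do not state: that the primary partition is the one holding a strict majority of the previous membership view. The function name is_primary is hypothetical.

```python
def is_primary(previous_view, reachable):
    """Return True if `reachable` holds a strict majority of `previous_view`.

    Assumption (not from the notes): the primary partition is defined by
    strict majority, so at most one partition can be primary.
    """
    survivors = set(previous_view) & set(reachable)
    return 2 * len(survivors) > len(previous_view)


view = {"a", "b", "c", "d", "e"}
side_one = {"a", "b", "c"}   # three of five members
side_two = {"d", "e"}        # two of five members

primary_one = is_primary(view, side_one)   # may install a new view and continue
primary_two = is_primary(view, side_two)   # must block until the partition heals
```

The processes in side_two are operational but cut off. From the primary partition's point of view they are treated as faulty, which is the misclassification risk listed above.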

Practical

Group addressing: Mapping a group address to a membership list. Provided by Group Membership Service. ISIS replicates knowledge of the membership among the members of the group itself.

Message delivery ordering: to keep group members consistent across a series of message deliveries, messages must be ordered. Ordering primitives: ABCAST, CBCAST, etc.
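Causal ordering in the style of CBCAST can be sketched with vector timestamps. The code below is an illustration, not the ISIS implementation: the class name CausalReceiver and its interface are assumptions, and it covers only causal order, not the total order that ABCAST provides. A message from sender j is delivered only when it is the next message expected from j and every message it depends on from other senders has already been delivered.

```python
class CausalReceiver:
    """Delivers broadcast messages in causal order using vector timestamps."""

    def __init__(self, n):
        self.delivered_count = [0] * n   # messages delivered so far, per sender
        self.pending = []                # received but not yet deliverable
        self.delivered = []              # payloads in delivery order

    def _deliverable(self, sender, stamp):
        # Next message from `sender`, and nothing it depends on is missing.
        return stamp[sender] == self.delivered_count[sender] + 1 and all(
            stamp[k] <= self.delivered_count[k]
            for k in range(len(stamp)) if k != sender
        )

    def receive(self, sender, stamp, payload):
        """Buffer the message, then deliver everything that became deliverable."""
        self.pending.append((sender, stamp, payload))
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                s, ts, body = msg
                if self._deliverable(s, ts):
                    self.pending.remove(msg)
                    self.delivered_count[s] += 1
                    self.delivered.append(body)
                    progress = True


# Process 0 broadcasts "question"; process 1 sees it and broadcasts "answer".
# Process 2 receives the two messages in the wrong order.
r = CausalReceiver(3)
r.receive(sender=1, stamp=[1, 1, 0], payload="answer")    # held back
held = list(r.delivered)
r.receive(sender=0, stamp=[1, 0, 0], payload="question")  # releases both
```

After the second call, r.delivered lists "question" before "answer", even though the answer arrived first.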

State transfer: to keep group members consistent during state transfer, a synchronization algorithm must be used.

Fault tolerance: the sender of a message may crash after some, but not all, destinations have received the message. The destinations that do have a copy must then either complete the transmission or discard the message. This can be achieved by protocols such as the three-round reliable multicast.
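The "complete the transmission" branch can be sketched as a flush among the survivors. This is not the three-round protocol named above; it is a simpler stand-in in which every survivor re-sends what it holds, with in-process method calls taking the place of network messages. The names Member and flush_after_crash are hypothetical.

```python
class Member:
    """Group member that keeps received messages so it can re-send them."""

    def __init__(self, name):
        self.name = name
        self.delivered = {}   # message id -> payload

    def receive(self, msg_id, payload):
        # Duplicates are ignored, so re-sending is always safe.
        self.delivered.setdefault(msg_id, payload)


def flush_after_crash(survivors):
    """Make every survivor hold every message that any survivor holds."""
    for holder in survivors:
        for msg_id, payload in list(holder.delivered.items()):
            for other in survivors:
                if other is not holder:
                    other.receive(msg_id, payload)


a, b, c = Member("a"), Member("b"), Member("c")

# The sender crashes after reaching a and b but before reaching c.
a.receive("m1", "update x")
b.receive("m1", "update x")

flush_after_crash([a, b, c])
```

After the flush, c holds the same message as a and b, so the multicast is all-or-nothing among the surviving members. The cost of this simple version is that every survivor re-sends every message it holds; practical protocols track which messages are already known to be received everywhere in order to avoid that.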

ISIS system and applications built around it

Four Styles of Groups: peer groups, client-server groups, diffusion groups and hierarchical groups

Provides a synchronization tool that supports a form of locking, a replication tool for managing replicated data, and a tool for fault-tolerant primary-backup server design.

Brokerage

Database Replication and triggers

NEWS

NMGR

DECEIT

META/LOMITA

SPOOLER/LONG-HAUL FACILITY

Discussion

GM is theoretically impossible to solve in asynchronous systems. How well can we solve it in practice? And how does this relate to the reliability of the process group approach?

Because the process group approach actively and constantly exchanges messages among group members, it is fairly expensive compared to other fault-tolerance approaches such as transactions and checkpointing/logging. If applications need a reliability guarantee, what makes them use the process group approach instead of the other two? And how scalable is this approach?

Are toolkits like ISIS (i.e. the abstractions ISIS provides) the right kind of tools for application developers? What do real-world applications really want from process group toolkits?