----STARTOFSESSION--------------------

Indranil's scribed notes for:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Process Group approach to reliable Distributed Computing

Kenneth P. Birman,

CACM, Dec. 1993, vol. 36, no. 12

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- reviews 10 years of research on ISIS, a system that provides tools to support the construction of reliable distributed software.

- thesis: development of reliable distributed software can be simplified using process groups and group programming tools.

- goals of this article: describe the system, the approach taken, and experiences with real-life apps

ex. brokerage and trading systems:

- information backplane: ability to publish and subscribe mesgs
- customization of system
- hierarchical structure
- lan and wan

reliability in this setting ==> replicating the data

most systems (modern telecommunications systems, etc.) have the same issues - thus needed: a technology that makes it easy to solve these problems.

isis' way of solving these problems: distributed groups of cooperating programs. in fact publish-subscribe => groups. process groups appear in all sorts of apps, but without system support, so the burden falls on the programmer -> bad.
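The "publish-subscribe => groups" remark can be made concrete with a small sketch: each subject is treated as a process group, subscribing is joining the group, and publishing is a multicast to the group's current members. The `Backplane` class, its method names and the subject strings are hypothetical and in-process only; this is not the Isis interface.

```python
from collections import defaultdict

class Backplane:
    """Toy information backplane: one group per subject (hypothetical API)."""

    def __init__(self):
        # subject -> list of member callbacks, in join order
        self.groups = defaultdict(list)

    def subscribe(self, subject, callback):
        """Subscribing to a subject is modelled as joining that subject's group."""
        self.groups[subject].append(callback)

    def publish(self, subject, mesg):
        """Publishing is modelled as a multicast to every current group member."""
        members = list(self.groups[subject])
        for deliver in members:
            deliver(mesg)
        return len(members)

bp = Backplane()
seen_a, seen_b = [], []
bp.subscribe("IBM.quote", seen_a.append)
bp.subscribe("IBM.quote", seen_b.append)
bp.subscribe("DEC.quote", seen_b.append)
reached = bp.publish("IBM.quote", 101.5)   # reaches both subscribers
bp.publish("DEC.quote", 33.0)              # reaches only the second one
```

The publisher never names its receivers, which is the anonymous style of group described in the next section.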

Process groups:

~~~~~~~~~~~~~

two styles of process groups:

1. anonymous groups

should provide the following properties:
- send (gp addr)
- all or none, exactly once delivery of mesgs
- mesg delivery order
- history (of mesgs, events) consistently reflected in current state (across all processes in group)

2. explicit groups

members cooperate directly; employ algos that use lists of members, relative rankings in list etc.

additional needs: membership change needs to be published to group. membership seen by all group members needs to be consistent.

Thus, technical problems to be considered in sw for process groups:

- support for group communication (addressing, failure atomicity, mesg delivery order)
- use of gp membership as input (to a distbd algo)
- synchronization

integrating the solns to these problems: virtual synchrony (VSynch)

Building distributed services over conventional technologies: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

conventional mesg passing technologies:

---------------------------------------

1. unreliable datagrams
2. rpc - sender unable to distinguish cause when failure occurs
3. reliable data streams - generally outperform rpc for large data, but same problem as rpc. fact - conditions under which a stream breaks are defined by the stds ==> pb.

assumed system: wan made up of lans; all kinds of mesg errors are possible within a lan

failure model:

--------------

fail-crash

transient failure - tension between accurate and timely failure detection. soln: once a process appears to have halted (even only 'somewhat') it is thrown out of the group along with all the data it recorded. rejoin == new entry. [fail-stop model]

building groups over conventional technologies:

-----------------------------------------------

group addressing

````````````````

maintaining gp membership

- centralized program with calls to register, query, fwd a mesg to a gp, remove a gp member. pb - fault tolerance. suggestion: map mcasts -> reads and membership changes -> writes (creates the notion that mcasts happen only when membership is not changing)

- isis impls above not with locks but by replicating membership info across members
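The read/write suggestion above can be sketched as a readers-writer discipline: multicasts act as readers of the membership list and may overlap, while a join or leave acts as a writer and waits until no multicast is in flight. The `GroupDirectory` class and its methods are hypothetical. The sketch is a single centralized object, so it does nothing about the fault-tolerance pb, and as noted above Isis itself replicates membership info instead of locking.

```python
import threading

class GroupDirectory:
    """Centralized membership with mcast = read, membership change = write."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_mcasts = 0   # number of "readers" in progress
        self._changing = False    # True while a "writer" is active
        self._members = []

    def mcast(self, mesg, send):
        # Reader entry: wait until no membership change is in progress.
        with self._cond:
            while self._changing:
                self._cond.wait()
            self._active_mcasts += 1
            targets = list(self._members)   # membership is stable for this mcast
        try:
            for member in targets:
                send(member, mesg)
        finally:
            with self._cond:
                self._active_mcasts -= 1
                self._cond.notify_all()
        return targets

    def _change(self, update):
        # Writer entry: wait until no mcast and no other change is in progress.
        with self._cond:
            while self._changing or self._active_mcasts > 0:
                self._cond.wait()
            self._changing = True
        try:
            update(self._members)
        finally:
            with self._cond:
                self._changing = False
                self._cond.notify_all()

    def join(self, member):
        self._change(lambda members: members.append(member))

    def leave(self, member):
        self._change(lambda members: members.remove(member))

directory = GroupDirectory()
directory.join("p1")
directory.join("p2")
log = []
directory.mcast("hello", lambda member, mesg: log.append((member, mesg)))
directory.leave("p1")
directory.mcast("bye", lambda member, mesg: log.append((member, mesg)))
```

Each multicast sees one fixed membership list from start to finish, which is the property the suggestion is after. The `send` callback must not itself call `join` or `leave`, since that would wait on its own multicast.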

logical time and causal dependency

``````````````````````````````````

'reaching all of its members at the same time' - defined using Lamport's notion of causal event ordering. thus the delivery events of each mcast are concurrent and mcasts are totally ordered.
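For reference, a minimal sketch of Lamport's logical clock rule, which underlies the ordering notion mentioned above: if event a happened before event b, then the clock value of a is smaller than that of b. The class and method names are illustrative only.

```python
class LamportClock:
    """Scalar logical clock following Lamport's two update rules."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: advance the clock on every local event (including a send).
        self.time += 1
        return self.time

    def send(self):
        # The timestamp carried by an outgoing mesg.
        return self.tick()

    def receive(self, mesg_time):
        # Rule 2: on delivery, jump past the sender's timestamp.
        self.time = max(self.time, mesg_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
a = p.tick()        # local event at p
ts = p.send()       # p sends a mesg carrying timestamp ts
b = q.receive(ts)   # q delivers it; this event is causally after the send
```

The converse does not hold: a smaller clock value does not by itself imply a causal relation, so concurrent events can carry any relative timestamps.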

message delivery ordering:

``````````````````````````

(a problem we might tend to ignore)

soln: each mesg contains a context record of size linear in the number of gp members (worst case - multiple context records per mesg) - used to delay delivery until prior mesgs have been delivered.

pb: m1 and m2, though concurrent, have to be ordered the same way at all gp members
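The context record in the soln above behaves much like a vector timestamp with one entry per gp member. A minimal sketch, assuming a fixed group and using a hypothetical `CausalMember` class, shows how a mesg that arrives before its causal predecessor is held back:

```python
class CausalMember:
    """Delays delivery until all causally prior mesgs have been delivered."""

    def __init__(self, index, group_size):
        self.index = index
        self.vt = [0] * group_size   # vt[j] = mesgs from member j delivered here
        self.pending = []            # received but not yet deliverable
        self.delivered = []

    def send(self, payload):
        self.vt[self.index] += 1
        self.delivered.append(payload)              # sender delivers to itself
        return (self.index, list(self.vt), payload)  # context record = copy of vt

    def _deliverable(self, sender, vt):
        # Must be the next mesg from this sender ...
        if vt[sender] != self.vt[sender] + 1:
            return False
        # ... and we must already have everything the sender had seen.
        return all(vt[k] <= self.vt[k] for k in range(len(vt)) if k != sender)

    def receive(self, mesg):
        self.pending.append(mesg)
        progress = True
        while progress:   # one delivery may unblock other pending mesgs
            progress = False
            for m in list(self.pending):
                sender, vt, payload = m
                if self._deliverable(sender, vt):
                    self.pending.remove(m)
                    self.vt[sender] += 1
                    self.delivered.append(payload)
                    progress = True

p0, p1, p2 = (CausalMember(i, 3) for i in range(3))
m1 = p0.send("m1")
p1.receive(m1)
m2 = p1.send("m2")            # causally after m1
p2.receive(m2)                # arrives first at p2: held back
early = list(p2.delivered)    # nothing delivered yet
p2.receive(m1)                # m1 is delivered, then the delayed m2
```

This orders only causally related mesgs. Concurrent m1 and m2 can still be delivered in different orders at different members, which is the pb noted just above.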

state transfer:

```````````````

fig 4B: ==> complex synchronization algo (beyond the ability of typical programmer)

fault tolerance:

````````````````

pb: sender fails after mcast has been delivered to some of the recvrs. aim - to maintain all or none delivery.

soln (in general, complex protocols): a simple one is a three-round mcast protocol (fig 5) with a dest (the lowest one) taking over if the sender fails. expensive protocol.

summary: pbs confronting a gp-sw developer compared to a conventional OS developer

```````````````````````````````````````````````````````````````````````````````

1. weak support for reliable communication (ex. channels break)
2. gp address expansion
3. delivery ordering for concurrent mesgs
4. delivery ordering for related mesgs
5. state transfers
6. failure atomicity

others overlooked: real-time delivery guarantees, persistence of databases and files.

- hence the advantage to the programmer of having a process gp tool

Virtual Synchrony:

******************

motivated by work on transaction serializability

basic idea - let programmers assume a closely synchronized style of distbd execution:
- process = seq of events
- all events seen in same order by all processes
- send and delivery considered as single instantaneous events

Close Synchrony:

```````````````

("IDEAL" !)

solves all above problems. features:
- weak communication reliability guarantees
- group address expansion
- delivery ordering for concurrent mesgs
- delivery ordering for seqs of related mesgs
- state transfer
- failure atomicity

pbs with close synchrony:

- not practical
- simple but expensive

Virtual Synchrony

`````````````````

permits asynchronous executions for which there exists some closely synchronous execution that is indistinguishable. synchronizes events only to the degree that the app is sensitive to event ordering.

order sensitivity in distributed systems:

- - - - -

"When can synchronization be relaxed in a virtually synchronous distbd system ?"

- abcast - 'atomic delivery ordering' - primitive that is easy to use but expensive to impl (since need to check for earlier mesgs, may need to buffer these mesgs), high latencies.

- cbcast - weaker than abcast, permits mesgs that were sent concurrently to be delivered to overlapping destinations in different seqs, lower latency (a cbcast mesg can be delivered as soon as any prior mesgs have been delivered)
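To see where the buffering cost of abcast comes from, here is a minimal sketch of total ordering through a sequencer. The sequencer is one common technique, assumed here purely for illustration; it is not claimed to be the protocol Isis uses, and the `Sequencer` and `TotalOrderMember` names are hypothetical.

```python
import itertools

class Sequencer:
    """Assigns one global sequence number to each mesg."""

    def __init__(self):
        self._next = itertools.count(1)

    def order(self, payload):
        return (next(self._next), payload)

class TotalOrderMember:
    """Delivers mesgs strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1
        self.buffer = {}      # seq -> payload, for mesgs that arrived early
        self.delivered = []

    def receive(self, stamped):
        seq, payload = stamped
        self.buffer[seq] = payload
        # A gap in the sequence forces buffering until the earlier mesg arrives.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

sequencer = Sequencer()
a = sequencer.order("m1")
b = sequencer.order("m2")
x, y = TotalOrderMember(), TotalOrderMember()
x.receive(a)
x.receive(b)
y.receive(b)   # the network reordered the two mesgs on the way to y
y.receive(a)
```

Both members end with the same delivery sequence even though m1 and m2 are concurrent. The extra hop through the sequencer and the buffering at `y` are the kind of latency a cbcast avoids when the app does not care about the relative order of concurrent mesgs.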

ex. locks

just mention: close synchrony<--> Schneider's state machine approach. weak consistency, lazy update.

Summary of benefits due to Virtual Synchrony:

`````````````````````````````````````````````

- code can be developed assuming a closely synchronous model
- supports meaningful notion of group state and state transfer (for both data replication and computation being dynamically partitioned among gp members)
- asynchronous pipelined communication
- treatment of communication, process gp membership changes, and failures thru a single event-oriented execution model
- failure handling thru a consistently presented gp membership list integrated into the communication system (compare with the usual approach of sensing failures thru timeouts)

Limitations:

```````````

- reduced availability during lan partitions: allows progress in a single partition only (the primary partition). thus tolerates <= [n/2] - 1 simultaneous failures.
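The failure bound above can be checked with a little arithmetic, under two assumptions the notes do not spell out: that [n/2] means n/2 rounded up, and that the primary partition must contain a strict majority of the n members. The function names are hypothetical.

```python
def max_tolerated_failures(n):
    """Largest f such that the n - f survivors are still a strict majority of n.

    (n - 1) // 2 equals ceil(n / 2) - 1 for every positive integer n.
    """
    return (n - 1) // 2

def is_primary(partition_size, n):
    """A partition may make progress only if it holds a strict majority."""
    return 2 * partition_size > n

table = {n: max_tolerated_failures(n) for n in (3, 4, 5, 7)}
# table == {3: 1, 4: 1, 5: 2, 7: 3}
```

For n = 4, losing two members leaves exactly half the group, which is not a strict majority, so only one failure is tolerated.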

further notes:

`````````````

(from table on page 44)

- checkpoint/update logging, spooling for state recovery from failure
- token passing for synchronization
- monitor sites for failure
- for first member, use init from 1. sw or 2. logs; for future members, use join (+ transfer state)
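The last item, join plus state transfer, can be sketched as an in-process simulation. The `ReplicatedGroup` class and the member names are hypothetical; updates and joins are serialized here by construction, which stands in for the ordering that virtual synchrony is meant to provide.

```python
class ReplicatedGroup:
    """Toy replicated key/value state with join + state transfer."""

    def __init__(self, initial_state):
        # First member: state comes from the software's own init (or from logs).
        self.view = ["m0"]
        self.replicas = {"m0": dict(initial_state)}

    def update(self, key, value):
        # An update is applied at every current member.
        for state in self.replicas.values():
            state[key] = value

    def join(self, member):
        # Later members: join, then receive a copy of an existing member's state.
        donor = self.view[0]
        self.replicas[member] = dict(self.replicas[donor])
        self.view.append(member)

group = ReplicatedGroup({"x": 0})
group.update("x", 1)    # happens before the join: arrives via state transfer
group.join("m1")
group.update("y", 2)    # happens after the join: applied at both members
```

Because the join falls at a well-defined point between the two updates, the new member ends up with exactly the same state as the old one without replaying any history.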

The Isis toolkit:

*****************

Others: V, Amoeba, Chorus, IBM's AAS, Transis etc.

Isis - first to propose VSynchronous model and offer hi-perf, consistent solns to a variety of pbs.

Styles of groups:

`````````````````

tradeoff between a simple interface and availability of accurate info about gp membership for address expansion in mcast.

Isis optimized to handle each of 4 types of groups (anonymous and explicit not visible at this level)

- peer group
- cli-serv group
- diffusion group
- hierarchical group

Toolkit interface:

``````````````````

- toolkit includes asynchronous impls of the more important distbd programming paradigms, like a replication tool for managing replicated data, a synch tool that supports a form of locking, etc.

- exs in figures 10 & 11

Who uses Isis and How ?

***********************

*Brokerage systems:
-brokerage systems require f-t (@ file/db level or at lower local file level)
-Isis solves the first one and provides several tools for the latter

-key services -> process groups, for f-t, may improve response time. Ques --> when does ov'hd outweigh benefits of concurrency ? f-t is somewhat of a side effect of the replication approach.

-subscribe/publish facility reqd in stock exchanges

-isis network resource mgr.

*Database replication and triggers:
-isis for constructing distbd db apps
-replicate db services -> process groups
-note: recovery of servers -> use checkpoints and logs

-isis support for triggers (repeatedly eval'd for true) - notifying apps when trigger becomes true etc
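A minimal sketch of the trigger idea above: predicates are re-evaluated after each update and the app is notified at the moment a predicate becomes true. The `TriggerTable` class, the dictionary standing in for the db and the price threshold are all hypothetical.

```python
class TriggerTable:
    """Re-evaluates predicates on every update; notifies on false -> true."""

    def __init__(self):
        self.triggers = []   # each entry: [predicate, notify, was_true]

    def add(self, predicate, notify):
        self.triggers.append([predicate, notify, False])

    def on_update(self, db):
        for trigger in self.triggers:
            predicate, notify, was_true = trigger
            now_true = bool(predicate(db))
            if now_true and not was_true:
                notify(db)            # fires only when the trigger becomes true
            trigger[2] = now_true

alerts = []
table = TriggerTable()
table.add(lambda db: db["price"] > 100, lambda db: alerts.append(db["price"]))
for price in (90, 105, 110, 95, 120):
    table.on_update({"price": price})
# alerts == [105, 120]: no repeat at 110, fires again after dropping to 95
```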

*Major isis based utilities:

-NEWS: ex for brokerage systems
-NMGR: manages batch style jobs and performs load sharing in a DS (ex. parallel make/compilation)
-DECEIT: f-t NFS compatible file storage
-META/LOMITA (systems for building f-t reactive control apps): lower layer of sensors and actuators -> higher abstraction layer of sensors and actuators (META) -> di lang for specifying ctrl actions (LOMITA)

ctrl stmts (sort of like queries to sensors) --translated to--> di fsms. process grps are used to impl aggregates, perform state transitions and finally apps are notified.

-SPOOLER saving mesgs to gps that are active only periodically

*Other ISIS apps:

-telecommunications switching
-intelligent network apps
-military systems (AEGIS)
-...
-CERN

Other issues (not considered) in Isis:

~~~~~~~~~~~~~~~~~~~~~~~~~~~

-real time
-how does isis compare with the transactional model (vsynch <-> transaction serializability): mechs are similar, uses are different. persistency is not a concern in isis. ex in paper in last line of section if you're interested.

Conclusion:

**********

;-)

-interesting points: isis performs roughly on par with rpc+streams (std technology)
-isis microkernel ???, security+real time too !?

----ENDOFSESSION--------------------------------------