Robbert van Renesse

Introduction/History

Building intuition

Case studies:
- Failure Detection
- Bimodal Multicast
- Monitoring/Resource Location

Partitioning

Security

Data Dissemination

Want efficiency, robustness, speed, scale

Tree distribution is efficient, but fragile
- (plus configuration is difficult)

Flooding is robust, but inefficient

Gossip is both efficient and robust, but has relatively high latency

Tree + Gossip is a good combination

History

Ladies and Telephones (MIT, 1972)

Grapevine/Clearinghouse Directory Service (Demers, Xerox PARC, 1987)

Refdbms (Golding, UCSC, 1993)

Bayou (Xerox PARC, 1995)

Bimodal Multicast (Cornell, 1998)

Astrolabe (Cornell, 1999)

How does gossip spread?

State Monotonic Property

A gossip message contains the state of the sender of the gossip.

The receiver uses a Merge function to merge the received state and the sent state:
- State’ = Merge(State, Gossip)

Need some kind of monotonicity:
- State’ ? State
- State’ ? Gossip

Anti-Entropy

This gossip scheme with monotonic merge is sometimes called anti-entropy.

The protocol is called a simple epidemic.

How fast does gossip spread?

Epidemic theory (e.g., Bailey …)

Assume a fixed population of size n.

For now, assume homogeneous spreading
- simple epidemic: anybody can infect anyone else with equal probability

Assume k members already infected.

Assume infection occurs in rounds.

Probability of Infection?

What is the probability Pinfect(k, n) that a particular uninfected member is infected in a round if k are already infected?

Pinfect(k, n)

= 1 – P(nobody infects member)

= 1 – (1 – 1/n)^k

E(#newly infected members) =

(n – k) ? Pinfect(k, n)

(binomial distribution)

Intuition: 2 phases

Phase 1: 1 ? n/2 (first half)

Phase 2: n/2 ? n (second half)

For large n, Pinfect(n/2, n) ? 1 ? (1/e)^.5 ? .4

Half way:

Infection grows by factor 1.4

Uninfection declines by factor .4

Phase 1: fast growth of infection

Initial rate of growth: factor of 2

Half way: factor of 1.4

(Near end, factor ? 1)

Bound above and below by exponential growth

Phase 2: fast decline of uninfection

Starts very slowly

Halfway, decline is .4

Last infection: Pinfect(n-1,n) ? 1 ? 1/e ? .63

Bound above and below by exp. decline

Exponential growth

Taken together: #rounds necessary to infect the entire population grows O(log n)

Base of log: 1.585 (experimental)

Even under bad conditions (see later):
- member failures
- message loss
- but base of log decreases

#rounds distribution

Expected #rounds

Case Study 1: Failure Detection

Accurate fd difficult, if not impossible

Mistakes can be allowed

Useful for
- system administration
- replication
- load balancing
- group communication

Existing FDs slow or too unreliable

Informal Properties

Mistake probability fixed
- (independent of #members)

Scales in #members (O(nlogn))

Scales in bandwidth (O(n))

Resilient against message loss

Resilient against crashes

Environment

Crash failures and partitions

Unbounded message delay

Negligible clock drift

Basic Gossip Protocol

Each member maintains a list of (address, heartbeat) pairs

Periodically, each member gossips:
- increments its own heartbeat
- sends list to randomly chosen member

On receipt of gossip, merge lists

Each member maintains last time heartbeat increased for each other

Linear Bandwidth

Gossip message grows linearly with n

#members grows linearly with n
- Slow down gossiping linearly:

How long to wait before reporting failure?

Model

Each micro-round one random member gossips to another random one.

We track “infection” of one heartbeat of one member.

Calculate probability that k members are infected in micro-round i:

f members failed from start

Failure Caveat

Assume initial member does not fail

This affects outcome by at most one round:
- Initially infected member would have to crash right after it gossips
- So does the recipient of the gossip, and so on.

Analysis

Failure Detection Time

B = 250 bytes/sec/member

Quality of Detection

Effect of Failed Members

Effect of Message Loss

Case Study 2: Bimodal Multicast

Problem with gossip: high latency

Problem with tree-based multicast: fragile

Combination provides best of both worlds

Reliable Multicast:

PPT Slide

Throughput as one member of a multicast group is "perturbed" by forcing it to sleep for varying amounts of time.

PPT Slide

Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)

PPT Slide

Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.

An aside: Push vs. Pull

Sending all state in a gossip message can be too much overhead and wasteful.

Instead, can send a version number or a timestamp, and recipient can request the information desired.

PPT Slide

Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip

PPT Slide

Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.

Example

Process A

Process B

Process C

Process D

C multicasts

Gossip

Retransmission

time

Delivery? Garbage Collection?

Deliver a message when it is in FIFO order

Garbage collect a message when you believe that no “healthy” process could still need a copy with high probability.

Match parameters to intended environment

Throughput with perturbed processes

But, gossip doesn’t scale...

Requires membership
- doesn’t scale

Load on network grows quickly
- linear if one source of information
- quadratic if all participants can contribute

Led to demise of Xerox Clearinghouse
- (and the victory of DNS)

High load on routers

subnet

backbone

Problems with partitions

Idea: add locality to gossip

Gossip mostly in your neighborhood

Occasionally gossip farther away

Generalize to multiple levels

Resembles spread of (real) virusses

Domains

Smallest domain: local host

Largest domain: all hosts

Domains form a tree

Multi-level Gossip Protocol

Start with local domain

Pick a member at random

If picked self, go to next level up
- If no more levels, don’t gossip

Send gossip to chosen member
- pick random subdomain in chosen member
- if not host-level, then descend into subdomain
- otherwise send message

Better properties

Most gossips are local

Fewer problems with partitioning

At every level, about the same gossip load

Within any domain, there is, on average, one gossip message from every node to every other node

But, propagation is slower:

Two-level hierarchy

Two-level cost

No longer logarithmic...

#levels in the domain tree is O(log n)

resulting growth, log^log, is polynomial

Problems

Polynomial growth
- (degree is small though, like .2)
- if n = 1,000,000,000, branching factor is 100, and gossip every second, dissem time < 10 min.

Still requires full membership

Message sizes may grow linearly if everybody contributes information (e.g., a sequence number for each member)

New idea

Reduce information content with distance
- e.g., go from exact values to average values
- from exact membership to representatives
- use distance metric in the domain tree

Case Study 3: Astrolabe

Scalable Monitoring of Resources
- monitoring info stored in (virtual) tree called the Hierarchical Management Information Base
- leaf nodes formed by hosts
- internal nodes correspond to domains

Resource Location based on this info

Uses gossip and mobile code

Example HMIB

HMIB tree

Each node contains a table: rows are MIBs, columns are types of resource information.

Each machine only stores those nodes on the path to the root of the tree.

Internal nodes are calculated by a so-called condensation function (average, total, first three, …)

HMIB node/table

Each table has a row for each child domain

Required to have two columns at least:
- ID: a unique identifier for the domain
- contacts: a (short) list of addresses of representative members. Typically uses three members of the child domain.

Example: digital library

Domain2

Domain3

Condensation

Index1 in Domain1

Index2 in Domain3

Applications

This info may be used do:
- locate Index servers
- locate hot-spots
  - how many replicas are needed?
  - where should they run?
- balance load

Other applications

Cluster management
- where to run servers

Reliable multicast
- membership
- routing
- retransmission

Finding resources
- e.g., closest idle workstation or printer

Implementation

Timestamp rows in tables with wall time

Occasionally gossip
- multi-level gossip scheme
- use contacts instead of full membership
- gossip list of (ID, timestamp) pairs

On receipt of gossip, return updates

On receipt of update, merge into state
- and evaluate condensation functions

Membership issues

In a large distributed application, machines will be joining and leaving all the time.

Combined crash rate goes up linearly.

Partitions become more likely with scale.

Luckily, in Astrolabe only part of the membership is visible.

Failure detection

Include “Wall-Time” column in tables.

When the value of Wall-Time goes below some amount before the current time, remove the corresponding row.

Recalculate condensation function to update “Contacts”.

Member detection /Partition resolver

Each member occasionally multicasts a gossip message.

Rate depends on 1/#members
- say every second decide to multicast with probability 1/#members.

Maintain a “#members” column in each table, summing the “#members” in the corresponding child domains.

Condensation Functions

Coded in a safe language

Stored in tables themselves
- updates propagate in the normal way
- that is, the code is mobile

A few word on security

It’s hard to keep gossip from spreading

That’s good, because an adversary can’t stop the flow of information.

That’s bad, because an adversary can spread false updates, or invent new domains.

It’s possible using public key certificates and signatures to prevent this.

Related Literature

The Mathematical Theory of Infectious Diseases and its Applications. N.T.J. Bailey. Hafner Press. 1975.

Epidemic Algorithms for Replicated Database Maintenance. A. Demers et al. Proc. of the 6th ACM PODC conf. August 1987.

A Weak-Consistency Architecture for Distributed Information Services. R.A. Golding. Computing Systems 5(4), Fall 1992.

Flexible Update Propagation for Weakly Consistent Replication. K. Petersen et al. Proc of the 16th ACM SOSP conf. October 1997.

My home page: http://www.cs.cornell.edu/home/rvr