Epidemic Protocols(or, Gossip is Good)
Overview
- Case studies:
- Failure Detection
- Bimodal Multicast
- Monitoring/Resource Location
Data Dissemination
- Want efficiency, robustness, speed, scale
- Tree distribution is efficient, but fragile
- (plus configuration is difficult)
- Flooding is robust, but inefficient
- Gossip is both efficient and robust, but has relatively high latency
- Tree + Gossip is a good combination
History
- Ladies and Telephones (MIT, 1972)
- Grapevine/Clearinghouse Directory Service (Demers, Xerox PARC, 1987)
- Refdbms (Golding, UCSC, 1993)
- Bimodal Multicast (Cornell, 1998)
- Astrolabe (Cornell, 1999)
How does gossip spread?
State Monotonic Property
- A gossip message contains the state of the sender of the gossip.
- The receiver uses a Merge function to merge the received state and the sent state:
- State’ = Merge(State, Gossip)
- Need some kind of monotonicity:
- State’ ? State
- State’ ? Gossip
Anti-Entropy
- This gossip scheme with monotonic merge is sometimes called anti-entropy.
- The protocol is called a simple epidemic.
How fast does gossip spread?
- Epidemic theory (e.g., Bailey …)
- Assume a fixed population of size n.
- For now, assume homogeneous spreading
- simple epidemic: anybody can infect anyone else with equal probability
- Assume k members already infected.
- Assume infection occurs in rounds.
Probability of Infection?
- What is the probability Pinfect(k, n) that a particular uninfected member is infected in a round if k are already infected?
= 1 – P(nobody infects member)
E(#newly infected members) =
Intuition: 2 phases
- Phase 1: 1 ? n/2 (first half)
- Phase 2: n/2 ? n (second half)
- For large n, Pinfect(n/2, n) ? 1 ? (1/e)^.5 ? .4
- Infection grows by factor 1.4
- Uninfection declines by factor .4
Phase 1: fast growth of infection
- Initial rate of growth: factor of 2
- Bound above and below by exponential growth
Phase 2: fast decline of uninfection
- Last infection: Pinfect(n-1,n) ? 1 ? 1/e ? .63
- Bound above and below by exp. decline
Exponential growth
- Taken together: #rounds necessary to infect the entire population grows O(log n)
- Base of log: 1.585 (experimental)
- Even under bad conditions (see later):
- member failures
- message loss
- but base of log decreases
#rounds distribution
Expected #rounds
Case Study 1: Failure Detection
- Accurate fd difficult, if not impossible
- Useful for
- system administration
- replication
- load balancing
- group communication
- Existing FDs slow or too unreliable
Informal Properties
- Mistake probability fixed
- (independent of #members)
- Scales in #members (O(nlogn))
- Scales in bandwidth (O(n))
- Resilient against message loss
- Resilient against crashes
Environment
- Crash failures and partitions
Basic Gossip Protocol
- Each member maintains a list of (address, heartbeat) pairs
- Periodically, each member gossips:
- increments its own heartbeat
- sends list to randomly chosen member
- On receipt of gossip, merge lists
- Each member maintains last time heartbeat increased for each other
Linear Bandwidth
- Gossip message grows linearly with n
- #members grows linearly with n
- Slow down gossiping linearly:
How long to wait before reporting failure?
Model
- Each micro-round one random member gossips to another random one.
- We track “infection” of one heartbeat of one member.
- Calculate probability that k members are infected in micro-round i:
- f members failed from start
Failure Caveat
- Assume initial member does not fail
- This affects outcome by at most one round:
- Initially infected member would have to crash right after it gossips
- So does the recipient of the gossip, and so on.
Analysis
Failure Detection Time
Quality of Detection
Effect of Failed Members
Effect of Message Loss
Case Study 2: Bimodal Multicast
- Problem with gossip: high latency
- Problem with tree-based multicast: fragile
- Combination provides best of both worlds
- Reliable Multicast:
1. initial multicast (unreliable)
2. repair phase (retransmission)
3. garbage collection (to release buffers)
PPT Slide
Throughput as one member of a multicast group is "perturbed" by forcing it to sleep for varying amounts of time.
PPT Slide
Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)
PPT Slide
Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.
An aside: Push vs. Pull
- Sending all state in a gossip message can be too much overhead and wasteful.
- Instead, can send a version number or a timestamp, and recipient can request the information desired.
PPT Slide
Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip
PPT Slide
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
Example
Delivery? Garbage Collection?
- Deliver a message when it is in FIFO order
- Garbage collect a message when you believe that no “healthy” process could still need a copy with high probability.
- Match parameters to intended environment
Throughput with perturbed processes
But, gossip doesn’t scale...
- Load on network grows quickly
- linear if one source of information
- quadratic if all participants can contribute
- Led to demise of Xerox Clearinghouse
High load on routers
Problems with partitions
Idea: add locality to gossip
- Gossip mostly in your neighborhood
- Occasionally gossip farther away
- Generalize to multiple levels
- Resembles spread of (real) virusses
Domains
- Smallest domain: local host
- Largest domain: all hosts
Multi-level Gossip Protocol
- If picked self, go to next level up
- If no more levels, don’t gossip
- Send gossip to chosen member
- pick random subdomain in chosen member
- if not host-level, then descend into subdomain
- otherwise send message
Better properties
- Fewer problems with partitioning
- At every level, about the same gossip load
- Within any domain, there is, on average, one gossip message from every node to every other node
- But, propagation is slower:
Two-level hierarchy
Two-level cost
No longer logarithmic...
- #levels in the domain tree is O(log n)
- resulting growth, log^log, is polynomial
Problems
- Polynomial growth
- (degree is small though, like .2)
- if n = 1,000,000,000, branching factor is 100, and gossip every second, dissem time < 10 min.
- Still requires full membership
- Message sizes may grow linearly if everybody contributes information (e.g., a sequence number for each member)
New idea
- Reduce information content with distance
- e.g., go from exact values to average values
- from exact membership to representatives
- use distance metric in the domain tree
Case Study 3: Astrolabe
- Scalable Monitoring of Resources
- monitoring info stored in (virtual) tree called the Hierarchical Management Information Base
- leaf nodes formed by hosts
- internal nodes correspond to domains
- Resource Location based on this info
- Uses gossip and mobile code
Example HMIB
HMIB tree
- Each node contains a table: rows are MIBs, columns are types of resource information.
- Each machine only stores those nodes on the path to the root of the tree.
- Internal nodes are calculated by a so-called condensation function (average, total, first three, …)
HMIB node/table
- Each table has a row for each child domain
- Required to have two columns at least:
- ID: a unique identifier for the domain
- contacts: a (short) list of addresses of representative members. Typically uses three members of the child domain.
Example: digital library
Applications
- This info may be used do:
- locate Index servers
- locate hot-spots
- how many replicas are needed?
- where should they run?
- balance load
Other applications
- Reliable multicast
- membership
- routing
- retransmission
- Finding resources
- e.g., closest idle workstation or printer
Implementation
- Timestamp rows in tables with wall time
- Occasionally gossip
- multi-level gossip scheme
- use contacts instead of full membership
- gossip list of (ID, timestamp) pairs
- On receipt of gossip, return updates
- On receipt of update, merge into state
- and evaluate condensation functions
Membership issues
- In a large distributed application, machines will be joining and leaving all the time.
- Combined crash rate goes up linearly.
- Partitions become more likely with scale.
- Luckily, in Astrolabe only part of the membership is visible.
Failure detection
- Include “Wall-Time” column in tables.
- When the value of Wall-Time goes below some amount before the current time, remove the corresponding row.
- Recalculate condensation function to update “Contacts”.
Member detection /Partition resolver
- Each member occasionally multicasts a gossip message.
- Rate depends on 1/#members
- say every second decide to multicast with probability 1/#members.
- Maintain a “#members” column in each table, summing the “#members” in the corresponding child domains.
Condensation Functions
- Stored in tables themselves
- updates propagate in the normal way
- that is, the code is mobile
A few word on security
- It’s hard to keep gossip from spreading
- That’s good, because an adversary can’t stop the flow of information.
- That’s bad, because an adversary can spread false updates, or invent new domains.
- It’s possible using public key certificates and signatures to prevent this.
Related Literature
- The Mathematical Theory of Infectious Diseases and its Applications. N.T.J. Bailey. Hafner Press. 1975.
- Epidemic Algorithms for Replicated Database Maintenance. A. Demers et al. Proc. of the 6th ACM PODC conf. August 1987.
- A Weak-Consistency Architecture for Distributed Information Services. R.A. Golding. Computing Systems 5(4), Fall 1992.
- Flexible Update Propagation for Weakly Consistent Replication. K. Petersen et al. Proc of the 16th ACM SOSP conf. October 1997.
- My home page: http://www.cs.cornell.edu/home/rvr