CS5410: Fault-tolerant Distributed Computer Systems -- Topic Outline

Here is a high-level listing of the topics we will try to cover this semester. Recommended readings are still being updated.  Ken's book covers a lot of these topics, but in a different order and with a slightly different emphasis (he's revising it but the new edition won't be out for a while)

  1.  [9/1/08] Cloud and edge computing: Two side-by-side revolutions
  2. [9/3/08] Web Services and SOA standards.  CORBA and OO standards
  3. [9/8/08] Key components of cloud computing platforms
  4. [9/10/08] Cloud computing applications and Map-Reduce
  5. [9/15/08] Thinking about distributed systems: Models of time and event ordering
  6. [9/17/08] Clock synchronization and the limits of real-time
  7. [9/22/08] Consensus on event ordering: The GMS Service(1)
  8. [9/24/08] The GMS Service(2)
  9. [9/29/08] State machine concept.  Possible functionality that our GMS can support
  10. [10/1/08] Recap: replication models/protocols.  Maelstrom [M07] and Ricochet [R08]: Time critical replication between and within data centers
  11. [10/6/08] Replication with stronger semantics outside the GMS service: Virtual synchrony, Paxos
  12. [10/8/08] Transactional subsystems and Web Services support for the transactional model.  How transactional servers are implemented

No class Oct 11-Oct 14 (fall break)

  1. [10/15/08] eBay architecture and design principles (material from a talk by Randy Shoub/eBay)
  2. [10/20/08] The Power Challenge in cloud platforms (material from a talk by James Hamilton/MSFT)
  3. [10/22/08] DHTs.  Chord, Pastry, Kelips
  4. [10/27/08] Challenges of scalable state replication and system monitoring.  Astrolabe
  5. [10/29/08] Building overlay networks.  Resilient Overlay Networks. 
  6. [11/03/08] DHT-based overlays.  Scribe over Pastry [S6].  Bologna's T-Man system [TM]
  7. [11/05/08] Pub-sub overlays.  Gryphon [GRY]. Sienna [SNA].
  8. [11/10/08] Cooperative storage systems: PAST [PST], CFS [CFS].  Censorship-Resistant Storage: Tangler [TAN].
  9. [11/12/08] Cooperation versus Punishment: BitTorrent [BT].  BAR Gossip [BG]
  10. [11/17/08] Sybil attacks and reputation tracking.  [S1-S5]
  11. [11/19/08] Failure models and failure detectors.  Pros and cons of detecting problems versus masking them.  [FD1-FD2]
  12. [11/24/08] Trusted computing issues seen in cloud settings.  Practical Byzantine Agreement [ZY08]

**       [11/26/08] Ask the ObjectMaster: Special session on application design with live objects for HW3

No class Nov 27-30 (Thanksgiving break)

  1. [12/1/08] Dr. Multicast.
  2. [12/4/08] Live distributed objects. Fusing synthetic and real-world event-stream data

Project demo day: December 8

References

[B05] Ken Birman.,  Reliable Distributed Systems.  Springer-Verlag, (May 2005).

[M07]  Maelstrom: Transparent Error Correction for Lambda Networks.  Mahesh Balakrishnan, Tudor Marian, Ken Birman, Hakim Weatherspoon, Einar Vollset.  USENIX Symposium on Networked System Design and Implementation (NSDI 08). April 2008. 

[R08]  Ricochet: Lateral Error Correction for Time-Critical Multicast.  Mahesh Balakrishnan, Ken Birman, Amar Phanishayee, and Stefan Pleisch.  To Appear in Proceedings of the 4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07). Cambridge, MA. April 2007.

[S1]   http://pdos.csail.mit.edu/papers/sybil-dht-socialnets08.pdf

[S2]   http://thrackle.eas.asu.edu/users/goran/papers/dc06.pdf

 

[S3]   http://www.si.umich.edu/~presnick/papers/cacm00/reputations.pdf

 

[S4]   http://www.cs.ucl.ac.uk/staff/d.quercia/publications/quercia07lightweight.pdf

 

[S5]   Credence: http://www.cs.cornell.edu/People/egs/credence/

 

[S6]  SCRIBE: A large-scale and decentralised application-level multicast infrastructure.  M. Castro, P. Druschel, A-M. Kermarrec  and A. Rowstron, IEEE Journal on Selected Areas in Communications (JSAC) (Special issue on Network Support for Multicast Communications). 2002

 

[TM] T-Man: Fast Gossip-based Construction of Large-Scale Overlay Topologies.  Mark Jelasity, Ozalp Babaoglu.  Technical Report UBLCS-2004-7, May 2004. 

 

[GRY] Gryphon: An information flow based approach to message brokering.  Robert Strom, Guruduth Banavar, Tushar Ch, Marc Kaplan, Kevan Miller, Bodhi Mukherjee, Daniel Sturman, Michael Ward.  International Symposium on Software Reliability Engineering 1998.

 

[SIE]  (Too) many papers are available at http://serl.cs.colorado.edu/~serl/dot/siena.html

 

[PST] Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. Ant Rowstron and Peter Druschel.  18th SOSP, Banff, Alberta, Canada , 2001.

 

[CFS] Wide-area cooperative storage with CFS.  Frank Dabek, Frans Kaashoek, David Karger, Robert Morris, Ion Stoica. 18th SOSP, Banff, Alberta, Canada , 2001.

 

[TAN]  Tangler home page has several papers: http://www.scs.cs.nyu.edu/tangler/

[FD1] http://www.cs.cornell.edu/home/sam/FDpapers.html

[FD2] http://portal.acm.org/citation.cfm?doid=1052796.1052806

 

[BT] Bit Torrent (protocol)

 

[BG] BAR Gossip.  Harry C. Li, Allen Clement, Edmund L. Wong, Jeff Napper, Indrajit Roy, Lorenzo Alvisi, Michael Dahlin. OSDI 2006.

 

[ZY08] Zyzzyva: Speculative Byzantine Fault Tolerance.  R. Kotla, A. Clement, E. Wong, L. Alvisi, M. Dahlin.  Comm. of the ACM.  51:11 (Nov. 2008), 86-95.  Revised from a version that appeared in SOSP 2007 (Oct. 2007).

(more coming soon)