CS514: Fault-tolerant Distributed Computer Systems -- Topic Outline

Here is a high-level listing of the topics we will try to cover this semester. Recommended readings are noted along with each topic. You can find what you need in [B05], but the additional references cited offer useful contrasting perspectives and are well worth reading if you have the time.

Note: the list of topics may change during the first weeks of the class.

References

[B05] Ken Birman. Reliable Distributed Systems. Springer-Verlag, May 2005.

[BS96] Tom Bressoud and Fred B. Schneider. Hypervisor-based Fault-Tolerance. ACM Transactions on Computer Systems, 14(1):80-107, February 1996.

[CL85] K.M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1):63-75, February 1985.

[EAWJ 99] Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A survey of rollback-recovery protocols in message-passing systems. Technical Report. Available from http://www.cs.utexas.edu/users/lorenzo/papers/Pap6.ps or locally as postscript or pdf. Because this paper was written using Microsoft Word, Windows Ghostview (not UNIX Ghostview) must be used as the helper application to read it.

[GB85] H. Garcia-Molina and D. Barbara. How to assign votes in a distributed system. Journal of the ACM, 32(4):841-860, October 1985.

[H86] M. Herlihy. A quorum-consensus replication method for abstract data types. ACM Transactions on Computer Systems, 4(1):32-53, February 1986.

[L01] Leslie Lamport. Paxos Made Simple. To appear.

[M94] S. Mullender (editor). Distributed Systems, Second Edition. ACM Press, Addison-Wesley Publishing Company, Reading, Mass., 1994.

[MR98] D. Malkhi and M. Reiter. Byzantine quorum systems. Distributed Computing, 11(4):203-213, 1998.

[MS] Keith Marzullo and Frank Schmuck. Supplying high availability with a standard network file system.

[S87] Fred B. Schneider. Understanding protocols for Byzantine clock synchronization. Cornell University Computer Science Technical Report TR 87-859, August 1987.