Basics

E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. "A survey of rollback-recovery protocols in message-passing systems." Technical Report CMU-CS-96-181, Carnegie Mellon University, Oct. 1996. [pdf]
L. Lamport. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM, 21(7):558-565, Jul. 1978. [pdf]
J. Mellor-Crummey and T. LeBlanc. "A software instruction counter." In Proceedings of the 3rd Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 78-86, Apr. 1989. [pdf]
R.D. Schlichting and F.B. Schneider. "Fail-Stop processors: An approach to designing fault-tolerant computing systems." ACM Transactions on Computer Systems, vol. 1(3): 222—238, Aug. 1983. [pdf]

Coordinated checkpointing

blocking

J.S. Plank, M. Beck, G. Kingsley and K. Li. "Libckpt: Transparent checkpointing under UNIX." In Proceedings of the USENIX Winter 1995 Technical Conference, pp. 213-223, Jan. 1995. [pdf]

J. S. Plank, J. Xu and R.B. Netzer. "Compressed differences: An algorithm for fast incremental checkpointing." Technical Report CS-95-302, University of Tennessee at Knoxville, Aug. 1995. [ps]

A. Beguelin, E. Seligman and P. Stephan. "Application level fault tolerance in heterogeneous networks of workstations." In Journal of Parallel & Distributed Computing, 43(2):147-155, Jun. 1997. [ps]

E. Seligman and A. Beguelin. "High-level fault tolerance in distributed programs." Technical Report CMU-CS-94-223, Department of Computer Science, Carnegie Mellon University, Dec. 1994. [ps]

non-blocking

O. Babaoglu and K. Marzullo. "Consistent global states of distributed systems: Fundamental concepts and mechanisms." Distributed Systems, Ed. S. Mullender, Addison-Wesley, pp. 55—96, 1993. [ps]

K.M. Chandy and L. Lamport. "Distributed snapshots: Determining global states of distributed systems." In ACM Transactions on Computer Systems, 3(1):63-75, Feb. 1985. [pdf]

K. Li, J.F. Naughton and J.S. Plank. "Real-time, concurrent checkpoint for parallel programs." In Proceedings of the 1990 Conference on the Principles and Practice of Parallel Programming, pp. 79—88, Mar. 1990. [pdf]

Uncoordinated checkpointing

Y. M. Wang. "Space reclamation for uncoordinated checkpointing in message-passing systems." Ph.D. Thesis, University of Illinois Urbana-Champaign, Aug. 1993. [???]
Y. M. Wang, P. Y. Chung, I. J. Lin and W. K. Fuchs. "Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems." In IEEE Transactions on Parallel and Distributed Systems, 6(5):546—554, May 1995. [???]
Y. M. Wang, P. Y. Chung, and W. K. Fuchs. "Tight upper bound on useful distributed system checkpoints." Technical Report CRHC-95-16, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1995. [ps]

Message logging

survey: Alvisi/Marzullo

L. Alvisi and K. Marzullo. "Message logging: Pessimistic, optimistic, causal and optimal." In IEEE Transactions on Software Engineering, 24(2):149—159, Feb. 1998. [ps]

optimistic:

R. Strom and S. Yemini. "Optimistic recovery in distributed systems." ACM Transactions on Computer Systems, 3(3): 204—226, Aug. 1985. [pdf]

D.B. Johnson and W. Zwaenepoel. "Recovery in distributed systems using optimistic message logging and checkpointing." In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, PODC-88, pp. 171-181, Aug. 1988. [pdf]

D. B. Johnson and W. Zwaenepoel. "Transparent optimistic rollback recovery." In Operating Systems Review, pp. 99—102, Apr. 1991. [ps]

sender-based logging:

D.B. Johnson and W. Zwaenepoel. "Sender-based message logging." In Proceedings of the Seventeenth International Symposium on Fault-Tolerant Computing (FTCS-17), pp. 14—19, Jun. 1987. [ps]

J. Xu, R.B. Netzer, and M. Mackey. "Sender-based message logging for reducing rollback propagation." In Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing, pp. 602—609, 1995. [???]

causal logging:

Manetho:

E.N. Elnozahy. "Manetho: Fault tolerance in distributed systems using rollback-recovery and process replication." Ph.D. Thesis, Rice University, Oct. 1993. Also available as Technical Report 93-212, Department of Computer Science, Rice University. [ps]

E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel. "The performance of consistent checkpointing." In Proceedings of the Eleventh Symposium on Reliable Distributed Systems, pp. 39—47, Oct. 1992. [ps]

E.N. Elnozahy and W. Zwaenepoel. "On the use and implementation of message logging." In Proceedings of the Twenty Fourth International Symposium on Fault-Tolerant Computing (FTCS-24), pp. 298—307, Jun. 1994. [ps]

E.N. Elnozahy and W. Zwaenepoel. "Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output commit." In IEEE Transactions on Computers, Special Issue on Fault-Tolerant Computing, 41(5):526-531, May 1992. [ps]

FBL: Alvisi and Marzullo

L. Alvisi and K. Marzullo. "Trade-offs in implementing causal message logging protocols." In Proceedings of the 1996 ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC), pp. 58-67, 1996. [pdf]

Byzantine failures

Tennessee:

Y. Kim, J.S. Plank and J.J. Dongarra. "Fault-tolerant matrix operations using checksum and reverse computation." In Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, Oct. 1996. [ps]

Y. Kim, J.S. Plank and J.J. Dongarra. "Fault-tolerant matrix operations for network of workstations using multiple checkpointing." In Proceedings of HPC Asia’97, High Performance Computing in the Information Superhighway, pp. 460—465, Apr. 1997. [ps]

J. S. Plank, K. Li, and M.A. Puening. "Diskless checkpointing." IEEE Transactions on Parallel & Distributed Systems, 9(10):972—986, Oct. 1998. [ps]

J. S. Plank, Y. Kim and J.J. Dongarra. "Algorithm-based diskless checkpointing for fault-tolerant matrix computations." In Proceedings of the Twenty Fifth International Symposium on Fault-Tolerant Computing Systems, pp. 351—360, Jun. 1995. [ps]

J. S. Plank, Y. Kim and J. J. Dongarra. "Fault-tolerant matrix operations for networks of workstations using diskless checkpointing." In Journal of Parallel & Distributed Computing, 43(2):125-138, Jun. 1997. [ps]

Prith Banerjee

Prithviraj Banerjee, Vijay Balasubramanian, and Amber Roy-Chowdhury. "Compiler Assisted Synthesis of Algorithm-Based Checking in Multiprocessors." To appear in Foundations of Dependable Computing: Vol III. System Implementation. Gary Koob, editor, Kluwer Academic Publishers. [ps]

self-checking programs

Replay and debugging

Netzer et al:

R.B. Netzer and B.P. Miller. "Optimal tracing and replay for debugging message-passing parallel programs." In Proceedings of Supercomputing’92, pp. 502—511, Nov. 1992. [ps]

R.B. Netzer and J. Xu. "Adaptive message logging for incremental program replay." In IEEE Parallel and Distributed Technology, 1(4):32—39, Nov. 1993. [???]

R.B. Netzer and J. Xu. "Replaying distributed programs without message logging." In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing (HPDC), pp. 137—147, Aug. 1997. [???]

Objective CAML

[online manual better reference?]

Shared memory

N. Neves, M. Castro and P. Guedes. "A checkpoint protocol for an entry consistent shared memory system." In Proceedings of the 13th ACM Symposium on Principles of Distributed Computing, Aug. 1994. [ps]
L.M. Silva, J.G. Silva and S. Chapple. "Portable transparent checkpointing for distributed shared memory." In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, HPDC-5, pp. 422-431, Aug. 1996. [href]

MPI

G. Stellner. "CoCheck: Checkpointing and process migration for MPI." In Proceedings of the 10th International Parallel Processing Symposium, Apr. 1996. [ps]
W.-J. Li and J.-J. Tsay. "Checkpointing message-passing interface (MPI) parallel programs." In Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, pp. 147-152, 1997. [href]

Compiling

G. Barigazzi and L. Strigini. "Application-transparent setting of recovery points." In Proceedings of the Thirteenth International Symposium on Fault-Tolerant Computing Systems, FTCS-13, pp. 48—55, 1983. [???]
M. Beck, J. S. Plank and G. Kingsley. "Compiler-assisted checkpointing." Technical Report CS-94-269, Department of Computer Science, University of Tennessee at Knoxville, Dec. 1994. [ps]
P. L'Ecuyer and J. Malenfant. "Computing optimal checkpointing strategies for rollback and recovery systems." In IEEE Transactions on Computers, vol. 37, pp. 491-496, Apr. 1988. [???]
C.M. Krishna, K.G. Shin and Y.-H. Lee. "Optimization criteria for checkpoint placement." In Communications of the ACM, 27(10):1008-1012, Oct. 1984. [pdf]
J. Long, W.K. Fuchs and J.A. Abraham. "Compiler-assisted static checkpoint insertion." In Proceedings of the Twenty Second Annual International Symposium on Fault-Tolerant Computing, FTCS-22, pp. 58—65, Jul. 1992. [???]
D. Manivannan, R. H. Netzer, and M. Singhal. "Finding consistent global checkpoints in a distributed computation." In IEEE Transactions on Parallel & Distributed Systems, 8(6):623—627, Jun. 1997. [ps]
J. S. Plank, M. Beck and G. Kingsley. "Compiler-assisted memory exclusion for fast checkpointing." In IEEE Technical Committee on Operating Systems Newsletter, Special Issue on Fault Tolerance, pp. 62—67, Dec. 1995. [ps]
J. S. Plank, Y. Chen, K. Li, M. Beck and G. Kingsley. "Memory exclusion: Optimizing the performance of checkpointing systems." Technical Report UT-CS-96-335, University of Tennessee, Aug. 1996. [ps]
A. Ziv and J. Bruck. "An on-line algorithm for checkpoint placement." In IEEE Transactions on Computers, 46(9):976—985, Sep. 1997. [ps]

Other stuff

A.C. Klaiber and H.M. Levy. "Crash recovery for scientific applications." In Proceedings of the International Conference on Parallel and Distributed Systems, 1993. [???]
S. Rangarajan, S. Garg and Y. Huang. "Checkpoints-on-demand with active replication." In Proceedings of the Seventeenth Symposium on Reliable Distributed Systems, pp. 75—83, Oct. 1998. [href]
L.M. Silva and J.G. Silva. "An experimental study about diskless checkpointing." In Proceedings of the 24th EUROMICRO Conference, pp. 395-402, Aug. 1998. [pdf]
L.M. Silva and J.G. Silva. "System-level versus user-defined checkpointing." In Proceedings of the Seventeenth Symposium on Reliable Distributed Systems, pp. 68—74, Oct. 1998. [pdf]
L.M. Silva, J.G. Silva, S. Chapple and L. Clarke. "Portable checkpointing and recovery." In Proceedings of the 4th International Symposium on High-Performance Distributed Computing, HPDC-4, pp. 188-195, Aug. 1995. [???]
R. E. Strom, D. F. Bacon and S. A. Yemini. "Volatile logging in n-fault-tolerant distributed systems." In Proceedings of the Eighteenth International Symposium on Fault-Tolerant Computing Systems, pp. 44—49, 1988. [???]
K. Tanaka, H. Higaki and M. Takizawa. "Object-based checkpoints in distributed systems." In Computer Systems Science & Engineering, 13(3):179-185, May 1998. [ps]
P. Tullmann, J. Lepreau, B. Ford and M. Hibler. "User-level checkpointing through exportable kernel state." In Proceedings of the Fifth International Workshop on Object-Orientation in Operating Systems, pp. 85—88, Oct. 1996. [href]
Y. M. Wang, E. Chung, Y. Huang, and E.N. Elnozahy. "Integrating checkpointing with transaction processing." In Proceedings of the Twenty Seventh International Symposium on Fault-Tolerant Computing (FTCS-27), pp. 304-308, Jun. 1997. [ps]

Reversible computations:

K. Perumalla and R. Fujimoto. "Source-code transformations for efficient reversibility." Technical Report GIT-CC-99-21, College of Computing, Georgia Tech, Sep. 1999. [ps]
C. Carothers, K. Perumalla and R. Fujimoto. "Efficient optimistic parallel simulations using reverse computation." Best Paper, ACM/IEEE Workshop on Parallel and Distributed Simulation, 1999. [ps]

Systems:

NetSolve and Globus

Muller:

G. Muller, M. Banâtre, N. Peyrouze and B. Rochat. "Lessons from FTM: An experiment in design and implementation of a low-cost fault-tolerant system." In IEEE Transactions on Reliability, 45(2):332-340, Jun. 1996. [ps]

Seti@Home

[here]