Basics
  - E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. "A survey of
    rollback-recovery protocols in message-passing systems." Technical Report
    CMU-CS-96-181, Carnegie Mellon University, October 1996. [pdf]
 
  - L. Lamport. "Time, clocks, and the ordering of events in a
    distributed system." Communications of the ACM, 21(7):558-565,
    Jul. 1978. [pdf]
 
  - J. Mellor-Crummey and T. LeBlanc. "A software instruction
    counter." In Proceedings of the 3rd Symposium on Architectural
    Support for Programming Languages and Operating Systems, pp. 78-86,
    Apr. 1989. [pdf]
 
  - R.D. Schlichting and F.B. Schneider. "Fail-Stop processors: An
    approach to designing fault-tolerant computing systems." ACM
    Transactions on Computer Systems, vol. 1(3): 222-238, Aug. 1983. [pdf]
 
Coordinated checkpointing
  blocking
  
    - J.S. Plank, M. Beck, G. Kingsley and K. Li. "Libckpt: Transparent
      checkpointing under UNIX." In Proceedings of the USENIX
      Winter 1995 Technical Conference, pp. 213-223, Jan. 1995. [pdf]
 
    - J. S. Plank, Jian Xu, R.B. Netzer, "Compressed differences: An
      algorithm for fast incremental checkpointing." Technical Report
      CS-95-302, University of Tennessee at Knoxville, Aug. 1995. [ps]
 
    - A. Beguelin, E. Seligman and P. Stephan. "Application level fault
      tolerance in heterogeneous networks of workstations." In Journal of
      Parallel & Distributed Computing, 43(2):147-155, Jun. 1997. [ps]
 
    - E. Seligman and A. Beguelin. "High-level fault tolerance in
      distributed programs." Technical Report CMU-CS-94-223, Department of
      Computer Science, Carnegie Mellon University, Dec. 1994. [ps]
 
  
  non-blocking
  
    - O. Babaoglu and K. Marzullo. "Consistent global states of
      distributed systems: Fundamental concepts and mechanisms." Distributed
      Systems, Ed. S. Mullender, Addison-Wesley, pp. 55-96,
      1993. [ps]
 
    - M. Chandy and L. Lamport. "Distributed snapshots: Determining
      global states of distributed systems." In ACM Transactions
      on Computer Systems, 3(1):63-75, Feb. 1985. [pdf]
 
    - K. Li, J.F. Naughton and J.S. Plank. "Real-time, concurrent
      checkpoint for parallel programs." In Proceedings of the 1990 Conference
      on the Principles and Practice of Parallel Programming, pp. 79-88,
      Mar. 1990. [pdf]
 
  
Uncoordinated checkpointing
  - Y. M. Wang. "Space reclamation for uncoordinated checkpointing in
    message-passing systems." Ph.D. Thesis, University of Illinois
    Urbana-Champaign, Aug. 1993. [???]
 
  - Y. M. Wang, P. Y. Chung, I. J. Lin and W. K. Fuchs. "Checkpoint space
    reclamation for uncoordinated checkpointing in message-passing
    systems." In IEEE Transactions on Parallel and Distributed Systems,
    6(5):546-554, May 1995. [???]
 
  - Y. M. Wang, P. Y. Chung, and W. K. Fuchs, "Tight upper bound on
    useful distributed system checkpoints," Tech. Rep. CRHC-95-16,
    Coordinated Science Laboratory, University of Illinois at Urbana-Champaign,
    1995. [ps]
 
Message logging
  survey: Alvisi/Marzullo
  
    - L. Alvisi and K. Marzullo. "Message logging: Pessimistic,
      optimistic, causal and optimal." In IEEE Transactions on Software
      Engineering, 24(2):149-159, Feb. 1998. [ps]
 
  
  optimistic: 
  
    - R. Strom and S. Yemini. "Optimistic recovery in distributed
      systems." ACM Transactions on Computer Systems, 3(3): 204-226,
      Aug. 1985. [pdf]
 
    - D.B. Johnson and W. Zwaenepoel. "Recovery in distributed systems
      using optimistic message logging and checkpointing." In Proceedings
      of the Sixth Annual ACM Symposium on Principles of Distributed Computing
      Systems, PODC-88, pp. 171-181, Aug. 1988. [pdf]
 
    - D. B. Johnson and W. Zwaenepoel. "Transparent optimistic rollback
      recovery." In Operating Systems Review, pp. 99-102, Apr.
      1991. [ps]
 
  
  sender-based logging: 
  
    - D.B. Johnson and W. Zwaenepoel. "Sender-based message
      logging." In Proceedings of the Seventeenth International Symposium
      on Fault-Tolerant Computing (FTCS-17), pp. 14-19, Jun. 1987. [ps]
 
    - J. Xu, R.B. Netzer, and M. Mackey. "Sender-based message logging
      for reducing rollback propagation." In Proceedings of the
      Seventh IEEE Symposium on Parallel and Distributed Processing, pp. 602-609,
      1995. [???]
 
  
  causal logging:
  
    Manetho: 
    
      - E.N. Elnozahy. "Manetho: Fault tolerance in distributed systems
        using rollback-recovery and process replication." Ph.D. Thesis,
        Rice University, Oct. 1993. Also available as Technical Report 93-212,
        Department of Computer Science, Rice University. [ps]
 
      - E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel. "The performance
        of consistent checkpointing." In Proceedings of the Eleventh
        Symposium on Reliable Distributed Systems, pp. 39-47, Oct.
        1992. [ps]
 
      - E.N. Elnozahy and W. Zwaenepoel. "On the use and implementation
        of message logging." In Proceedings of the Twenty-Fourth
        International Symposium on Fault-Tolerant Computing (FTCS-24), pp.
        298-307, Jun. 1994. [ps]
 
      - E.N. Elnozahy and W. Zwaenepoel. "Manetho, transparent
        rollback-recovery with low overhead, limited rollback and fast output
        commit." In IEEE Transactions on Computers, Special Issue on
        Fault-Tolerant Computing, 41(5):526-531, May 1992. [ps]
 
    
    FBL: Alvisi and Marzullo
    
      - L. Alvisi and K. Marzullo. "Trade-offs in implementing causal
        message logging protocols." In Proceedings of the 1996 ACM
        SIGACT-SIGOPS Symposium on Principles of Distributed Computing Systems (PODC),
        pp. 58-67, 1996. [pdf]
 
    
  
Byzantine failures
  Tennessee: 
  
    - Y. Kim, J.S. Plank and J.J. Dongarra. "Fault-tolerant matrix
      operations using checksum and reverse computation." In Proceedings
      of 6th Symposium on the Frontiers of Massively Parallel Computation, Oct.
      1996. [ps]
 
    - Y. Kim, J.S. Plank and J.J. Dongarra. "Fault-tolerant matrix
      operations for network of workstations using multiple checkpointing."
      In Proceedings of HPC Asia '97, High Performance Computing in
      the Information Superhighway, pp. 460-465, Apr. 1997. [ps]
 
    - J. S. Plank, K. Li, and M.A. Puening. "Diskless
      checkpointing." IEEE Transactions on Parallel & Distributed
      Systems, 9(10):972-986, Oct. 1998. [ps]
 
    - J. S. Plank, Y. Kim and J.J. Dongarra. "Algorithm-based diskless
      checkpointing for fault-tolerant matrix computations." In Proceedings
      of the Twenty Fifth International Symposium on Fault-Tolerant Computing
      Systems, pp. 351-360, Jun. 1995. [ps]
 
    - J. S. Plank, Y. Kim and J. J. Dongarra. "Fault-tolerant matrix
      operations for networks of workstations using diskless
      checkpointing." In Journal of Parallel & Distributed Computing,
      43(2):125-138, Jun. 1997. [ps]
 
  
  Prith Banerjee 
  
    - Prithviraj Banerjee, Vijay Balasubramanian, and Amber Roy-Chowdhury.
      "Compiler Assisted Synthesis of Algorithm-Based Checking in
      Multiprocessors." To appear in Foundations of Dependable
      Computing: Vol III. System Implementation. Gary Koob, editor, Kluwer
      Academic Publishers. [ps]
 
  
  self-checking programs
Replay and debugging
  Netzer et al:
  
    - R.B. Netzer and B.P. Miller. "Optimal tracing and replay for
      debugging message-passing parallel programs." In Proceedings of
      Supercomputing '92, pp. 502-511, Nov. 1992. [ps]
 
    - R.B. Netzer and J. Xu. "Adaptive message logging for incremental
      program replay." In IEEE Parallel and Distributed Technology,
      1(4):32-39, Nov. 1993. [???]
 
    - R.B. Netzer and J. Xu. "Replaying distributed programs without
      message logging." In Proceedings of the Sixth IEEE International
      Symposium on High Performance Distributed Computing (HPDC), pp. 137-147,
      Aug. 1997. [???]
 
  
  Objective CAML
  
Shared memory
  - N. Neves, M. Castro and P. Guedes. "A checkpoint protocol for an
    entry consistent shared memory system." In Proceedings of the 13th
    ACM Symposium on Principles of Distributed Computing, Aug. 1994. [ps]
 
  - L.M. Silva, J.G. Silva and S. Chapple. "Portable transparent
    checkpointing for distributed shared memory." In Proceedings of the
    Fifth IEEE International Symposium on High Performance Distributed Computing,
    HPDC-5, pp. 422-431, Aug. 1996. [href]
 
MPI
  - G. Stellner. "CoCheck: Checkpointing and process migration for MPI."
    In Proceedings of the 10th International Parallel Processing
    Symposium, Apr. 1996. [ps]
 
  - W-J. Li and J-J. Tsay. "Checkpointing message-passing interface
    (MPI) parallel programs." In Proceedings of the Pacific Rim
    International Symposium on Fault-Tolerant Systems, pp. 147-152, 1997.
    [href]
 
Compiling
  - G. Barigazzi and L. Strigini. "Application-transparent setting of
    recovery points." In Proceedings of the Thirteenth International
    Symposium on Fault-Tolerant Computing Systems, FTCS-13, pp. 48-55,
    1983. [???]
 
  - M. Beck, J. S. Plank and G. Kingsley. "Compiler-assisted
    checkpointing." Technical Report CS-94-269, Department of Computer
    Science, University of Tennessee at Knoxville, Dec. 1994. [ps]
 
  - P. L'Ecuyer and J. Malenfant. "Computing optimal checkpointing
    strategies for rollback and recovery systems." In IEEE Transactions
    on Computers, vol. 37, pp. 491-496, Apr. 1988. [???]
 
  - C.M. Krishna, K.G. Shin and Y.H. Lee. "Optimization criteria for
    checkpoint placement." In Communications of the ACM, 27(10):1008-1012,
    Oct. 1984. [pdf]
 
  - J. Long, W.K. Fuchs and J.A. Abraham. "Compiler-assisted static
    checkpoint insertion." In Proceedings of the Twenty Second
    Annual International Symposium on Fault-Tolerant Computing, FTCS-22,
    pp. 58-65, Jul. 1992. [???]
 
  - D. Manivannan, R. H. Netzer, and M. Singhal. "Finding consistent
    global checkpoints in a distributed computation." In IEEE
    Transactions on Parallel & Distributed Systems, 8(6):623-627, Jun.
    1997. [ps]
 
  - J. S. Plank, M. Beck and G. Kingsley. "Compiler-assisted memory
    exclusion for fast checkpointing." In IEEE Technical Committee
    on Operating Systems Newsletter, Special Issue on Fault Tolerance, pp.
    62-67, Dec. 1995. [ps]
 
  - J. S. Plank, Y. Chen, K. Li, M. Beck and G. Kingsley. "Memory
    exclusion: Optimizing the performance of checkpointing systems."
    Technical Report UT-CS-96-335, University of Tennessee, Aug. 1996. [ps]
 
  - A. Ziv and J. Bruck. "An on-line algorithm for checkpoint
    placement." In IEEE Transactions on Computers, 46(9):976-985,
    Sep. 1997. [ps]
 
Other stuff
  - A.C. Klaiber and H.M. Levy. "Crash recovery for scientific
    applications." In Proceedings of the International Conference on
    Parallel and Distributed Systems, 1993. [???]
 
  - S. Rangarajan, S. Garg and Y. Huang. "Checkpoints-on-demand with
    active replication." In Proceedings of the Seventeenth
    Symposium on Reliable Distributed Systems, pp. 75-83, Oct. 1998. [href]
 
  - L.M. Silva and J.G. Silva. "An experimental study about diskless
    checkpointing." In Proceedings of the 24th EUROMICRO
    Conference, pp. 395-402, Aug. 1998. [pdf]
 
  - L.M. Silva and J.G. Silva. "System-level versus user-defined
    checkpointing." In Proceedings of the Seventeenth Symposium
    on Reliable Distributed Systems, pp. 68-74, Oct. 1998. [pdf]
 
  - L.M. Silva, J.G. Silva, S. Chapple and L. Clarke. "Portable
    checkpointing and recovery." In Proceedings of the 4th International
    Symposium on High-Performance Distributed Computing, HPDC-4, pp. 188-195,
    Aug. 1995. [???]
 
  - R. E. Strom, D. F. Bacon and S. A. Yemini. "Volatile logging in
    n-fault-tolerant distributed systems." In Proceedings of the
    Eighteenth International Symposium on Fault-Tolerant Computing Systems,
    pp. 44-49, 1988. [???]
 
  - K. Tanaka, H. Higaki and M. Takizawa. "Object-based checkpoints in
    distributed systems." In Computer Systems Science &
    Engineering, 13(3):179-185, May 1998. [ps]
 
  - P. Tullmann, J. Lepreau, B. Ford and M. Hibler. "User-level
    checkpointing through exportable kernel state." In Proceedings of
    the Fifth International Workshop on Object-Orientation in Operating Systems,
    pp. 85-88, Oct. 1996. [href]
 
  - Y. M. Wang, E. Chung, Y. Huang, and E.N. Elnozahy. "Integrating
    checkpointing with transaction processing." In Proceedings of the
    Twenty Seventh International Symposium on Fault-Tolerant Computing (FTCS-27),
    pp. 304-308, Jun. 1997. [ps]
 
Reversible computations:
  - "Source-code Transformations for Efficient Reversibility",
    Kalyan Perumalla, Richard Fujimoto, Technical report GIT-CC-99-21, College
    of Computing, Georgia Tech, September 1999. [ps]
 
  - "Efficient Optimistic Parallel Simulations Using Reverse
    Computation", Christopher Carothers, Kalyan Perumalla and Richard
    Fujimoto, Best Paper, ACM/IEEE Workshop on Parallel and Distributed
    Simulation, 1999. [ps]
 
Systems:
  NetSolve and Globus
  Muller:
  
    - G. Muller, M. Banâtre, N. Peyrouze and B. Rochat. "Lessons from
      FTM: an experiment in design and implementation of a low-cost
      fault-tolerant system." In IEEE Transactions on Reliability,
      45(2):332-340, Jun. 1996. [ps]
 
  
  Seti@Home