CS 717: Programming for Fault-tolerance

Instructors:

Keshav Pingali (pingali@cs.cornell.edu) Rhodes 457
Paul Stodghill (stodghil@cs.cornell.edu) Rhodes 496

Time:

Tuesdays & Thursdays, 1:25 - 2:40 PM, 307 Phillips

Prerequisites:

Some experience with parallel programming is desirable but not essential.

Course material:

The lecture schedule is here.

The list of papers will be placed here.

Course project descriptions are here.

The final project reports are here.

Course content:

Computational power on the scale of teraflops is now available to most programmers, thanks to fast processors and large-scale parallelism. Nevertheless, many application programs must run for days or months, or even continuously, to accomplish their goals. Such programs must not only exploit the performance potential of parallel hardware but also be resilient to hardware and software failures. Although there is a large body of work on fault-tolerant software, most of it has been done in the context of distributed systems, where one assumes that nothing is known about the application program.

In this course, the problem of making software resilient to hardware faults will be studied from a programming languages perspective. What kinds of hardware faults should software be resilient to? What does it take to make software resilient to these faults? What difference does it make if we can analyze programs before they are run? Can we generate customized protocols that are more efficient than the general-purpose protocols in the distributed computing literature? How much of this can be automated? We will address these and related questions with the goal of identifying problems for thesis research.

Course work:

Students will be expected to read and present papers from the literature, and to do a substantial final project.

Topics covered in the course:

Basics:
- Distributed-memory model, MPI
- Shared-memory model, OpenMP
- Event-ordering and causality in distributed systems
- Failure models: fail-stop and Byzantine failures
Check-pointing:
- Uniprocessor protocols: incremental check-pointing, compression
- Un-coordinated parallel check-pointing: protocols, roll-back problem, garbage collection of saved states
- Blocking, co-ordinated parallel check-pointing: application-level and system-level protocols
- Non-blocking, co-ordinated parallel check-pointing: protocols for taking distributed snapshots
- Program analysis for optimizing check-pointing protocols
Message-logging:
- Optimistic logging
- Pessimistic logging
- Causal logging
- Program analysis for optimizing message-logging protocols
Byzantine failures:
- Self-checking programs
- Automatic generation of self-checking numerical programs
Replay/reversible computing:
- Debugging distributed-memory programs: logging and replay
- Reversible programs: program transformations for making programs reversible
Systems case studies:
- NetSolve
- Globus
- SETI@home
- Apache JServ