CS 717: Programming for Fault-tolerance

Instructors:

Time:

Tuesdays & Thursdays, 1:25 - 2:40 PM, 307 Phillips

Prerequisites:

Some experience with parallel programming is desirable but not essential.

Course material:

The lecture schedule is here.

The list of papers will be placed here.

Course project descriptions are here.

The final project reports are here.

Course content:

Computational power on the scale of teraflops is now available to most programmers, thanks to fast processors and large-scale parallelism. Nevertheless, many application programs must run for days or months, or even continuously, to accomplish their goals. Such programs must not only exploit the performance potential of parallel hardware but also be resilient to hardware and software failures. Although there is a large body of work on fault-tolerant software, most of it has been done in the context of distributed systems, where one assumes that nothing is known about the application program.

In this course, the problem of making software resilient to hardware faults will be studied from a programming languages perspective. What kinds of hardware faults should software be resilient to? What does it take to make software resilient to these faults? What difference does it make if we can analyze programs before they are run? Can we generate customized protocols that are more efficient than the general-purpose protocols in the distributed computing literature? How much of this can be automated? We will address these and related questions with the goal of identifying problems for thesis research.

Course work:

Students will be expected to read and present papers from the literature, and to do a substantial final project.

Topics covered in the course:

  1. Basics
  2. Check-pointing
  3. Message-logging
  4. Byzantine failures
  5. Replay/reversible computing
  6. Systems case studies