CS5414 (Fall 2012) Programming Assignment Phase IV
A Fault-tolerant Failure Detector


Due: 11pm, December 4, 2012.


This phase is optional, and it is non-trivial in scope and difficulty. Submitting a solution to Phase IV could therefore lower your course grade (if your solution does not receive a good enough grade relative to your average on Phases I - III).

General Instructions. Work in a group comprising 1 to 3 students. All members of the group are responsible for understanding the entire assignment and submitted solution.

You may work in the team you used for some previous phase, or you may form a new team.

This assignment builds on work done in Phase III. Feel free to use any team's solution to Phase III as the basis for your solution to Phase IV.

No late assignments will be accepted.

Academic Integrity. Collaboration between groups is prohibited and will be treated as a violation of the University's academic integrity code.

Failure Detection as a Service

Phase III used a centralized, fault-tolerant, perfect failure detector. We now build something more realistic. In particular, if a system satisfies the synchronous model of computation and is restricted to crash failures, then a failure detector can be implemented by various means.

What to Build. Replace the oracle of Phase III by one or more failure detection services. Each failure detection service employs state machine replication and comprises a set of failure detection servers, where no two servers in the same service execute on the same processor. The clients for this service are:

Notice that the above architecture does not specify:

A clever design will leverage fate-sharing by co-locating failure detection servers, failure detection sensors, and branch servers in a way that brings benefits (or at least avoids complications).

The computing environment for Phase IV is characterized as follows:

One challenge in this phase is to define the criterion used by the failure detection sensors. Some possibilities include basing the decision on timeouts and/or TCP connection closings, but you are free to define and incorporate others (in isolation or in combination). Whatever criterion you select is unlikely to be perfect, so either it will suspect processors that haven't failed or it will not suspect processors that have failed. The rest of your system must "work correctly" despite such false suspicions.
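
As an illustration only, here is a minimal sketch of one possible timeout-based criterion. It is not a required or recommended design. The class name TimeoutSensor, its methods, and the 2-second timeout are all hypothetical, and the sketch assumes that monitored processors send periodic heartbeat messages that some other part of your system delivers to the sensor. The clock is passed in as a parameter so that the behavior can be exercised without real delays.

```python
import time
from typing import Callable, Dict, Set


class TimeoutSensor:
    """Hypothetical sensor: suspects a processor when no heartbeat
    has been recorded for more than `timeout` seconds."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.clock = clock
        self.last_heard: Dict[str, float] = {}

    def monitor(self, processor: str) -> None:
        # Start timing now, so a processor that never reports
        # is eventually suspected.
        self.last_heard.setdefault(processor, self.clock())

    def heartbeat(self, processor: str) -> None:
        # Record that a heartbeat from `processor` just arrived.
        self.last_heard[processor] = self.clock()

    def suspects(self) -> Set[str]:
        # A suspicion is only a guess: a slow processor looks the
        # same as a crashed one until its next heartbeat arrives.
        now = self.clock()
        return {p for p, t in self.last_heard.items()
                if now - t > self.timeout}


# Usage with a fake clock (assumed values, for illustration).
now = [0.0]
sensor = TimeoutSensor(timeout=2.0, clock=lambda: now[0])
sensor.monitor("p1")
sensor.monitor("p2")
now[0] = 1.5
sensor.heartbeat("p1")
now[0] = 3.0
print(sorted(sensor.suspects()))  # ['p2']
sensor.heartbeat("p2")            # p2 was merely slow
print(sorted(sensor.suspects()))  # []
```

The last two lines show a false suspicion being withdrawn: p2 is suspected at time 3.0 and then cleared when its late heartbeat arrives. Any consumer of the sensor's output has to tolerate this kind of reversal.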

Extra Credit. Design and implement a way for

  1. failed failure-detection servers and/or failure-detection sensors to be identified and removed from the failure detection service, and
  2. repaired failure-detection servers and/or failure-detection sensors to be added into a failure detection service.
Note. As has been the case throughout the semester, extra credit will not affect your grade on this project. It is offered for you to be challenged and for you to impress the instructors with your prowess and mastery of course material.

Submission Procedure

All submissions should be made through CMS. CMS provides a way for you to define your group. Be advised that each group member must take an action in creating a group, and your group cannot submit anything through CMS until the group has been created.

Submit the following files (at least):

TEAM a .txt file that contains the names (and net-ids) for all team members. Also, for each team member, give a 1 or 2 paragraph description of the tasks this team member performed and the number of hours this required.

README a .txt file that contains

LOGIC a .pdf file that contains

LAUNCH.cmd which is a shell script that the grader can run in order to start the various components needed by your bank. The grader will then try some selected test cases on your running system to determine if it seems to be working as it should.

topology_file.txt should specify an interesting interconnection topology for a multi-branch bank that will be used to illustrate the operation of your system.

SourceCode A zip file containing the sources needed to compile and test your system.

Grading

Your grade will be based on the above documentation and the following elements: