Fall 2012: CS5410 Fault-tolerant Distributed Computer Systems

CS5414 (Fall 2012) Programming Assignment Phase IV
A Fault-tolerant Failure Detector

Due: 11pm, December 4, 2012.

This phase is optional and also non-trivial in scope and difficulty.

If you do not submit a solution for Phase IV, then we will compute the average (using relative weights) of the grades you received in Phases I - III, and you will receive that average as your grade for Phase IV.
If you do submit a solution for Phase IV, then we will grade it and use that grade as your grade for Phase IV.

So, submitting a solution to Phase IV could lower your course grade (if your solution does not receive a good enough grade relative to your average on Phases I - III).

General Instructions. Work in a group comprising 1 to 3 students. All members of the group are responsible for understanding the entire assignment and submitted solution.

You may work in the team you used for some previous phase, or you may form a new team.

This assignment builds on work done in Phase III. Feel free to use any team's solution to Phase III as the basis for your solution to Phase IV.

No late assignments will be accepted.

Academic Integrity. Collaboration between groups is prohibited and will be treated as a violation of the University's academic integrity code.

Failure Detection as a Service

Phase III used a centralized, fault-tolerant, perfect failure detector. We now build something more realistic. In particular, if a system satisfies the synchronous model of computation and is restricted to crash failures, then a failure detector can employ various means to detect those failures.

What to Build. Replace the oracle of Phase III by one or more failure detection services. Each failure detection service employs state machine replication and comprises a set of failure detection servers, where no two servers in the same service execute on the same processor. The clients for this service are:

A set of failure detection sensors that you implement. A failure detection sensor submits a request to a failure detection service when that sensor suspects a processor of failing.
The branch servers and branch GUI.
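
As an illustration of the state machine approach, here is a minimal Python sketch of the state kept by one failure detection server. It assumes the agreement and order protocol (not shown) has already delivered the same request log to every replica. The names SuspectRequest, FailureDetectorReplica, and replay, and the quorum policy, are hypothetical choices made for this sketch; they are not part of the assignment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SuspectRequest:
    """Hypothetical request: 'sensor' reports that it suspects 'processor'."""
    sensor: str
    processor: str


@dataclass
class FailureDetectorReplica:
    """State of one replica; a deterministic function of the ordered request log."""
    quorum: int  # assumed policy: distinct sensors required before declaring a failure
    suspicions: dict = field(default_factory=dict)  # processor -> set of sensors
    failed: list = field(default_factory=list)      # processors declared failed, in order

    def apply(self, req: SuspectRequest) -> None:
        sensors = self.suspicions.setdefault(req.processor, set())
        sensors.add(req.sensor)  # a repeated report from the same sensor has no effect
        if len(sensors) >= self.quorum and req.processor not in self.failed:
            self.failed.append(req.processor)


def replay(log, quorum):
    """Apply an agreed-upon log, in order, to a fresh replica."""
    replica = FailureDetectorReplica(quorum=quorum)
    for req in log:
        replica.apply(req)
    return replica


log = [
    SuspectRequest("sensor1", "p3"),
    SuspectRequest("sensor1", "p3"),  # duplicate report
    SuspectRequest("sensor2", "p3"),  # second distinct sensor reaches the quorum
    SuspectRequest("sensor2", "p4"),
]
replica_a = replay(log, quorum=2)
replica_b = replay(log, quorum=2)
print(replica_a.failed, replica_a.failed == replica_b.failed)
```

Because apply is deterministic, any two replicas that process the same log in the same order reach the same state, which is the property state machine replication relies on.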

Notice that the above architecture does not specify:

Which failure detection service is responsible for each processor. (By assumption, the processor that runs a branch GUI never fails, so the failure detection service can ignore it. For other processors, avoid complexity by having a fixed, static mapping from processors to the failure detection services.)
Whether a given failure detection sensor is the client of one or of multiple failure detection services.
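
One way to keep a fixed, static mapping honest is to check it at startup. The Python sketch below is only an illustration: the processor, server, and service names are invented, and the checks cover just two rules stated above (no two servers of the same service on the same processor, and no monitoring of a processor that runs a branch GUI).

```python
# Hypothetical layout; every name here is illustrative.
SERVICES = {
    "fd_service_A": {"fd_A1": "proc1", "fd_A2": "proc2", "fd_A3": "proc3"},
    "fd_service_B": {"fd_B1": "proc4", "fd_B2": "proc5", "fd_B3": "proc6"},
}

# Static mapping from each monitored processor to the service responsible for it.
MONITORED_BY = {
    "proc1": "fd_service_B", "proc2": "fd_service_B", "proc3": "fd_service_B",
    "proc4": "fd_service_A", "proc5": "fd_service_A", "proc6": "fd_service_A",
}

GUI_PROCESSORS = {"proc_gui"}  # assumed never to fail, so never monitored


def validate(services, monitored_by, gui_processors):
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    for service, servers in services.items():
        hosts = list(servers.values())
        if len(hosts) != len(set(hosts)):
            problems.append(f"{service}: two servers share a processor")
    for processor, service in monitored_by.items():
        if service not in services:
            problems.append(f"{processor}: unknown service {service}")
        if processor in gui_processors:
            problems.append(f"{processor}: GUI processor needs no monitoring")
    return problems


print(validate(SERVICES, MONITORED_BY, GUI_PROCESSORS))
```

In this invented layout each service monitors the processors that host the other service, which is just one possible choice; the assignment leaves the mapping to you.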

A clever design will leverage fate-sharing by co-locating failure detection servers, failure detection sensors, and branch servers in a way that brings benefits (or at least avoids complications).

The computing environment for Phase IV is characterized as follows:

The system satisfies the assumptions of the synchronous model of distributed computing.
Processors exhibit crash failures (but not Byzantine failures).
No processor running a branch GUI ever fails (but such a processor may run only a branch GUI and nothing else).
Each communication channel is bidirectional and supports TCP/IP and UDP---choose the protocol that makes the most sense.

One challenge in this phase is to define the criterion used by the failure detection sensors. Some possibilities include basing the decision on timeouts and/or TCP connection closings, but you are free to define and incorporate others (in isolation or in combination). Whatever criterion you select is unlikely to be perfect, so either it will suspect processors that haven't failed or it will not suspect processors that have failed. The rest of your system must "work correctly" despite such false suspicions.
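
To make the timeout possibility concrete, here is a minimal Python sketch of a timeout-based criterion. It is one option among many, not a required design. The class TimeoutSensor, its method names, and the 3-second threshold are assumptions made for illustration. Timestamps are passed in explicitly so that the behavior is easy to follow.

```python
class TimeoutSensor:
    """Suspects a processor when no heartbeat has arrived within timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heard = {}  # processor -> time of the most recent heartbeat

    def heard_from(self, processor, at_s):
        """Record a heartbeat from 'processor' received at time at_s."""
        self.last_heard[processor] = at_s

    def suspected(self, at_s):
        """Return the processors that have been silent for longer than the timeout."""
        return sorted(
            p for p, t in self.last_heard.items() if at_s - t > self.timeout_s
        )


sensor = TimeoutSensor(timeout_s=3.0)  # hypothetical threshold
sensor.heard_from("p1", at_s=0.0)
sensor.heard_from("p2", at_s=0.0)
sensor.heard_from("p1", at_s=2.0)

at_4 = sensor.suspected(at_s=4.0)  # p2 has been silent for 4.0 s
sensor.heard_from("p2", at_s=4.5)  # a late heartbeat: the suspicion was false
at_5 = sensor.suspected(at_s=5.0)

print(at_4, at_5)
```

Here p2 is suspected at time 4.0 even though it has not crashed, and the late heartbeat at 4.5 clears the suspicion. A real sensor would likely read timestamps from a monotonic clock such as time.monotonic(), and the rest of the system would still have to cope with the window in which p2 was wrongly suspected.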

Extra Credit. Design and implement a way for

failed failure-detection servers and/or failure-detection sensors to be identified and removed from the failure detection service, and
repaired failure-detection servers and/or failure-detection sensors to be added into a failure detection service.

Note. As has been the case throughout the semester, extra credit will not affect your grade on this project. It is offered for you to be challenged and for you to impress the instructors with your prowess and mastery of course material.

Submission Procedure

All submissions should be made through CMS. CMS provides a way for you to define your group. Be advised that each group member must take an action in creating a group, and your group cannot submit anything through CMS until the group has been created.

Submit the following files (at least):

TEAM a .txt file that contains the names (and net-ids) for all team members. Also, for each team member, give a 1 or 2 paragraph description of the tasks this team member performed and the number of hours this required.

README a .txt file that contains

The names and a description of the contents for the other files in the directory.
Instructions for installing, compiling, and running your software on our Windows system.
A tutorial that the grader can follow to start your software and to convince himself that your system implements the required functionality. Expect the grader to spend at most 10 minutes on this task. Include instructions for instigating processor failures (real and erroneous) so that the grader can observe that your system responds to these events in a reasonable way.

LOGIC a .pdf file that contains

A description of the criterion used by the failure detection sensors.
A rationale to support this choice of criterion. Be sure to include:
- An explanation of conditions that could lead to a failure detection sensor producing erroneous conclusions.
- If the criterion involves some form of timing, then you should explain how you decided on any parameters that lead to the threshold(s) selected. If this threshold was selected based on experiments, then describe the experiments and include the data (presented in some easily understood form); if this threshold was selected based on deductions, then give that analysis.
A rationale for the following mappings used by your architecture:
- The mapping from failure detection sensors to processors whose failure is being detected.
- The mapping from failure detection sensors to failure detection services.
- The mapping of failure detection servers to processors, with comments about any fate-sharing and other intended consequences of co-residences.
A specification of your definition for "work correctly" (in the presence of crash failures). Justify your definition by explaining why it cannot be made stronger.
An explanation of why your system will "work correctly" even if one or more failure detection sensors suspect a processor that, in fact, has not failed.
An explanation of any assumptions about topology and configuration your system requires. Topology here refers to interconnection links; configuration here refers to the mapping from branch servers, failure detection servers, and failure detection sensors to processors.

LAUNCH.cmd which is a shell script that the grader can run in order to start the various components needed by your bank. The grader will then try some selected test cases on your running system to determine if it seems to be working as it should.

topology_file.txt should specify an interesting interconnection topology for a multi-branch bank that will be used to illustrate the operation of your system.

SourceCode A zip file containing the sources needed to compile and test your system.

Grading

Your grade will be based on the above documentation and the following elements:

Does your failure detection service implement state machine replication and employ reasonable protocols for agreement and order?
Is the failure detection criterion sensible and justified?
Are sensible mappings (failure detection sensors to processors, failure detection sensors to failure detection services, failure detection servers to processors) employed and co-residence exploited?
Is the specification of "work correctly" sensible and justified?
What is the impact of bogus failure detections on system operation?
How easy is it to follow the README file installation and sample-execution script?
Is the source code easy to understand and does it exhibit good structure?
Does the system run?
