CS5414 (Fall 2012) Programming Assignment Phase IV
A Fault-tolerant Failure Detector
Due: 11pm, December 4, 2012.
This phase is optional and also non-trivial in scope and difficulty.
- If you do not submit a solution for Phase IV then we will compute the average
(using relative weights) of the grades you received in Phases I - III.
And you will receive that average as your grade for Phase IV.
-
If you do submit a solution for Phase IV,
then we will grade it and use that grade as your
grade for Phase IV.
So, submitting a solution to Phase IV could cause your course grade to be decreased
(if your solution does not receive a good enough grade relative to
your average on phases I - III).
General Instructions.
Work in a group comprising 1 to 3 students.
All members of the group are responsible for understanding the entire
assignment and submitted solution.
You may work in the team you used for some previous phase, or you may form a new team.
This assignment builds on work done in Phase III.
Feel free to use any team's solution to Phase III as the basis for your
solution to Phase IV.
No late assignments will be accepted.
Academic Integrity. Collaboration between groups is
prohibited and will be treated as a violation of the University's
academic integrity code.
Failure Detection as a Service
Phase III used a centralized, fault-tolerant, perfect failure detector.
We now build something more realistic.
In particular, if a system satisfies
the synchronous model of computation and is restricted to crash failures, then
a failure detector can employ various means to detect those failures.
What to Build.
Replace the oracle of Phase III by one or more failure detection services.
Each failure detection service employs state machine replication and
comprises a set of failure detection servers,
where no two servers in the same service execute on the same processor.
The clients for this service are:
-
A set of failure detection sensors that you implement.
A failure detection sensor submits a request to a failure detection
service when that sensor suspects a processor of failing.
-
The branch servers and branch GUI.
Notice that the above architecture does not specify:
-
Which failure detection service is responsible for each processor.
(By assumption,
the processor that runs a branch GUI never fails, so the failure detection service
can ignore it.
For other processors,
avoid complexity by having a fixed, static mapping from
processors to the failure detection services.)
-
Whether a given failure detection sensor is the client of one or of multiple
failure detection services.
A clever design will leverage fate-sharing
by co-locating failure detection servers,
failure detection sensors, and branch servers in a way that brings benefits
(or at least avoids complications).
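As an illustration only, the following minimal sketch shows one way a failure detection server could be written as a deterministic state machine that consults a fixed, static mapping from processors to services. Everything named here is an assumption made for the example and not part of the assignment: the mapping SERVICE_FOR_PROCESSOR, the class FailureDetectionReplica, the service names, and the rule that a processor is declared failed once `quorum` distinct sensors have suspected it. The agreement and order protocol that delivers the same command sequence to every replica is not shown.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical static mapping from monitored processors to the service
# responsible for them (names are illustrative only).
SERVICE_FOR_PROCESSOR: Dict[str, str] = {"p1": "fd-A", "p2": "fd-A", "p3": "fd-B"}

# A command is a suspicion request: (sensor_id, suspected_processor).
Command = Tuple[str, str]


class FailureDetectionReplica:
    """One replica of a failure detection service.

    The replica is deterministic: its state depends only on the sequence
    of commands applied, so replicas fed the same sequence stay identical.
    """

    def __init__(self, service: str, quorum: int) -> None:
        self.service = service
        self.quorum = quorum  # assumed rule: distinct sensors needed to declare failure
        self.suspicions: Dict[str, Set[str]] = {}
        self.failed: List[str] = []  # processors declared failed, in decision order

    def apply(self, command: Command) -> None:
        sensor, processor = command
        if SERVICE_FOR_PROCESSOR.get(processor) != self.service:
            return  # another service is responsible for this processor
        if processor in self.failed:
            return  # already declared failed; ignore repeated suspicions
        reporters = self.suspicions.setdefault(processor, set())
        reporters.add(sensor)
        if len(reporters) >= self.quorum:
            self.failed.append(processor)


# Two replicas that apply the same ordered log reach the same state.
log: List[Command] = [("s1", "p1"), ("s2", "p3"), ("s2", "p1"), ("s1", "p1"), ("s3", "p2")]
replica_a = FailureDetectionReplica("fd-A", quorum=2)
replica_b = FailureDetectionReplica("fd-A", quorum=2)
for cmd in log:
    replica_a.apply(cmd)
    replica_b.apply(cmd)
```

Requiring more than one sensor before declaring a failure is only one possible policy; it trades slower detection for fewer erroneous declarations, and your own design may choose differently.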
The computing environment for Phase IV is characterized as follows:
- The system satisfies the assumptions of the synchronous model of distributed computing.
- Processors exhibit crash failures (but not Byzantine failures).
- No processor running a branch GUI ever fails (but that processor may only
run a branch GUI).
- Each communication channel is bidirectional and supports
TCP/IP and UDP; choose the protocol that makes the most sense.
One challenge in this phase is to define the criterion used by the failure
detection sensors.
Some possibilities include basing the decision on
timeouts and/or TCP connection closings,
but you are free to define and incorporate others (in isolation or in combination).
Whatever criterion you select is unlikely to be perfect, so either it will suspect
processors that haven't failed or it will not suspect processors that have failed.
The rest of your system must "work correctly" despite such false suspicions.
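A timeout-based criterion can be sketched as follows. This is a minimal example under stated assumptions, not a prescribed design: the names HeartbeatSensor, record_heartbeat, suspects, and timeout_bound are hypothetical, monitored processors are assumed to send periodic heartbeat messages, and the clock is injected so the logic can be exercised without real delays. The helper timeout_bound assumes heartbeats are sent every `period` seconds and that message delay is bounded by `max_delay`, as the synchronous model permits; `slack` is an assumed allowance for effects such as clock drift or scheduling delay.

```python
import time
from typing import Callable, Dict, Iterable, Set


def timeout_bound(period: float, max_delay: float, slack: float = 0.0) -> float:
    """Largest gap between heartbeat arrivals from a live processor,
    assuming a send period, a bound on message delay, and optional slack."""
    return period + max_delay + slack


class HeartbeatSensor:
    """Suspects a processor when no heartbeat has arrived within `timeout` seconds."""

    def __init__(
        self,
        processors: Iterable[str],
        timeout: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout
        self._clock = clock
        now = clock()
        # Treat every processor as last heard from at start-up.
        self._last_seen: Dict[str, float] = {p: now for p in processors}

    def record_heartbeat(self, processor: str) -> None:
        """Call whenever a heartbeat from `processor` is received."""
        self._last_seen[processor] = self._clock()

    def suspects(self) -> Set[str]:
        """Return the processors whose last heartbeat is older than the timeout."""
        now = self._clock()
        return {p for p, seen in self._last_seen.items() if now - seen > self._timeout}


# Usage with a fake clock so that time can be advanced by hand.
fake_now = [0.0]
sensor = HeartbeatSensor(["p1", "p2"], timeout=timeout_bound(1.0, 0.5),
                         clock=lambda: fake_now[0])
fake_now[0] = 1.0
sensor.record_heartbeat("p1")
fake_now[0] = 2.0
suspected_at_2 = sensor.suspects()  # only p2 has been silent longer than 1.5 s
```

If the assumed delay bound is violated in practice (for example, by an overloaded machine), a sensor of this kind will suspect a processor that has not failed, which is exactly the situation the rest of the system must tolerate.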
Extra Credit.
Design and implement a way for
- failed failure-detection servers and/or failure-detection sensors to be
identified and removed from the failure detection service, and
- repaired failure-detection servers and/or failure-detection sensors to be
added into a failure detection service.
Note.
As has been the case throughout the semester,
extra credit will not affect your grade on this project.
It is offered for you to be challenged and for you to impress the instructors
with your prowess and mastery of course material.
Submission Procedure
All submissions should be made through CMS.
CMS provides a way for you to define your group.
Be advised that each group member must take an action in creating a group,
and your group cannot submit anything through CMS until the group has been created.
Submit the following files (at least):
-
TEAM a .txt file that contains the names (and net-ids) for all team members.
Also, for each team member, give a 1 or 2 paragraph description of the tasks
this team member performed and the number of hours this required.
-
README a .txt file that contains
- The names and a description of the contents for the other files in the directory.
- Instructions for installing, compiling, and running your software on our
Windows system.
- A tutorial that the grader can follow to start your software and to convince
himself that your system implements the required functionality. Expect the grader
to spend at most 10 minutes on this task. Include instructions for
instigating processor failures (real and erroneous)
so that the grader can observe that your
system responds to these events in a reasonable way.
-
LOGIC a .pdf file that contains
-
A description for the criterion used by the failure detection sensors.
-
A rationale to support this choice of criterion.
Be sure to include:
-
An explanation of conditions that could lead to a failure detection sensor producing
erroneous conclusions.
-
If the criterion involves some form of timing,
then you should explain how you decided on any
parameters that lead to the threshold(s) selected.
If this threshold was selected based on experiments, then describe the experiments
and include the data (presented in some easily understood form);
if this threshold was selected based on deductions, then give that analysis.
-
A rationale for the following mappings used by your architecture:
-
The mapping from failure detection sensors to processors whose failure is being
detected.
-
The mapping from failure detection sensors to failure detection services.
-
The mapping of failure detection servers to processors, with comments
about any fate-sharing and other intended consequences of co-residences.
-
A specification of your definition for "work correctly" (in the presence of
crash failures).
Justify your definition by explaining why it cannot be made stronger.
-
An explanation of why your system will "work correctly" even if one or more
failure detection sensors suspect a processor that, in fact, has not failed.
-
An explanation of any assumptions about topology and configuration your system requires.
Topology here refers to interconnection links;
configuration here refers to the mapping from branch servers, failure detection servers,
and failure detection sensors to processors.
-
LAUNCH.cmd, which is a shell script that the grader can run in order to
start the various components needed by your bank.
The grader will then try some selected test cases on your running
system to determine if it seems to be working as it should.
-
topology_file.txt should specify an interesting interconnection topology for a
multi-branch bank
that will be used to illustrate the operation of your system.
-
SourceCode A zip file containing the sources needed to compile and test your system.
Grading
Your grade will be based on the above documentation and
the following elements:
-
Does your failure detection service implement state machine replication and
employ reasonable protocols for agreement
and order?
-
Is the failure detection criterion sensible and justified?
-
Are sensible mappings (failure detection sensors to processors,
failure detection sensors to failure detection services,
failure detection servers to processors) employed and co-residence
exploited?
-
Is the specification of "work correctly" sensible and justified?
-
What is the impact of bogus failure detections on system operation?
-
How easy is it to follow the installation instructions and sample-execution tutorial in the README file?
-
Is the source code easy to understand and does it exhibit good structure?
-
Does the system run?