Fall 2009: CS5410 Fault-tolerant Distributed Computer Systems

Phase III: Availability Through Primary/Backup

Due: 11:59pm Wednesday, 11/4/2009

General Instructions. Students are required to work together in teams. You may work in the team you used for Phase II or you may form a new team. An assignment submitted on behalf of a "team" having fewer than 2 or more than 5 students will receive a grade of F. All members of the team are responsible for understanding the entire assignment.

This assignment expects you to extend the code for some Phase I implementation. Feel free to use code or ideas from any team's solutions to any prior phase in designing those extensions. However, submissions that contain extraneous code (such as code needed for implementing Phase II but not needed in Phase III) will be penalized.

No late assignments will be accepted.

Academic Integrity. Collaboration between groups is prohibited and will be treated as a violation of the University's academic integrity code.

Background: Making Branch Servers Available

In the distributed banking system built for Phase I, bank accounts (and the funds they store) at crashed branches are unavailable until the faulty host has been repaired and restarts. Few bank customers would be willing to tolerate such outages often, if at all. More generally, the increasing dependence by business on computers for daily operations means that high-availability is no longer a requirement only for life-critical control settings. Implementing availability in our banking systems is thus not an atypical concern.

The primary/backup approach is one way that the availability of an application can be enhanced. A server (along with its state) is replicated; client requests are sent to and processed by the primary server, with state updates forwarded to all backup servers. And if the primary server fails, then (i) one of the backup servers assumes the role of primary server and (ii) clients are notified about the identity of the new primary server.
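The approach described above can be illustrated with a minimal in-memory sketch. It is not a prescribed design: the names `Replica`, `BranchService`, and `deposit` are hypothetical, failures are modeled by an `alive` flag rather than by real crashes, and the reply carries the name of the current primary as a stand-in for client notification.

```python
class Replica:
    """One branch-server replica holding account balances in memory (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.balances = {}

    def apply(self, account, delta):
        # Apply one state update and return the resulting balance.
        self.balances[account] = self.balances.get(account, 0) + delta
        return self.balances[account]


class BranchService:
    """Replicated branch: the first live replica in the list acts as primary."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def primary(self):
        for replica in self.replicas:
            if replica.alive:
                return replica
        raise RuntimeError("no live replica")

    def deposit(self, account, amount):
        primary = self.primary()
        new_balance = primary.apply(account, amount)
        # Forward the state update to every live backup before replying.
        for replica in self.replicas:
            if replica is not primary and replica.alive:
                replica.apply(account, amount)
        # Returning the primary's name stands in for notifying the client.
        return primary.name, new_balance
```

With two replicas, a deposit of 100 is answered by the first replica; after that replica's `alive` flag is cleared, a further deposit of 50 is answered by the second replica with a balance of 150, because the earlier update had been forwarded to it.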

In this phase of our CS5410 project, you will program such a primary/backup system, replacing each branch server by a service with equivalent semantics but exhibiting higher availability. Thus, this phase will give you an opportunity to master and get hands-on experience with a primary/backup protocol.

What to Build

The computing environment for this phase is characterized by the following assumptions.

The system satisfies the assumptions of the synchronous model of distributed computing.
Processors exhibit crash failures (only).
No branch GUI ever fails.
Each communication channel is bidirectional.
TCP/IP and UDP communication channels are available.

Run experiments to ascertain information you need in order for your protocols to exploit the synchronous model: message-propagation delays, processor speed differences, and clock synchronization errors. Employ either TCP/IP or (the less expensive) UDP as appropriate for communication between components of your distributed banking system. As before, you will simulate a distributed system by running all components on one real computer.
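One possible way to gather message-propagation data is sketched below, assuming Python and a UDP echo pair on the loopback interface. The helper names (`measure_round_trips`, `summarize`, `udp_loopback_ping`) are hypothetical, and a round-trip time only gives an upper bound on one-way delay; processor speed differences and clock synchronization errors would need separate experiments.

```python
import socket
import statistics
import threading
import time


def measure_round_trips(ping, samples):
    """Time `samples` calls of ping(), which sends one message and waits for its echo."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        ping()
        rtts.append(time.perf_counter() - start)
    return rtts


def summarize(rtts):
    """Reduce raw samples to the figures a timeout choice would be based on."""
    return {"min": min(rtts), "mean": statistics.mean(rtts), "max": max(rtts)}


def udp_loopback_ping():
    """Return (ping, close) for a UDP echo server and client on 127.0.0.1."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.settimeout(0.2)         # lets the echo loop notice the stop flag
    stop = threading.Event()

    def echo():
        while not stop.is_set():
            try:
                data, addr = server.recvfrom(1024)
            except socket.timeout:
                continue
            server.sendto(data, addr)

    thread = threading.Thread(target=echo, daemon=True)
    thread.start()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(1.0)
    target = server.getsockname()

    def ping():
        client.sendto(b"ping", target)
        client.recvfrom(1024)

    def close():
        stop.set()
        thread.join()
        client.close()
        server.close()

    return ping, close
```

A typical run would call `ping, close = udp_loopback_ping()`, then `summarize(measure_round_trips(ping, 100))`, and finally `close()`. The observed maximum, plus a safety margin, is one candidate input for the timeout values used in failure detection.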

Simulating Processor Failures. Extend the branch GUI from Phase I with a "button" that instigates the failure of all software running on the same (simulated) processor as this branch GUI except for the branch GUI itself. Each component should, upon failure: halt, wait a random period (ranging from 15 seconds to 2 minutes), and then recover. When a component fails, information it stored in its memory is lost; information it stored on disks is not. An executing program infers the failure of a component either from direct interactions with that component or by consulting some form of failure detector service. If you need such a service, you must build it, and it must itself be available, so carefully consider where its components are executed. In either case, timeouts and/or unexpected closing of a TCP/IP connection will likely be the basis for suspecting a failure. Upon recovery, the branch server should execute some "recovery code" that you provide.
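The timeout-based suspicion and the random outage length described above might be sketched as follows. This is only an illustration under stated assumptions: `HeartbeatDetector` and `downtime_seconds` are hypothetical names, components are assumed to send periodic heartbeats, and the clock is passed in as a callable so the logic can be exercised without waiting in real time.

```python
import random


class HeartbeatDetector:
    """Suspects any component whose last heartbeat is older than `timeout` seconds."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.now = now          # callable returning the current time in seconds
        self.last_seen = {}

    def heartbeat(self, component):
        # Record the arrival time of a heartbeat from `component`.
        self.last_seen[component] = self.now()

    def suspects(self):
        # Components that have been silent for longer than the timeout.
        current = self.now()
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if current - seen > self.timeout
        )


def downtime_seconds(rng=random):
    """Random outage length between 15 seconds and 2 minutes."""
    return rng.uniform(15.0, 120.0)
```

Under the synchronous model, the timeout would need to exceed the heartbeat period plus the largest message delay observed in your experiments; otherwise a live component could be suspected by mistake.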

Primary/Backup Protocols. There are many primary/backup protocols, and you are free to choose among them. Schemes that involve a single backup are somewhat simpler to build than those that involve two or more backups, because the former can employ simpler failure-detection and fail-over schemes whereas the latter often require some form of agreement protocol.

Choose one of the primary/backup protocols described in the literature (and employ either one or two backups) or adapt chain replication (with a total of 3 chain elements) for the banking application.
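For teams considering the chain replication option, the basic data flow can be sketched as below. The sketch assumes the usual arrangement in which updates enter at the head and propagate to the tail, while queries and replies are served by the tail. The names `ChainNode` and `Chain` are hypothetical, propagation is modeled as a simple loop rather than as messages, and updates that are in flight when a node fails are not handled.

```python
class ChainNode:
    """One element of the chain, holding account balances (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.balances = {}


class Chain:
    """nodes[0] is the head and nodes[-1] is the tail."""

    def __init__(self, names):
        self.nodes = [ChainNode(name) for name in names]

    def update(self, account, delta):
        # An update enters at the head and is applied node by node down the chain.
        for node in self.nodes:
            node.balances[account] = node.balances.get(account, 0) + delta
        # The tail generates the reply once the update has reached it.
        return self.nodes[-1].balances[account]

    def query(self, account):
        # Queries are answered by the tail alone.
        return self.nodes[-1].balances.get(account, 0)

    def remove(self, name):
        # Drop a failed node; its neighbors become head or tail as needed.
        self.nodes = [node for node in self.nodes if node.name != name]
```

Removing the tail makes its predecessor the new tail, and removing the head makes its successor the new head; a real implementation must also resend updates that the failed node had received but not yet forwarded.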

The branch server (as opposed to the entire distributed banking system) is the software component to which the primary/backup protocol should be applied. Replace each branch server with a highly-available branch service by deploying and running additional replicas (as backup servers) for that branch server. The primary server and the backup servers should each be deployed on distinct existing (simulated) processors of the (simulated) distributed banking system. Choose (simulated) processors that are able to communicate with each other directly, stipulating constraints on the topology of the network as necessary.

Recovery. Design and implement a recovery protocol so that a failed branch server replica, upon recovery, returns to service as a backup server in the same highly-available branch service in which it previously participated.
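One simple recovery strategy is a full state transfer from the current primary, sketched below. This is an assumption for illustration, not the required protocol: `RecoverableReplica`, its `seq` counter, and `recover_from` are hypothetical, and the sketch ignores updates that arrive while the transfer is in progress as well as any state that could be reloaded from disk.

```python
import copy


class RecoverableReplica:
    """Replica that loses its in-memory state on a crash and rejoins as a backup."""

    def __init__(self, name, role="backup"):
        self.name = name
        self.role = role
        self.seq = 0            # number of updates applied so far
        self.balances = {}

    def apply(self, account, delta):
        self.balances[account] = self.balances.get(account, 0) + delta
        self.seq += 1

    def snapshot(self):
        # Copy the state so the recovering replica does not share dictionaries.
        return self.seq, copy.deepcopy(self.balances)

    def crash(self):
        # Memory contents are lost when the component fails.
        self.role = "failed"
        self.seq = 0
        self.balances = {}

    def recover_from(self, primary):
        # Fetch the primary's current state, then resume service as a backup.
        self.seq, self.balances = primary.snapshot()
        self.role = "backup"
```

The `seq` counter gives the recovering replica a way to check that it has caught up with the primary; a more refined protocol might transfer only the updates issued after the last sequence number that the replica saved on disk.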

Submission Procedure

All submissions should be made through CMS. CMS provides a way for you to define your group. Be advised that each group member must take an action in creating a group, and your group cannot submit anything through CMS until the group has been created.

Submit the following files (at least):

TEAM a .txt file that contains the names (and net-ids) for all team members. Also, for each team member give a 1 or 2 paragraph description of the tasks this team member performed and the number of hours this required.

README a .txt file that contains

The names and a description of the contents for the other files in the directory.
Instructions for installing, compiling, and running your software on our Windows system.
A tutorial that the grader can follow to start your software and to convince himself that your system implements the required functionality. Expect the grader to spend at most 10 minutes on this task.

LOGIC a .txt file that contains

A description of the experiments used to quantify the timing and delay parameters characterizing your synchronous system, along with your experimental data. Also, give the parameter values you use for failure detection and justify your choices.
A description (at most 2 pages) of the specific primary/backup protocol that was implemented and a pointer to a publication (paper or textbook) where this protocol is discussed.
A description (at most 2 pages) of the recovery protocol that was implemented along with an informal correctness argument (at the level that might be presented during a lecture in CS5410). Give a pointer to a publication (paper or textbook) if your protocol is derived from that prior work.
A description (at most 2 pages) of the failure detection service you built, along with an informal correctness argument (at the level that might be presented during a lecture in CS5410). Give a pointer to a publication (paper or textbook) if this protocol is derived from that prior work.

TOPO should specify an interesting interconnection topology for a multi-branch bank that will be used to illustrate the operation of your system.

TestPlan a .txt file that describes the process and any tools (i.e. additional programs) you wrote in order to test your system. This file should also explain what tests you ran and why this was a reasonable set of tests to have run.

SourceCode A zip file containing the sources needed to compile and test your system.

Grading

Your grade will be based on the above documentation and the following elements:

Does your system correctly implement some primary/backup protocol? Higher grades will go to protocols that are more sophisticated in their operation, meaning that their correctness argument is more subtle and/or they tolerate a larger number of failures.
Does your system correctly implement a recovery protocol? Higher grades will be awarded to protocols that are easier to understand yet make sensible trade-offs so that the elapsed time is not too great until a processor becomes a fully-functioning backup.
Is a reasonable scheme being employed for the failure detector? How quickly are failures detected (so that the length of any transient service outage is not unnecessarily long)? How likely are false alarms and how are they handled?
Is a reasonable scheme being employed so that clients to the service are notified when there is a fail-over and thus the primary constituting their point of contact to the service is changed?
Were TCP/IP and UDP communication channels used as appropriate?
How were your timing parameters chosen? Are they reasonable? Is the justification credible?
How easy is it to follow the README file's installation instructions and sample-execution script?
How thorough was your testing procedure and how creative were you in building sufficient scaffolding (test drivers etc) to test your system?
Is the source code easy to understand and does it exhibit good structure?