Phase III: Availability Through Primary/Backup

Due: 11:59pm Wednesday, 11/4/2009

General Instructions. Students are required to work together in teams. You may work in the team you used for Phase II or you may form a new team. An assignment submitted on behalf of a "team" having fewer than 2 or more than 5 students will receive a grade of F. All members of the team are responsible for understanding the entire assignment.

This assignment expects you to extend the code for some Phase I implementation. Feel free to use code or ideas from any team's solutions to any prior phase in designing those extensions. However, submissions that contain extraneous code (such as code needed for implementing Phase II but not needed in Phase III) will be penalized.

No late assignments will be accepted.

Academic Integrity. Collaboration between groups is prohibited and will be treated as a violation of the University's academic integrity code.

Background: Making Branch Servers Available

In the distributed banking system built for Phase I, bank accounts (and the funds they store) at crashed branches are unavailable until the faulty host has been repaired and restarted. Few bank customers would be willing to tolerate such outages often, if at all. More generally, the increasing dependence of businesses on computers for daily operations means that high availability is no longer a requirement only for life-critical control settings. Implementing availability in our banking system is thus not an atypical concern.

The primary/backup approach is one way that the availability of an application can be enhanced. A server (along with its state) is replicated; client requests are sent to and processed by the primary server, with state updates forwarded to all backup servers. If the primary server fails, then (i) one of the backup servers assumes the role of primary server and (ii) clients are notified about the identity of the new primary server.
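The approach described above can be sketched in a few lines. This is a minimal in-memory illustration, not a prescribed design: the names Replica, BranchService, deposit, and fail_primary are hypothetical, there is no networking, and failure is triggered by an explicit call rather than detected.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a branch server's state (hypothetical, in-memory only)."""
    name: str
    balances: dict = field(default_factory=dict)
    alive: bool = True

    def apply(self, account: str, delta: int) -> None:
        self.balances[account] = self.balances.get(account, 0) + delta


class BranchService:
    """A primary/backup group; replicas[0] is the current primary."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    @property
    def primary(self) -> Replica:
        return self.replicas[0]

    def deposit(self, account: str, amount: int) -> int:
        # The primary processes the request, then forwards the state
        # update to every live backup before the client gets a reply.
        self.primary.apply(account, amount)
        for backup in self.replicas[1:]:
            if backup.alive:
                backup.apply(account, amount)
        return self.primary.balances[account]

    def fail_primary(self) -> str:
        # (i) The first live backup assumes the role of primary.
        failed = self.replicas.pop(0)
        failed.alive = False
        failed.balances.clear()  # in-memory state is lost on failure
        self.replicas = [r for r in self.replicas if r.alive]
        if not self.replicas:
            raise RuntimeError("no live replica left to promote")
        # (ii) The caller can use the returned name to notify clients.
        return self.primary.name
```

For example, after BranchService([Replica("p"), Replica("b1")]) handles a deposit of 100 and the primary fails, the promoted backup "b1" still holds the balance of 100. A real implementation would also have to deal with updates that were in flight when the primary failed.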

In this phase of our CS5410 project, you will program such a primary/backup system, replacing each branch server by a service with equivalent semantics but exhibiting higher availability. Thus, this phase will give you an opportunity to master and get hands-on experience with a primary/backup protocol.

What to Build

The computing environment for this phase is characterized by the following assumptions.

Run experiments to ascertain information you need in order for your protocols to exploit the synchronous model: message-propagation delays, processor speed differences, and clock synchronization errors. Employ either TCP/IP or (the less expensive) UDP as appropriate for communication between components of your distributed banking system. As before, you will simulate a distributed system by running all components on one real computer.
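One way to run the message-propagation experiment is to time UDP round trips on the loopback interface. The following is a rough sketch under stated assumptions: echo_server and measure_rtts are hypothetical names, the sample count and timeout are arbitrary, and loopback measurements on one machine only approximate what your simulated processors will see.

```python
import socket
import threading
import time


def echo_server(sock: socket.socket) -> None:
    """Echo each datagram to its sender until an empty datagram arrives."""
    while True:
        data, addr = sock.recvfrom(1024)
        if not data:
            break
        sock.sendto(data, addr)


def measure_rtts(samples: int = 200, timeout: float = 1.0) -> list:
    """Return a list of observed round-trip times, in seconds."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    addr = server.getsockname()
    thread = threading.Thread(target=echo_server, args=(server,), daemon=True)
    thread.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(timeout)
    rtts = []
    for i in range(samples):
        start = time.perf_counter()
        client.sendto(str(i).encode(), addr)
        try:
            client.recvfrom(1024)
        except socket.timeout:
            continue  # UDP may drop datagrams; skip the lost sample
        rtts.append(time.perf_counter() - start)

    client.sendto(b"", addr)  # empty datagram tells the echo server to stop
    thread.join(timeout)
    client.close()
    server.close()
    return rtts
```

The largest observed value, padded by a safety margin of your choosing, is one plausible starting point for the delay bound that your timeouts rely on. Processor speed differences and clock synchronization errors need separate experiments.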

Simulating Processor Failures. Extend the branch GUI from Phase I with a "button" that instigates the failure of all software running on the same (simulated) processor as this branch GUI, except for the branch GUI itself. Each component should, upon failure: halt, wait a random period (ranging from 15 seconds to 2 minutes), and then recover. When a component fails, information it stored in its memory is lost; information it stored on disks is not. An executing program infers the failure of a component either from direct interactions with that component or by consulting some form of failure detector service. If you need such a service, you must build it; it must itself be available, so carefully consider where its components are executed. In either case, timeouts and/or unexpected closing of a TCP/IP connection will likely be the basis for suspecting a failure. Upon recovery, the branch server should execute some "recovery code" that you provide.
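The random outage and the timeout-based suspicion described above might look roughly as follows. This is only a sketch: downtime_seconds and HeartbeatDetector are hypothetical names, heartbeats are one possible basis for timeouts (the assignment does not require them), and the current time is passed in explicitly so that the logic can be tested without waiting.

```python
import random


def downtime_seconds(rng: random.Random) -> float:
    """Pick an outage length between 15 seconds and 2 minutes."""
    return rng.uniform(15.0, 120.0)


class HeartbeatDetector:
    """Suspects a component once no heartbeat has arrived for `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # component name -> time of last heartbeat

    def heartbeat(self, component: str, now: float) -> None:
        self.last_seen[component] = now

    def suspected(self, component: str, now: float) -> bool:
        last = self.last_seen.get(component)
        # A component that has never been heard from is also suspected.
        return last is None or now - last > self.timeout
```

In a running system, `now` would typically come from time.monotonic(), and the timeout would be derived from the delay bounds you measured for the synchronous model.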

Primary/Backup Protocols. There are many primary/backup protocols, and you are free to choose among them. Schemes that involve a single backup are somewhat simpler to build than those that involve two or more backups, because the former can employ simpler failure-detection and fail-over schemes whereas the latter often require some form of agreement protocol.

Choose one of the primary/backup protocols described in the literature (and employ either one or two backups) or adapt chain replication (with a total of 3 chain elements) for the banking application.
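If you adapt chain replication, the basic data flow with 3 chain elements can be sketched as below. This is an in-memory illustration only: ChainNode, Chain, update, query, and remove are hypothetical names, and the sketch omits the handling of updates that are in flight when a chain element fails, which a real adaptation must address.

```python
class ChainNode:
    """One chain element holding a copy of the branch's account balances."""

    def __init__(self, name: str):
        self.name = name
        self.balances = {}


class Chain:
    """nodes[0] is the head and nodes[-1] is the tail."""

    def __init__(self, names):
        self.nodes = [ChainNode(n) for n in names]

    def update(self, account: str, delta: int) -> int:
        # Updates enter at the head and are applied node by node down
        # the chain; the reply to the client comes from the tail.
        for node in self.nodes:
            node.balances[account] = node.balances.get(account, 0) + delta
        return self.nodes[-1].balances[account]

    def query(self, account: str) -> int:
        # Queries are answered by the tail alone.
        return self.nodes[-1].balances.get(account, 0)

    def remove(self, name: str) -> None:
        # Drop a failed element; its neighbors become adjacent. If the
        # head or tail failed, the next element takes over that role.
        self.nodes = [n for n in self.nodes if n.name != name]
```

For example, with Chain(["a", "b", "c"]), removing the head "a" makes "b" the new head, and later updates still reach the tail "c".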

The branch server (as opposed to the entire distributed banking system) is the software component to which the primary/backup protocol should be applied. Replace each branch server with a highly-available branch service by deploying and running additional replicas (as backup servers) for that branch server. The primary server and the backup servers should each be deployed on distinct existing (simulated) processors of the (simulated) distributed banking system. Choose (simulated) processors that are able to communicate with each other directly, stipulating constraints on the topology of the network as necessary.

Recovery. Design and implement a recovery protocol so that a failed branch server replica, upon recovery, returns to service as a backup server in the same highly available branch service in which it previously participated.
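One simple shape for such a recovery protocol is a state transfer from the current primary. The sketch below assumes that approach; Replica, applied, and rejoin are hypothetical names, everything is in memory, and it does not show how to keep new updates from interleaving with the transfer.

```python
import copy


class Replica:
    """A branch server replica with a count of the updates it has applied."""

    def __init__(self, name: str):
        self.name = name
        self.balances = {}
        self.applied = 0  # sequence number of the last applied update


def rejoin(group: list, recovered: Replica) -> None:
    """Bring a recovered replica up to date and re-admit it as a backup.

    group[0] is assumed to be the current primary.
    """
    primary = group[0]
    # Copy the primary's state so that later updates to the primary
    # do not silently alias the backup's dictionary.
    recovered.balances = copy.deepcopy(primary.balances)
    recovered.applied = primary.applied
    group.append(recovered)  # rejoins at the end, as a backup
```

Because information stored on disk survives a failure, an alternative is to reload a local checkpoint first and fetch only the updates that the replica missed.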

Submission Procedure

All submissions should be made through CMS. CMS provides a way for you to define your group. Be advised that each group member must take an action in creating a group, and your group cannot submit anything through CMS until the group has been created.

Submit the following files (at least):

TEAM a .txt file that contains the names (and net-ids) of all team members. Also, for each team member, give a one- or two-paragraph description of the tasks this team member performed and the number of hours they required.

README a .txt file that contains

LOGIC a .txt file that contains

TOPO should specify an interesting interconnection topology for a multi-branch bank that will be used to illustrate the operation of your system.

TestPlan a .txt file that describes the process and any tools (i.e. additional programs) you wrote in order to test your system. This file should also explain what tests you ran and why this was a reasonable set of tests to have run.

SourceCode A zip file containing the sources needed to compile and test your system.

Grading

Your grade will be based on the above documentation and the following elements: