Fall 2012: CS5410 Fault-tolerant Distributed Computer Systems

CS5414 (Fall 2012) Programming Assignment Phase III
Fault-tolerance Using Primary-Backup

Due: 11pm, 11/8/2012
Relative weight: 10 units

General Instructions. Work in a group comprising 1 to 3 students. All members of the group are responsible for understanding the entire assignment and the submitted solution.

You may work in the team you used for some previous phase, or you may form a new team.

This assignment builds on work done in Phase I. Feel free to use any team's solution to Phase I as the basis for your solution to Phase III.

No late assignments will be accepted.

Academic Integrity. Collaboration between groups is prohibited and will be treated as a violation of the University's academic integrity code.

Background: Making Branch Servers Available

In the distributed banking system built for Phase I, bank accounts (and the funds they store) at a crashed branch are unavailable until the faulty host has been repaired and restarted. Few bank customers would be willing to tolerate such outages often, if at all. More generally, the increasing dependence of businesses on computers for daily operations means that high availability is no longer a requirement only for life-critical control settings. Implementing availability in our banking system is thus not an atypical concern.

The primary/backup approach is one way that the availability of an application can be enhanced. A server (along with its state) is replicated; client requests are sent to and processed by the primary server, with state updates forwarded to all backup servers. And if the primary server fails, then one of the backup servers assumes the role of primary server.
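As a rough illustration of the update path just described, the following is a minimal in-memory sketch, not a prescribed design. The class BranchReplica and its methods are hypothetical names, and a direct method call stands in for a network message and its acknowledgement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical replica of one branch server; all names are illustrative.
class BranchReplica {
    private final Map<String, Long> balances = new HashMap<>();
    private final List<BranchReplica> backups = new ArrayList<>();

    // Only meaningful on the replica currently acting as primary.
    void addBackup(BranchReplica backup) {
        backups.add(backup);
    }

    // Client request handled by the primary: apply, forward, then reply.
    long deposit(String account, long amount) {
        long newBalance = apply(account, amount);
        for (BranchReplica backup : backups) {
            // Assumption: this call stands in for "send update, wait for ack".
            backup.apply(account, amount);
        }
        // The reply is produced only after every backup has the update.
        return newBalance;
    }

    // State update applied locally by the primary and by each backup.
    long apply(String account, long amount) {
        return balances.merge(account, amount, Long::sum);
    }

    long balance(String account) {
        return balances.getOrDefault(account, 0L);
    }
}
```

Replying to the client only after the backups hold the update means that a backup taking over as primary never lacks an update the client has already seen acknowledged.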

In this phase of our CS5414 project, you will program such a primary/backup system, replacing each branch server by a service with equivalent semantics but exhibiting higher availability. Thus, this phase will give you an opportunity to master and get hands-on experience with a primary/backup protocol.

What to Build

The computing environment for this phase is characterized as follows.

The system satisfies the assumptions of the asynchronous model of distributed computing.
Processors are fail-stop.
No processor running a branch GUI ever fails.
Each communication channel is bidirectional.
TCP/IP and UDP communication channels are available.

Primary/Backup Protocols. We recommend that you adapt one of the primary-backup protocols discussed in lecture (see [vRG10] or [vRS04]). However, the literature contains numerous other protocols, and more ambitious groups are welcome to implement one of these other protocols instead.

The branch server (as opposed to the entire distributed banking system) is the software component to which the primary/backup protocol should be applied. Thus, you should replace each branch server with a highly-available branch service by deploying and running additional replicas (as backup servers) for that branch server. The primary server and the backup servers should each be deployed on distinct existing Java virtual machines of the (simulated) distributed banking system. Choose Java virtual machines that are able to communicate with each other directly, stipulating constraints on the topology of the network as necessary.

Configuration Management. The correct operation of a primary-backup system requires that clients and servers have some idea of the system configuration, including which server is the primary and which servers are backups. (If chain replication is used, then the configuration defines which servers are the head and the tail, as well as defining the successor and predecessor of each server in the chain.) Configuration information can be computed by components separately and in isolation if (i) each component has access to a failure detector and (ii) failures are the only events that cause changes to the configuration. This is the approach you should employ.
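One way to picture such isolated computation is sketched below, under two assumptions that are not part of the assignment text: every component starts with the same fixed ordering of replicas, and the failure detector supplies the set of replicas currently failed. The class Configuration and its method names are hypothetical. The sketch covers failures only; how a recovered replica rejoins the ordering is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: derives roles from a fixed replica ordering and the
// set of replicas that the failure detector reports as failed.
class Configuration {
    private final List<String> live = new ArrayList<>();

    Configuration(List<String> orderedReplicas, Set<String> failed) {
        for (String replica : orderedReplicas) {
            if (!failed.contains(replica)) {
                live.add(replica);
            }
        }
    }

    // Primary (or chain head): first live replica, or null if none remain.
    String primary() {
        return live.isEmpty() ? null : live.get(0);
    }

    // Chain tail: last live replica, or null if none remain.
    String tail() {
        return live.isEmpty() ? null : live.get(live.size() - 1);
    }

    // Backups: every live replica other than the primary.
    List<String> backups() {
        return live.size() <= 1
                ? new ArrayList<>()
                : new ArrayList<>(live.subList(1, live.size()));
    }

    // Next live replica in the chain, or null for the tail or an unknown replica.
    String successor(String replica) {
        int i = live.indexOf(replica);
        return (i < 0 || i == live.size() - 1) ? null : live.get(i + 1);
    }

    // Previous live replica in the chain, or null for the head or an unknown replica.
    String predecessor(String replica) {
        int i = live.indexOf(replica);
        return (i <= 0) ? null : live.get(i - 1);
    }
}
```

Because the result depends only on the shared ordering and the reported failures, two components that have seen the same failure reports compute the same configuration without exchanging messages.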

Recovery. Design and implement a recovery protocol so that a failed branch server replica, upon recovery, returns to service as a backup server in the same highly-available branch service in which it previously participated.
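The sketch below shows one simple shape a recovery step could take: the restarted replica copies the current primary's state and then resumes service as a backup. It is an illustration under stated assumptions, not the required protocol. The class RecoverableReplica and its methods are hypothetical, and the sketch ignores updates that arrive while the copy is in progress, which a real protocol must block or buffer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical replica that can rejoin a branch service by state transfer.
class RecoverableReplica {
    private Map<String, Long> balances = new HashMap<>();
    private boolean inService = false;

    // Returns a copy so the caller cannot alias this replica's state.
    synchronized Map<String, Long> snapshot() {
        return new HashMap<>(balances);
    }

    synchronized void applyUpdate(String account, long amount) {
        balances.merge(account, amount, Long::sum);
    }

    // Run on the restarted replica: discard stale state, copy the primary's
    // state, then return to service as a backup.
    synchronized void recoverFrom(RecoverableReplica primary) {
        balances = primary.snapshot();
        inService = true;
    }

    synchronized boolean isInService() {
        return inService;
    }

    synchronized long balance(String account) {
        return balances.getOrDefault(account, 0L);
    }
}
```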

Failure Detection. For this phase, create in software an oracle that simulates a centralized, fault-tolerant, perfect failure detector. Components of your system should be able to learn from this failure detector whether a given processor or system component has failed.

The oracle, in fact, is not expected to do any real detection at all---it should simply forward information that is provided using a new system GUI that a human operates while the banking system runs. We (unrealistically) assume some human operator completely controls all failures in the system and also is gracious enough to update the failure detector accordingly whenever such a failure is instigated or repaired. In particular, the human operator informs the GUI that a failure has occurred only after that human operator forces the failure; and the human operator informs the GUI that a recovery has occurred only after that human operator restarts the offending processor.
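A minimal sketch of such an oracle follows. It performs no detection and only records what the operator's GUI tells it. The class OracleFailureDetector and its method names are hypothetical, and the GUI itself is omitted.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical oracle: it records operator reports and detects nothing itself.
class OracleFailureDetector {
    private final Set<String> failed = new HashSet<>();

    // Called by the operator GUI after the operator has forced a failure.
    synchronized void reportFailure(String component) {
        failed.add(component);
    }

    // Called by the operator GUI after the operator has restarted the processor.
    synchronized void reportRecovery(String component) {
        failed.remove(component);
    }

    // Read-only copy of the components currently recorded as failed.
    synchronized Set<String> failedComponents() {
        return Collections.unmodifiableSet(new HashSet<>(failed));
    }
}
```

Because the operator reports a failure only after forcing it, and a recovery only after the restart, the oracle never reports a running component as failed, which is what makes it behave like a perfect failure detector.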

Two plausible paradigms for communicating with a failure detector (whether implemented by an oracle or in some more realistic manner) are:

Each processor or system component P registers with the failure detector, specifying components of interest. If one of these components of interest should subsequently fail, then the failure detector sends P a message announcing this event.
A processor or system component P can send a message querying the failure detector about the status of some other component P'; the failure detector replies with a message announcing whether P' has failed.

Pick one and implement your system using that.
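To make the two paradigms concrete, the sketch below exposes both from a single in-memory class. It is only an illustration: the class FailureDetectorApi and its method names are hypothetical, a callback stands in for the announcement message of the first paradigm, and a method return value stands in for the reply message of the second.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical failure detector front end showing both paradigms.
class FailureDetectorApi {
    private final Set<String> failed = new HashSet<>();
    private final Map<String, List<Consumer<String>>> watchers = new HashMap<>();

    // Paradigm 1: P registers interest in a component and is notified
    // if that component subsequently fails.
    synchronized void register(String componentOfInterest, Consumer<String> onFailure) {
        watchers.computeIfAbsent(componentOfInterest, k -> new ArrayList<>())
                .add(onFailure);
    }

    // Paradigm 2: P asks whether a given component has failed.
    synchronized boolean hasFailed(String component) {
        return failed.contains(component);
    }

    // Input side: a failure is reported (by the operator, in the oracle setting).
    synchronized void reportFailure(String component) {
        if (!failed.add(component)) {
            return; // already known to have failed; do not notify twice
        }
        for (Consumer<String> watcher
                : watchers.getOrDefault(component, new ArrayList<>())) {
            // Assumption: the callback stands in for sending a message to P.
            watcher.accept(component);
        }
    }
}
```

With registration, components learn of a failure without polling; with queries, the failure detector keeps no per-client state. Either choice is compatible with the oracle described above.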

Submission Procedure

All submissions should be made through CMS. CMS provides a way for you to define your group. Be advised that each group member must take an action in creating a group, and your group cannot submit anything through CMS until the group has been created.

Submit the following files (at least):

TEAM a .txt file that contains the names (and net-ids) for all team members. Also, for each team member give a 1 or 2 paragraph description of the tasks this team member performed and the number of hours this required.

README a .txt file that contains

The names and a description of the contents for the other files in the directory.
Instructions for installing, compiling, and running your software on our Windows system.
A tutorial that the grader can follow to start your software and to confirm that your system implements the required functionality. Expect the grader to spend at most 10 minutes on this task.

LOGIC a .txt file that contains

A rationale for your choice of primary/backup protocol, if you implement a protocol that is different from the ones we discussed in class. If the protocol is not one we discussed in class then include a description (at most 2 pages) and a pointer to a publication (paper or textbook) where this protocol is discussed.
A description (at most 2 pages) of the recovery protocol that was implemented along with an informal correctness argument (at the level that might be presented during a lecture in cs5414). Give a pointer to a publication (paper or textbook) if your protocol is derived from that prior work.

LAUNCH.cmd which is a shell script that the grader can run in order to start the various components needed by your bank. The grader will then try some selected test cases on your running system to determine if it seems to be working as it should.

topology_file.txt should specify an interesting interconnection topology for a multi-branch bank that will be used to illustrate the operation of your system.

TestPlan a .txt file that describes the process and any tools (i.e. additional programs) you wrote in order to test your system. This file should also explain what tests you ran and why this was a reasonable set of tests to have run.

SourceCode A zip file containing the sources needed to compile and test your system.

Grading

Your grade will be based on the above documentation and the following elements:

Does your system correctly implement some primary/backup protocol?
Does your system correctly implement a recovery protocol?
How easy is it to follow the README file installation and sample-execution script?
How thorough was your testing procedure and how creative were you in building sufficient scaffolding (test drivers etc) to test your system?
Is the source code easy to understand and does it exhibit good structure?
