Phase III: Fault-tolerance Through Message-Logging

Due: 8:00am Tuesday, April 5.

General Instructions. Same as for Phase II.  This assignment builds on work that was done in Phase I. Feel free to use any team's solution to Phase I as the basis for your solution to Phase III. No late assignments will be accepted.

Academic Integrity. Collaboration between groups is prohibited and will be treated as a violation of the University's academic integrity code.

Background: Disaster Recovery for Bank Branches

In the distributed banking system built for Phase I, account balances are stored in the volatile memory of processors --- not on secondary storage. If the processor at a branch crashes, then access to information about those balances is not only lost while the processor remains unavailable but can be lost forever. Neither the bank nor its customers are likely to be happy about such a loss of account information.

Use of a message-logging protocol is one way that a branch server could recover information concerning that branch's accounts after a faulty processor has been repaired. Customers might not be happy about losing access to their accounts while "the computer is down", but at least there would be no irrevocable consequences of a processor failure.

In this phase of the CS514 project, you will program a message-logging protocol (of your choice) for the branch servers of the distributed bank implemented for Phase I. This exercise should:

Note that the book does not cover message-logging protocols. To solve this part of the project, we expect you to read the papers cited on our course "outline" page and to implement those protocols by hand.

What to Build

For this phase, assume:

Branch GUI and Branch Server. Extend the branch GUI from Phase I with a "button" that instigates the failure of the corresponding branch server. Activating this button should cause the branch server to simulate a crash failure and recovery by doing the following.

  1. Send a message to each neighbor, announcing that this branch server has failed. (So this is really the fail-stop model, since it includes failure notifications that can be trusted.)
  2. Cause the branch server application to terminate (perhaps by sending a suitable message to a modified version of that branch server).
  3. After some time has passed (say 30 seconds) since the branch server terminated, cause a new copy of the branch server to start and run some "recovery code".
  4. While the branch server is down, communication in the whole bank might be disrupted. If the "routing table" offers alternative routes (e.g., A could also talk to B via X), you should switch routes after a short period of trying to overcome the loss (say 250 ms). If there is only one route and it goes through a failed server, branches will just have to keep trying until that server recovers. Similarly, clients will lose their connections and will need to reconnect.
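The crash sequence and the rerouting rule above can be sketched as follows. This is only an illustration under stated assumptions, not a required design: the `Branch` class, its `known_failed` set, the `restart` notification to neighbors, and the `choose_route` helper are all hypothetical names, and a route is assumed to be a list of branch names ending at the destination. Timing (the 30-second restart delay and the 250 ms retry window) is left to the caller.

```python
class Branch:
    """Hypothetical in-memory stand-in for one branch server."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []        # directly connected Branch objects
        self.known_failed = set()  # names of branches announced as failed
        self.running = True

    def crash(self):
        # Step 1: announce the failure to every neighbor (fail-stop notice).
        for neighbor in self.neighbors:
            neighbor.known_failed.add(self.name)
        # Step 2: terminate. A real server process would exit here.
        self.running = False

    def restart(self, recover):
        # Step 3: a new copy starts and runs the recovery code before
        # it accepts any requests.
        recover(self)
        self.running = True
        # Assumption (not in the handout): tell neighbors we are back.
        for neighbor in self.neighbors:
            neighbor.known_failed.discard(self.name)


def choose_route(routes, failed):
    """Return the first route that avoids every failed branch, else None.

    None means every known route passes through a failed server, so the
    caller has to keep retrying until that server recovers.
    """
    for path in routes:
        if not any(hop in failed for hop in path):
            return path
    return None
```

For example, if A reaches B through either Y or X and Y crashes, `choose_route([["Y", "B"], ["X", "B"]], a.known_failed)` falls back to the route through X. With only the route through Y available, it returns `None` and A would keep retrying.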

Message Logging Protocol Implementation. Also, write "recovery code" for the branch server. Execution of this code enables the branch server to obtain correct and current account balances before that branch server commences processing subsequent requests for the various operations it supports.
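As one possible shape for the recovery code, here is a minimal sketch of pessimistic receiver-based logging: each incoming message is forced to a log file before it is applied, and recovery rebuilds the balances by replaying that log. It is an illustration only, since the choice of protocol is yours. The `LoggedBranch` class, the JSON message format, and the `deposit`/`withdraw` operation names are assumptions, not part of the Phase I system. The sketch also assumes that message handling is deterministic, and it ignores messages that the branch itself sends while replaying.

```python
import json
import os


class LoggedBranch:
    """Hypothetical branch server that logs messages before applying them."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.balances = {}
        self.recover()  # a fresh server and a restarted one start the same way

    def _apply(self, msg):
        # Deterministic state update, so replay reproduces the same balances.
        if msg["op"] == "deposit":
            delta = msg["amount"]
        elif msg["op"] == "withdraw":
            delta = -msg["amount"]
        else:
            raise ValueError("unknown operation: %r" % msg["op"])
        account = msg["account"]
        self.balances[account] = self.balances.get(account, 0) + delta

    def deliver(self, msg):
        # Pessimistic logging: force the message to stable storage first,
        # and only then let it affect the in-memory state.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(msg) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._apply(msg)

    def recover(self):
        # Recovery code: rebuild the balances by replaying the log in order.
        self.balances = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                if line.strip():
                    self._apply(json.loads(line))
```

Constructing a new `LoggedBranch` on the same log path after a simulated crash yields the balances the failed server held. The trade-off is a synchronous disk write on every message, which is the kind of cost your LOGIC file should weigh against other protocols.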

As mentioned in class, descriptions of some message-logging protocols and citations to relevant literature can be found in:

Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A survey of rollback-recovery protocols in message-passing systems. Technical Report. Available from http://www.cs.utexas.edu/users/lorenzo/papers/Pap6.ps or locally as PostScript or PDF.

If you can find a way to map our problem to something we tackled in class, or that you find discussed in this paper, please implement the standard solution.  We do not think that you need to invent a totally new solution in order to solve the problem that arises in our bank system.

Submission Procedure. As usual, but include:

README which contains

· The names and a description of the contents for the other files in the directory.

· Instructions for installing, compiling, and running your software on our Windows-NT system.

· A tutorial that the grader can follow to start your software and to convince himself that your system implements the required functionality. Expect the grader to spend at most 10 minutes on this task.

LOGIC which contains

· A brief description of the specific message-logging protocol that was implemented and a pointer to a publication (paper or textbook) where this protocol is discussed.

· An explanation of what other protocols were considered and why this protocol was selected. Discuss those aspects of the application or setting that impact the choice of protocol. Give strengths and weaknesses of the protocol you selected as compared with alternatives.

· A characterization of how many processor failures your protocol can tolerate (relate this to network topology if appropriate).

TOPO should specify an interesting interconnection topology for a multi-branch bank that will be used to illustrate the operation of your system.

TestPlan should describe the process and any tools (i.e. additional programs) you wrote in order to test your system. This file should also explain what tests you ran and why this was a reasonable set of tests to have run.

 

Grading. Your grade will be based on the following elements:

Be warned: It is tricky to go beyond tolerating a single failure and a topology where every branch is the neighbor of every other branch. Don't try this until you have a working solution for the case where (i) there is at most one branch that has failed and not restarted and (ii) all branches are neighbors of each other.
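A test harness could check that a run stays inside this basic case before exercising anything harder. The sketch below is only a suggestion: it assumes the topology is available as a dictionary mapping each branch name to the set of its neighbors, which may not match your TOPO file format, and the function names are hypothetical.

```python
def is_fully_connected(topology):
    """True if every branch lists every other branch as a neighbor."""
    names = set(topology)
    return all(
        set(neighbors) == names - {name}
        for name, neighbors in topology.items()
    )


def in_basic_case(topology, failed):
    """Condition (i): at most one failed branch; (ii): full connectivity."""
    return len(set(failed)) <= 1 and is_fully_connected(topology)
```

For a three-branch bank where A, B, and C are all neighbors of one another, `in_basic_case` holds with no failures or with one failed branch, and stops holding once a second branch fails or a link is missing.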