General Instructions. Work in a group comprising 1 to 3 students. All members of the group are responsible for understanding the entire assignment and submitted solution.
You may work in the team you used for some previous phase, or you may form a new team.
This assignment builds on work done in Phase I. Feel free to use any team's solution to Phase I as the basis for your solution to Phase III.
No late assignments will be accepted.
Academic Integrity. Collaboration between groups is prohibited and will be treated as a violation of the University's academic integrity code.
The primary/backup approach is one way to enhance the availability of an application. A server (along with its state) is replicated; client requests are sent to and processed by the primary server, which forwards the resulting state updates to all backup servers. If the primary server fails, then one of the backup servers assumes the role of primary server.
In this phase of our CS5414 project, you will program such a primary/backup system, replacing each branch server by a service with equivalent semantics but exhibiting higher availability. Thus, this phase will give you an opportunity to master and get hands-on experience with a primary/backup protocol.
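The primary/backup idea described above can be illustrated with a minimal, single-process sketch. It is not the protocol from lecture and it ignores networking entirely; the class names `SketchReplica` and `SketchBranchService`, the `deposit` operation, and the rule that the first non-failed replica acts as primary are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical replica of a branch server; its state is a map of account balances.
class SketchReplica {
    final Map<String, Long> balances = new HashMap<>();
    boolean failed = false;

    // Install a state update: the new balance of one account.
    void apply(String account, long newBalance) {
        balances.put(account, newBalance);
    }
}

// Hypothetical replicated branch service running all replicas in one process.
class SketchBranchService {
    private final List<SketchReplica> replicas = new ArrayList<>();

    SketchBranchService(int n) {
        for (int i = 0; i < n; i++) {
            replicas.add(new SketchReplica());
        }
    }

    SketchReplica replica(int i) {
        return replicas.get(i);
    }

    // Assumption: the primary is the first replica in the list that has not failed.
    SketchReplica primary() {
        for (SketchReplica r : replicas) {
            if (!r.failed) {
                return r;
            }
        }
        throw new IllegalStateException("all replicas have failed");
    }

    // The primary processes the request, then forwards the resulting
    // state update to every live backup before the reply is returned.
    long deposit(String account, long amount) {
        SketchReplica p = primary();
        long updated = p.balances.getOrDefault(account, 0L) + amount;
        p.apply(account, updated);
        for (SketchReplica r : replicas) {
            if (r != p && !r.failed) {
                r.apply(account, updated);
            }
        }
        return updated;
    }
}
```

Because every live backup has applied the update before `deposit` returns, a backup that takes over after the primary fails already holds the balance the client was told about.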
Primary/Backup Protocols. We recommend that you adapt one of the primary/backup protocols discussed in lecture (see [vRG10] or [vRS04]). However, the literature contains numerous other protocols, and more ambitious groups are welcome to implement one of these other protocols instead.
The branch server (as opposed to the entire distributed banking system) is the software component to which the primary/backup protocol should be applied. Thus, you should replace each branch server with a highly-available branch service by deploying and running additional replicas (as backup servers) for that branch server. The primary server and the backup servers should each be deployed on distinct existing Java virtual machines of the (simulated) distributed banking system. Choose Java virtual machines that are able to communicate with each other directly, stipulating constraints on the topology of the network as necessary.
Configuration Management. The correct operation of a primary-backup system requires that clients and servers have some idea of the system configuration, including which server is the primary and which servers are backups. (If chain replication is used, then the configuration defines which servers are the head and the tail, as well as defining the successor and predecessor of each server in the chain.) Configuration information can be computed by components separately and in isolation if (i) each component has access to a failure detector and (ii) failures are the only events that cause changes to the configuration. This is the approach you should employ.
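As a rough sketch of how a component might compute the configuration locally, the class below keeps an ordered list of live replicas and updates it from failure-detector events. The class name `ConfigurationView`, its method names, and the replica identifiers are hypothetical. The sketch assumes that every component starts from the same initial order and observes the same events in the same order, and that a recovered replica rejoins at the end of the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical local view of the configuration, kept by each client and server.
class ConfigurationView {
    // Live replicas in chain order: first is the primary (head), last is the tail.
    private final List<String> order;

    ConfigurationView(List<String> initialOrder) {
        this.order = new ArrayList<>(initialOrder);
    }

    // Called when the failure detector reports that a replica has failed.
    void onFailure(String id) {
        order.remove(id);
    }

    // Called when the failure detector reports a recovery.
    // Assumption: a recovered replica rejoins at the end, i.e. as a backup.
    void onRecovery(String id) {
        if (!order.contains(id)) {
            order.add(id);
        }
    }

    // Returns null when no replica is live.
    String primary() {
        return order.isEmpty() ? null : order.get(0);
    }

    String tail() {
        return order.isEmpty() ? null : order.get(order.size() - 1);
    }

    List<String> backups() {
        if (order.size() <= 1) {
            return new ArrayList<>();
        }
        return new ArrayList<>(order.subList(1, order.size()));
    }

    // Returns null if id is the tail or is not a live replica.
    String successor(String id) {
        int i = order.indexOf(id);
        return (i >= 0 && i + 1 < order.size()) ? order.get(i + 1) : null;
    }

    // Returns null if id is the head or is not a live replica.
    String predecessor(String id) {
        int i = order.indexOf(id);
        return (i > 0) ? order.get(i - 1) : null;
    }
}
```

Two components that apply the same sequence of `onFailure` and `onRecovery` calls to the same initial order end up with identical views, so no extra agreement step is needed under these assumptions.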
Recovery. Design and implement a recovery protocol so that a failed branch server replica, upon recovery, returns to service as a backup server in the same highly-available branch service in which it previously participated.
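One possible ingredient of such a recovery protocol is state transfer: the recovering replica copies a snapshot of the state from a live replica and then resumes applying forwarded updates. The sketch below shows only that step, in a single process. The names `RecoverableReplica` and `StateSnapshot`, and the use of per-update sequence numbers, are assumptions made for illustration and are not prescribed by the assignment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable copy of a replica's state at one point in time.
class StateSnapshot {
    final Map<String, Long> balances;
    final long lastApplied;

    StateSnapshot(Map<String, Long> balances, long lastApplied) {
        this.balances = balances;
        this.lastApplied = lastApplied;
    }
}

// Hypothetical branch server replica that supports state transfer.
class RecoverableReplica {
    private final Map<String, Long> balances = new HashMap<>();
    private long lastApplied = 0; // sequence number of the last update applied

    // Apply a forwarded state update; updates already covered by the
    // current state (seq <= lastApplied) are ignored.
    synchronized void applyUpdate(long seq, String account, long newBalance) {
        if (seq <= lastApplied) {
            return;
        }
        balances.put(account, newBalance);
        lastApplied = seq;
    }

    // Taken on a live replica and sent to the recovering replica.
    synchronized StateSnapshot snapshot() {
        return new StateSnapshot(new HashMap<>(balances), lastApplied);
    }

    // Run on the recovering replica before it rejoins as a backup.
    synchronized void installSnapshot(StateSnapshot s) {
        balances.clear();
        balances.putAll(s.balances);
        lastApplied = s.lastApplied;
    }

    synchronized long balance(String account) {
        return balances.getOrDefault(account, 0L);
    }
}
```

The sequence-number check means an update that was both included in the snapshot and forwarded again after the replica rejoined is applied only once. A full protocol would also have to decide when the rejoining replica starts receiving forwarded updates relative to when the snapshot was taken.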
Failure Detection. For this phase, create in software an oracle that simulates a centralized, fault-tolerant, perfect failure detector. Components of your system should be able to learn from this failure detector whether a given processor or system component has failed.
The oracle, in fact, is not expected to do any real detection at all; it should simply forward information that is provided through a new system GUI that a human operates while the banking system runs. We (unrealistically) assume that a human operator completely controls all failures in the system and is also gracious enough to update the failure detector whenever such a failure is instigated or repaired. In particular, the human operator informs the GUI that a failure has occurred only after forcing that failure, and informs the GUI that a recovery has occurred only after restarting the failed processor.
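A minimal sketch of such an oracle, assuming it runs in one process and that components are named by string identifiers, could look like the following. The class name `FailureOracle` and its method names are hypothetical. The GUI would call `reportFailure` and `reportRecovery` on the operator's behalf, and other components would call `hasFailed`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical oracle: it performs no detection and only records what the
// human operator reports through the GUI.
class FailureOracle {
    // Thread-safe set of components currently reported as failed.
    private final Set<String> failed = ConcurrentHashMap.newKeySet();

    // Called by the GUI after the operator has forced a failure.
    void reportFailure(String componentId) {
        failed.add(componentId);
    }

    // Called by the GUI after the operator has restarted the component.
    void reportRecovery(String componentId) {
        failed.remove(componentId);
    }

    // Called by any component that wants to know whether another has failed.
    boolean hasFailed(String componentId) {
        return failed.contains(componentId);
    }
}
```

The concurrent set is used because the GUI thread and the threads serving queries may touch the oracle at the same time. How remote components reach the oracle over the network is left open here.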
Two plausible paradigms for communicating with a failure detector (whether implemented by an oracle or in some more realistic manner) are:
Pick one and implement your system using that.
Submit the following files (at least):
LAUNCH.cmd, a shell script that the grader can run to start the various components needed by your bank. The grader will then try selected test cases on your running system to determine whether it appears to be working as it should.