Phase IV: Availability Through Active Replication
Due: 11:59pm Monday, 11/23/2009
General Instructions.
Students are required to work together in teams.
You may work in a team you used for an earlier phase or you may form a new team.
An assignment submitted on behalf of a "team" having fewer than 2 or
more than 5 students will receive a grade of F.
All members of the team are responsible for understanding the entire
assignment.
This assignment expects you to extend the code for some Phase I
implementation.
Feel free to use code or ideas from
any team's solution to any prior phases in designing those extensions.
However, submissions that contain extraneous code (such as code needed for
implementing earlier phases but not needed in this phase) will
be penalized.
No late assignments will be accepted.
Academic Integrity. Collaboration between groups is
prohibited and will be treated as a violation of the University's
academic integrity code.
Background: Enhanced Availability and Fault-tolerance
In the distributed banking system you built for Phase III, bank accounts (and the
funds they store) are unavailable
from the instant that the primary server fails until clients
of the service have been informed of the new primary's identity.
Moreover, the primary/backup approach cannot tolerate Byzantine failures.
Active replication (also known as the "state machine approach"), though
a bit more expensive, does not exhibit these limitations.
In this phase of our CS514 project, you will employ active-replication,
replacing each branch server by a service with equivalent
semantics but exhibiting higher availability.
This phase will thus give you an opportunity to get hands-on
experience designing a service that implements active replication.
What to Build
For this phase, you are given considerable latitude concerning
assumptions you make about the environment and protocols you employ
in implementing your system.
At a minimum, however, assume a computing environment in which:
- Each branch GUI runs on its own host, and this host never fails.
A branch GUI:
-
may send messages to any individual host or any subset
of the system hosts based on user input to the GUI,
-
may receive messages from any system host, and
-
may embody a voter
to accept or reject messages containing information for display by the GUI.
The branch GUI otherwise has extremely limited processing capacity and
may not otherwise be programmed as if it were a general-purpose host.
- Other software components (servers etc.) run on hosts that can fail.
There is one of these hosts per branch.
Associated with each of these hosts is a GUI that never fails and that can be used
to cause the associated host to simulate unusual behavior (such as failure or
restart)---all programs being executed on the associated host are then affected.
- Every host appears to be directly connected to every other host.
Thus, a correct host that receives a message can determine with
certainty which host (correct or faulty) sent that message.
As in previous phases,
feel free to employ a single real computer in order to simulate all of the hosts
and the network being used by your system.
In addition to the above assumptions, define the other aspects of
your computing environment by making choices for the following
computing environment characteristics.
-
Failure model:
Choose between Byzantine, send-omissions, crash, and fail-stop.
-
Fault-tolerance degree:
Define the worst case number of host failures that may occur before the functionality
of any branch could suffer.
-
Synchronicity assumption:
Choose between asynchronous, synchronous, and synchronous with clocks that
are being kept synchronized (so you don't have to bother).
-
Replica management protocols:
Choose among any of the protocols discussed in class or that you find
in the literature.
Feel free to modify any of these existing protocols to your particular needs, but
do not design an agreement protocol or a replica-mangement protocol from scratch.
-
Recovery/Restart functionality:
Implement the functionality to reintegrate failed hosts that have been repaired
or elect not to support this functionality.
Comments about State Machine Protocols.
A branch server can be replicated by running a replica of it on hosts at other branches.
There are two ways to support a replica of branch server for branch
B1 (say) on the host for branch B2 (say):
- Modify the branch server at branch B2.
The modified branch server not only processes the operations (commands)
that it used to
(e.g. Deposit, Withdraw, Query, Transfer for
that branch's accounts) but this modified branch server also handles these operations for
branch B1.
Effectively, we are merging new branch server replicas with the branch server
that already exists at each host.
-
Instantiate at branch B2
a copy of the branch server for branch B1, and associate this copy with a new socket
at the host running branch server B2.
Note that you may have to modify your "network wrapper class" to accommodate
any additional
sockets now in use.
As far as the state machine approach is concerned,
each branch GUI should be viewed as a client.
This means that the branch GUI will now be receiving multiple responses
for each operation it initiates and, therefore, it must
convert those responses into a single one (which is then presented to the human
user).
Because the mechanism for combining responses
resides in the branch GUI, it never experiences failures.
Recall that in processing a Transfer operation,
one branch server invokes an operation at another.
Think carefully about how best to handle this once branch servers are replicated.
Although there is only one level of nested call by servers in the system,
try to devise a solution
that scales up, gracefully supporting multiple levels of nested calls.
The naive solution---having each server replica make a separate request to each other
server replica---is workable, but more-scalable solutions will receive higher grades.
Submission Procedure.
All submissions should be made through
CMS.
CMS provides a way for you to
define
your group.
Be advised that each group member must take an action in creating a group,
and your group cannot submit anything through CMS until the group has been created.
Submit the following files:
-
TEAM a .txt file that contains the names (and net-ids) for all team members.
Also, for each team member give a 1 or 2 paragraph description of the tasks
this team member performed and the number of hours this required.
-
README a .txt file that contains
- Instructions for installing, compiling, and running your software on our
Windows system.
- A tutorial that the grader can follow to start your software and to convince
himself that your system implements the required functionality. Expect the grader
to spend at most 10 minutes on this task.
-
LOGIC a .txt file that contains
- A description of the computing environment characteristics
being assumed. If your choices involve assumptions that must be
simulated (e.g., you assume clocks are being kept synchronized or
you assume a failure-detection
service), explain how this simulation is implemented.
- A description of the replica management protocols (including agreement protocols)
you implemented. Include citations (or URL's) to publications where you found
these protocols.
- A description of that protocol, if you elected to have your system support host
recovery/restart functionality. Include citations (or URL's) to
publications that contain these protocols.
-
TestPlan a .txt file that
describes the process and any tools (i.e. additional programs)
you wrote in order to test your system. This file should also explain what tests you
ran and why this was a reasonable set of tests to have run.
-
SourceCode A zip file containing the sources needed to compile and test your system.
Grading.
Your grade will be based on the above documentation and
the following elements:
-
Does your system run.
-
How hostile is the computing environment you target?
More hostile environments lead to better grades
only if the system runs.
Thus, you are strongly encouraged to choose a computing environment initially
for which the chances of your getting the system to run are reasonably good;
then, if you have more time, consider extending your system
with additional functionality (e.g.,
add recovery/restart functionality) or to run in a more hostile environment.
-
How easy is it to follow the README file installation and sample-execution script.
-
How thorough was your testing procedure and how creative were you in building
sufficient scaffolding (computing environment
simulation, test drivers, etc.) to test your system.
-
Is the source code easy to understand and does it exhibit good structure?