Phase III: Availability Through Primary/Backup
Due: 11:59pm Wednesday, 11/4/2009
General Instructions.
Students are required to work together in teams.
You may work in the team you used for Phase II or you may form a new team.
An assignment submitted on behalf of a "team" having fewer than 2 or
more than 5 students will receive a grade of F.
All members of the team are responsible for understanding the entire
assignment.
This assignment expects you to extend the code for some Phase I
implementation.
Feel free to use code or ideas from
any team's solutions to any prior phase in designing those extensions.
However, submissions that contain extraneous code (such as code needed for
implementing Phases II but not needed in Phase III) will
be penalized.
No late assignments will be accepted.
Academic Integrity. Collaboration between groups is
prohibited and will be treated as a violation of the University's
academic integrity code.
Background: Making Branch Servers Available
In the distributed banking system built for Phase I, bank accounts (and the
funds they store) at crashed
branches are unavailable until the faulty host has been repaired and restarts.
Few bank customers would be willing to tolerate such outages often, if at all.
More generally,
the increasing dependence by business on computers for daily operations means that
high-availability is no longer a requirement only for
life-critical control settings.
Implementing availability in our banking systems is thus not an atypical concern.
The primary/backup approach is one way
that the availability of an application can be
enhanced.
A server (along with its state) is replicated;
client requests are sent to and processed by the primary server,
with state updates forwarded to all backup servers.
And if the primary server fails, then (i) one of the backup servers
assumes the role of primary server and (ii) clients are notified about the
identity of the new primary server.
In this phase of our CS5410 project, you will program such a primary/backup
system, replacing each branch server by a service with equivalent
semantics but exhibiting higher availability.
Thus, this phase will give you an opportunity to master and get hands-on
experience with a primary/backup protocol.
What to Build
The computing environment for this phase is characterized by the following assumptions.
- The system satisfies the assumptions of the synchronous model of distributed computing.
- Processors exhibt crash failures (only).
- No branch GUI ever fails.
- Each communication channel is bidirectional.
- TCP/IP and UDP communication channels are available.
Run experiments to ascertain information you need
in order for your protocols to exploit the synchronous model:
message-propagation delays,
processor speed differences, and clock synchronization errors.
Employ either TCP/IP or (the less expensive) UDP as
appropriate for communication between components of your distributed banking
system.
As before, you will simulate a distributed system by running all components
on one real computer.
Simulating Processor Failures.
Extend the branch GUI from Phase I
with a "button" that instigates the failure of all software
running on the same (simulated) processor as this branch GUI
except for the branch GUI itself.
Each component should, upon failure: halt,
wait a random period (ranging from 15 seconds to 2 minutes), and then
recover.
When a component fails, information it stored in its memory is lost;
information it stored on disks is not.
An executing
program infers the failure of a component either from direct interactions
with that component or by consulting some form of failure detector service
(which you must build, if you need it and must itself be available, so carefully
consider where its components are executed).
In either case, timeouts and/or unexpected closing of a TCP/IP connection
will likely be the basis for suspecting a failure.
Upon recovery, the branch server should execute some "recovery code"
that you provided.
Primary/Backup Protocols.
There are many primary/backup protocols, and you are free to
choose among them.
Schemes that involve a single backup are somewhat simpler to build
than those that involve two or more backups, because the
former can employ simpler failure-detection and fail-over schemes
whereas the latter often require some form of agreement protocol.
Choose one of the primary/backup
protocols described in the literature
(and employ either one or two backups) or
adapt chain replication (with a total of 3 chain elements) for the banking application.
The branch server (as opposed to the entire distributed banking system)
is the software component to which the primary/backup protocol should be applied.
Replace each branch server with a
highly-available branch service by deploying and running additional replicas
(as backup servers) for that branch server.
The primary server and the backup servers
should each be deployed on distinct
existing (simulated) processors of the (simulated) distributed banking system.
Choose (simulated) processors that are able to communicate with
each other directly,
stipulating constraints on the topology of the network as necessary.
Recovery.
Design and implement a recovery protocol
so that a failed branch server replica,
upon recovery, returns to service as a backup server in the same
highly-available branch service as it previously participated.
Submission Procedure
All submissions should be made through
CMS.
CMS provides a way for you to
define
your group.
Be advised that each group member must take an action in creating a group,
and your group cannot submit anything through CMS until the group has been created.
Submit the following files (at least):
-
TEAM a .txt file that contains the names (and net-ids) for all team members.
Also, for each team member give a 1 or 2 paragraph description of the tasks
this team member performed and the number of hours this required.
-
README a .txt file that contains
- The names and a description of the contents for the other files in the directory.
- Instructions for installing, compiling, and running your software on our
Windows system.
- A tutorial that the grader can follow to start your software and to convince
himself that your system implements the required functionality. Expect the grader
to spend at most 10 minutes on this task.
-
LOGIC a .txt file that contains
- A description of the experiments used to quantify the
timing and delay parameters characterizing your
synchronous system, along with your experimental data.
Also, give the parameter values
you use for failure detection and justify your choices.
- A description (at most 2 pages) of the specific primary/backup
protocol that was implemented and a pointer
to a publication (paper or textbook)
where this protocol is discussed.
- A description (at most 2 pages) of the recovery
protocol that was implemented along with an informal
correctness argument (at the
level that might be presented during a lecture in cs5410).
Give a pointer to a
publication (paper or textbook) if your protocol is
derived from that prior work.
- A description (at most 2 pages) of the failure detection
service you built, along with an informal
correctness argument (at the
level that might be presented during a lecture in cs5410).
Give a pointer to a
publication (paper or textbook) if this protocol is
derived from that prior work.
-
TOPO should specify an interesting interconnection topology for a
multi-branch bank
that will be used to illustrate the operation of your system.
-
TestPlan a .txt file that
describes the process and any tools (i.e. additional programs)
you wrote in order to test your system. This file should also explain what tests you
ran and why this was a reasonable set of tests to have run.
-
SourceCode A zip file containing the sources needed to compile and test your system.
Grading
Your grade will be based on the above documentation and
the following elements:
-
Does your system correctly implement some primary/backup protocol?
Higher grades will go to protocols that are more sophisticated in their operation,
meaning that their correctness argument is more subtle and/or they tolerate
a larger numbers of failures.
-
Does your system correctly implement a recovery protocol?
Higher grades will be awarded to protocols that are easier to understand yet
make sensible trade-offs so that the elapsed time is not too great
until a processor becomes a fully-functioning backup.
-
Is a reasonable scheme being employed for the failure detector?
How quickly are failures detected (so that length of any transient service outage
is not unnecessarily long).
How likely are false alarms and how are they handled?
-
Is a reasonable scheme being employed so that clients to the service
are notified when there is a fail-over and thus the primary
constituting their point of contact to the service is changed?
-
Were TCP/IP and UDP communications channels used as appropriate?
-
How were timing your parameters chosen?
Are they reasonable?
Is the justification credible?
-
How easy is it to follow the README file installation and sample-execution script?
-
How thorough was your testing procedure and how creative were you in building
sufficient scaffolding (test drivers etc) to test your system?
-
Is the source code easy to understand and does it exhibit good structure?