Homework assignment 1

Homework assignment 1: Due in 3 weeks, on Sept 22. Hand in solutions via CMS. START SOON!

A common problem in data center settings is to implement a service that has some form of leader. The hardest part is bootstrapping: the first server to be launched should become the initial leader, while subsequent servers should be backups, waiting to ascend to the throne if the leader fails. But once a system is running, the issue of monitoring leader status and managing the fail-over is also potentially tricky (see class notes on "split brain" problems).

Implement a solution to this problem on Windows with .NET, using C# or C++ in Visual Studio. Use the Windows socket interface and UDP message passing for all communication. Design your solution to have two "sides": a library that could be reused by others, and an application that uses the library and has a user interface showing the system state. Think hard about the best way to handle the initial rendezvous, and make sure to address races in which two servers are launched simultaneously.
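As one possible way to think about the simultaneous-launch race, here is a minimal C++ sketch of a deterministic tie-break. It assumes each server announces a launch timestamp plus a unique endpoint string; the names Candidate, outranks and electLeader are hypothetical and not part of any required API. Because every server applies the same total order to the same set of candidates, two servers that start at the same instant still agree on one winner.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record describing one server that claims to be running.
struct Candidate {
    std::uint64_t startMillis;  // launch time reported by the server itself
    std::string   endpoint;     // unique "ip:port" string, used to break ties
};

// Strict total order: the earlier launch wins; equal launch times fall back
// to comparing endpoints, so two simultaneous launches never tie.
inline bool outranks(const Candidate& a, const Candidate& b) {
    return std::tie(a.startMillis, a.endpoint) <
           std::tie(b.startMillis, b.endpoint);
}

// Every server runs this over the same candidate set and gets the same answer.
inline Candidate electLeader(const std::vector<Candidate>& candidates) {
    assert(!candidates.empty() && "need at least one candidate");
    Candidate best = candidates.front();
    for (const Candidate& c : candidates) {
        if (outranks(c, best)) best = c;
    }
    return best;
}
```

This only settles who wins once the servers have heard about each other; it does not by itself guarantee that they have, and self-reported clocks on different machines may disagree, which is why the endpoint is part of the ordering.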

Note: Actually, we won't be upset if you work on a Linux platform or use Java. But we won't be able to provide nearly as much help if people deviate from our main recommendation.

Some rules of the game:

Your solution will take the form of a library of routines that can be used by various applications. (In Windows, a library is often referred to as a "DLL").
An application using the DLL will need to tell it its "name", like "lock service" or "current inventory". A file-pathname style of name would be best (e.g. "/amazon/us-nw/inventory"). Each distinct application would be handled by the DLL separately, having its own leader. The DLL per se should support large numbers of applications, and it should be possible to run many applications on a single machine for purposes of testing.
You will need to convince yourself that the solution comes as close as possible to ensuring that there is always a single leader, unless the service shuts down entirely. If your implementation might sometimes have no leader, or multiple leaders, convince yourself that the condition can't last long. Ideally, come up with a time limit, like "2 seconds", such that if a problem of this nature arises, it will resolve within that time limit.
Your API matters. Design your DLL as if it will become a widely used standard on which the entire fate of Google, Amazon, or the world financial system might depend. Think hard about elegance, simplicity, and clarity.
Decide how the DLL will notify the application when status changes, e.g. when a new member joins, or when the leader role changes.
Protect your solution against multithreaded access. Threads are common in modern systems and your solution should be thread-safe. In your own case, threads arise (at least in Visual Studio C#) because, unless you launch a separate thread to run the code, the application becomes unresponsive to console input once the user presses the "run" button. A warning though: launching a thread makes the application harder to debug and introduces some annoyances, such as difficulty accessing Windows Forms controls like text boxes (they can only be updated from the thread that created them, or through the control's "Invoke" method with a call-back function as an argument: not hard, but irritating to get right).
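To make the rules about names, notifications and thread safety more concrete, here is a small C++ sketch of what the core of such a library might look like. It is an illustration under assumptions, not the expected design: LeaderRegistry, StatusEvent, StatusCallback, join, setLeader and leaderOf are all hypothetical names. Each path-style group name gets its own leader, a mutex guards the shared state, and callbacks are invoked after the lock is released so that an application callback cannot deadlock the library.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event passed to the application when group status changes.
struct StatusEvent {
    std::string group;   // path-style name, e.g. "/amazon/us-nw/inventory"
    std::string leader;  // endpoint of the current leader ("" if none known)
};

using StatusCallback = std::function<void(const StatusEvent&)>;

// Tracks the leader of each named group; safe to call from several threads.
class LeaderRegistry {
public:
    // Register interest in one group; each group has its own leader.
    void join(const std::string& group, StatusCallback cb) {
        assert(!group.empty() && group[0] == '/');
        std::lock_guard<std::mutex> lock(mu_);
        groups_[group].callbacks.push_back(std::move(cb));
    }

    // Record a new leader and notify listeners only if it actually changed.
    void setLeader(const std::string& group, const std::string& leader) {
        std::vector<StatusCallback> toNotify;
        {
            std::lock_guard<std::mutex> lock(mu_);
            Group& g = groups_[group];
            if (g.leader == leader) return;
            g.leader = leader;
            toNotify = g.callbacks;  // copy so callbacks run without the lock
        }
        for (const StatusCallback& cb : toNotify) cb(StatusEvent{group, leader});
    }

    // Current leader of a group, or "" if the group is unknown.
    std::string leaderOf(const std::string& group) const {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = groups_.find(group);
        return it == groups_.end() ? std::string() : it->second.leader;
    }

private:
    struct Group {
        std::string leader;
        std::vector<StatusCallback> callbacks;
    };
    mutable std::mutex mu_;
    std::map<std::string, Group> groups_;
};
```

A real library would also report membership changes and would have to decide which thread the callback runs on; in a Windows Forms application that choice determines whether the handler needs to marshal back to the UI thread.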

.... plus one very important rule about doing your own work:

This assignment is to be done individually. You may discuss your approach with others in the class, but all aspects of the implementation must be entirely your own work, except for code cut-and-pasted from Visual Studio itself. You must not show your code to others in the class, and if you help someone debug their solution, limit yourself to suggestions and do not touch the keyboard!

We expect you to implement and test your solution, evaluate its performance and scalability, and document the whole story.

A few hints: You will probably be confused about what is and isn't permitted for the very first messages your application sends, because at that initial step, the program has no idea whether there are other instances of the same application running or not. Here are options to consider. You can pick one, or rule them all out and do something of your own. The more flexible and general the better.

1. UDP broadcast (IP multicast). In effect, when starting up, the program can broadcast "hello out there!" and if anyone receives the message, the receiver can reply "hi, welcome to the group". If you do use this approach, think about the issue of class-D (multicast) IP addresses and port numbers: how will you pick them? As a quick comment, Professor Birman really likes this approach because no special extra services are needed and you only need to implement one application. So if all else is equal, this is what he recommends. Use a small value for the TTL field when you send your UDP multicast, like TTL=1, to ensure that messages can't "flood" the CSUG lab!!! Also, pick an unusual, randomly chosen port number, and even so, think about how to deal with incoming UDP multicast messages sent by some other student (who accidentally picked the same port number). If junk of that sort shows up, your program needs to filter it out. For example, you can send messages with your netid in a header and reject incoming messages that don't start with your netid -- anything like this would protect against surprises. Also do keep in mind that UDP multicast is unreliable and packets can be dropped. This is very rare, but it argues for sending two or three times before assuming that your program is the first service instance to be started up.
2. Rendezvous through a web service. You could build a little service to help out just at the startup. Applications would "touch base" through it, and it would hand out suggestions along the lines of "If <IP,port> is still running, it's in the group." Think about races where processes run for very short periods of time (like fractions of a second). Could someone have trouble joining because the service is always very out of date? Comment: Professor Birman thinks this is not as good a solution because it forces you to build two applications. But you would get experience with web services this way, so he's ok with this if you understand it better. You would need to hard-wire the location of the service into your main application.
3. Rendezvous using DNS in much the same role as option 2. Hard subproblem: gaining programmatic control over DNS entries in Windows isn't as easy as you might wish. Second (easy) problem: application group names would need to look like and behave like machine names, so that DNS can understand them. This is ok -- it reuses an existing technology that definitely has to be working for the Internet to be available at all. But talking to the DNS at this level could be a whole project in itself.
4. User types in rendezvous information through a console interface. Think about the challenge of convincing the TA that this is actually a useful option. What about the issue raised under option 2? This is kind of cheating but could be a good way to get started, so that you can FIRST do the leader selection/monitoring technology and only then add in the startup code...
5. Fixed command-line arguments or configuration files. See comments on option 4.
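For the UDP multicast option, the netid filtering and the "send two or three times" advice can be sketched in a few lines of C++. This is only an illustration: the wire format "<netid>|<group>|<payload>", the sample netid in the usage note, and the function names frameMessage, acceptMessage and mayAssumeFirst are all assumptions, and the socket calls themselves are left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical wire format: "<netid>|<group>|<payload>". The separator and
// the field order are assumptions made for this illustration only.
inline std::string frameMessage(const std::string& netid,
                                const std::string& group,
                                const std::string& payload) {
    assert(netid.find('|') == std::string::npos);
    assert(group.find('|') == std::string::npos);
    return netid + "|" + group + "|" + payload;
}

// Returns true and fills `payload` only if the datagram carries our netid and
// our group name; anything else (another student's traffic, junk) is dropped.
inline bool acceptMessage(const std::string& datagram,
                          const std::string& netid,
                          const std::string& group,
                          std::string& payload) {
    const std::string prefix = netid + "|" + group + "|";
    if (datagram.compare(0, prefix.size(), prefix) != 0) return false;
    payload = datagram.substr(prefix.size());
    return true;
}

// Decide whether to claim leadership after sending "hello" probes. UDP may
// drop packets, so a single unanswered probe is not treated as proof that
// nobody else is running.
inline bool mayAssumeFirst(int probesSent, int repliesSeen,
                           int requiredProbes = 3) {
    return repliesSeen == 0 && probesSent >= requiredProbes;
}
```

With a made-up netid "abc123", frameMessage("abc123", "/amazon/us-nw/inventory", "hello") yields "abc123|/amazon/us-nw/inventory|hello", and acceptMessage rejects the same datagram when called with any other netid or group name.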

More hints that aren't related to startup:

If you use the UDP multicast approach (point 1 in the first set of hints), your entire solution can take the form of a single DLL plus a single application that demos the technology. The application would be a "Windows forms application" with a nice user interface. Some ideas for the interface:
- It can have a few "textBox" controls into which one enters the application name, the port number it will use, timeout parameters, etc.
- A "button" can tell it to "Join". This way you can launch lots of copies, type in the needed parameter values on each, and then tell them to join at the frequency and in the order that you find convenient. So, you would start a few instances up, perhaps on a single machine, and then click "Join" on a few of them.
- You can either have an "exit" button, or just click the "x" in the top-right corner to kill the application off.
- It can show who the leader is using a nice color-coded message displayed into a textbox (maybe "Red" means "not me" and "Green" means "I'm the leader"). You just need to set the background color for the textbox control. Change the color when the status changes.
- You'll need to check periodically for input. Although you can do this with threads, it may be easier to just set up a system timer event and have it fire perhaps ten times per second. Then you can do any time-based actions when the timer event handler is called, such as checking for incoming messages, sending messages, etc.
- Use the "WinSock" message interface to create a socket, bind an IP multicast (IPMC) address to it (if you use option 1), and to send/receive/poll for messages.
If you DO use threads, be aware that only the main thread can access the controls in your form application. You need to use something called the "invoke" method to access a control from a thread other than the one that created it. This is a pain, although not to the point of being a crippling show-stopper. Use the Visual Studio help to get to the code you need to cut and paste if you wander down this path...
Again, use a small TTL value when sending a UDP multicast. The value 1 will be fine. Do not ignore this advice, or your application could behave in a way that would disrupt the Cornell network!
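The timer-driven approach suggested above (a timer firing about ten times per second) could be organized around a small helper like the following C++ sketch. It is a sketch under assumptions: HeartbeatMonitor, TickActions and the millisecond parameters are hypothetical, the caller is assumed to supply a clock that never goes backwards, and the actual sending and receiving of messages is left to the caller.

```cpp
#include <cassert>
#include <cstdint>

// What the timer handler should do on this tick.
struct TickActions {
    bool sendHeartbeat;    // time to announce that we are alive
    bool leaderSuspected;  // leader has been silent longer than the timeout
};

// Hypothetical helper driven by a periodic timer (e.g. every 100 ms).
class HeartbeatMonitor {
public:
    HeartbeatMonitor(std::uint64_t intervalMs, std::uint64_t timeoutMs,
                     std::uint64_t startMs)
        : intervalMs_(intervalMs), timeoutMs_(timeoutMs),
          lastSentMs_(startMs), lastHeardMs_(startMs) {
        assert(intervalMs > 0 && timeoutMs > intervalMs);
    }

    // Call when a heartbeat from the leader arrives.
    void onLeaderHeartbeat(std::uint64_t nowMs) { lastHeardMs_ = nowMs; }

    // Call from the timer event handler with the current time.
    TickActions onTick(std::uint64_t nowMs) {
        TickActions a{false, false};
        if (nowMs - lastSentMs_ >= intervalMs_) {
            a.sendHeartbeat = true;
            lastSentMs_ = nowMs;
        }
        a.leaderSuspected = (nowMs - lastHeardMs_ > timeoutMs_);
        return a;
    }

    // Rough worst-case delay before a silent leader is noticed: the timeout
    // itself plus one timer period, since checks only happen on ticks.
    std::uint64_t detectionBoundMs(std::uint64_t tickMs) const {
        return timeoutMs_ + tickMs;
    }

private:
    std::uint64_t intervalMs_;
    std::uint64_t timeoutMs_;
    std::uint64_t lastSentMs_;
    std::uint64_t lastHeardMs_;
};
```

With an assumed 1500 ms timeout and a 100 ms timer, detectionBoundMs(100) gives 1600 ms to notice a failed leader. That covers detection only; the time needed to agree on a replacement has to be added before comparing against a target such as "2 seconds".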

Hand in via CMS (the CS Course Management System): a single document, ideally in PDF format (you can create multiple PDF files and then combine them if you like), with your name and netid on the front page. It should include:

A short writeup explaining how the solution works. No more than 1 1/2 pages in length, not counting illustrations. If you include them, they should be drawn to the same quality standards you would expect from documentation of a library you might find yourself using.
Documentation of the library API, mimicking the style of documentation used by Visual Studio when you use the help interface
A summary of the performance evaluation, giving the latency associated with launching a new member of an application group graphed as a function of the number of applications running on the platform, and the latency of fail-over when the leader crashes. Note: If evaluating scalability as a function of the number of application groups turns out to be too hard, you can just measure performance for a single group and report the amount of time for launching the first member, for launching additional members, and for handling failures. So: one group is enough, but evaluations for large numbers of groups would be way cool and would impress us. A real library would need to work for large numbers of groups, but you'll do fine in CS5140 even if you can't evaluate that case...
A short discussion of the behavior of your solution if the leader has a transient fault but then resumes "normal" operations
A printout of the library code
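For the performance evaluation, a small timing harness along these lines may help produce the join and fail-over latencies. It is a hedged C++ sketch: timeOnceMs, summarize and LatencySummary are hypothetical names, and what counts as "one operation" (for example, launching a member and waiting until it reports that it has joined) is up to you.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical summary of repeated latency measurements, in milliseconds.
struct LatencySummary {
    double minMs;
    double medianMs;
    double maxMs;
};

// Time one operation with a monotonic clock and return the elapsed time.
template <typename Op>
double timeOnceMs(Op&& op) {
    const auto start = std::chrono::steady_clock::now();
    op();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Summarize samples; the median is less sensitive to outliers than the mean.
inline LatencySummary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
    return LatencySummary{samples.front(), median, samples.back()};
}
```

Repeating each measurement several times and reporting the minimum, median and maximum gives one point per configuration, for instance one point for each number of running applications on the graph.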

Our course TA, Jonathan, may request a demo of your solution or pose other questions about it.

Can't finish on time? Make sure to ask Jonathan for help!