CS 414 Fall 1997: Homework 1 (due Sept

CS 414 Fall 2001: Homework 3 (due Oct 25)

Please type your solutions.

In solving the following two problems, assume that every computer on the network has a unique IP address and that communication between application programs is by unreliable point-to-point messages of arbitrary size (in effect, via UDP). As described below, programs can fail by crashing – halting – but not in other ways. For example, programs never go haywire and start to send corrupted messages with garbled content. Similarly, assume that the network can lose messages, deliver them out of order, and can even deliver more than one copy of the same message, but that it never damages a message. Finally, assume that there are no permanent network outages. Communication between process A and process B can be disrupted for an extended period of time, but not forever – if A hasn’t crashed and B hasn’t crashed, sooner or later some messages will get through provided that you send them again and again. But nobody knows the “time bound” on outages – a network outage could last for just a fraction of a second, but it could also last for minutes or hours.

1. Suppose that you are designing software for a new kind of computing system that will run on the Internet and will assist in automating patient care for a hospital. The idea is that patients will wear small units that can automatically administer drugs, and the hospital computing system, using protocols that run over the computer network, will adjust the dosages based on telemetry data from the patient and orders given by the doctor. Everything will be encrypted so that even though the data passes over the Internet, intruders can’t make sense of these packets or modify them in any way.

a) Any device can fail – these little devices, for example, depend on a battery pack and the battery can run low. Suppose that the device never fails midway through giving an injection, and that it fails by halting – crashing. Can a protocol be designed so that if the doctor says “inject x units of drug y” either the doctor is told within a few seconds that “the device failed and the injection wasn’t administered”, or the injection is performed exactly once and within a few seconds, the doctor is told “successful”? E.g. there should never be an “outcome uncertain” situation. Explain briefly, and if you believe the problem can be solved, give the protocol – “First, the application on the doctor’s desktop sends a message to the device containing…,” etc. Give us a short proof for the correctness of your algorithm. If you are convinced the problem can’t be solved, give a short proof that it can’t be solved.

In fact, the problem can’t be solved as stated. The difficulty is that if a device is told to inject the medication but then fails, we have no way to know what state it is in. If we could wait for it to reboot we could simply ask it then, but this could take a long time and hence violate our timing limits. To prove this more explicitly, assume that the application wishes to initiate an injection and that the device is operational. Even if we use some form of handshake, there must be some message m such that before m is sent to the device, no injection can have occurred yet, and after m is received, the injection should take place. Suppose that m is sent, but that no response is received within the time limit. There are several possible explanations for this situation. Perhaps the device failed, or the message was lost, and this happened prior to the delivery of m. (Even if you use a reliable protocol like TCP, m might still be in the channel, so this remains true). Perhaps, on the other hand, m was delivered, the injection was performed, and then when the acknowledgement or reply was being sent, the device or network failed. Thus, there are two indistinguishable system states, in one of which the injection occurred, and in the other it didn’t. It follows that the problem is not solvable.

b) What sorts of non-technical (human) roles need to be included in the kind of hospital system we’re talking about? Keep in mind that on the one hand, the hospital must keep expenses to a minimum. On the other hand, the hospital also must keep patients safe. Your answer should be reasonably brief – a paragraph or two, not pages. Base your answer on technical observations about the feasibility or infeasibility of solving technical problems (like in a), but try and translate those observations to conclusions about how to run the system safely.

In light of the answer to part (a) we clearly can’t rely entirely on technology. There needs to be a way for a medical technician, a nurse, or for the doctor him or herself to reach the patient in the event that a failure of this sort occurs. In the limit, no “solution” can really work because of the problem described in (a). Our focus should be on a solution that minimizes false alarms – that doesn’t send a nurse running to check the patient more often than necessary.

An example of the way that real hospitals solve this problem is as follows. Before initiating an injection, a record is stored on at least two servers saying that the injection request is about to be sent to the device. (In some cases the hospital buys special hardware, like a Compaq fault-tolerant server, that has the two computers built into it). Anyhow, once the record is stored, the device is told to do the injection (as above, this could take a few messages). Then the system logs that the injection was successfully administered. If something crashes during this protocol, there are two cases. One is a crash on the hospital side. But since we are using two servers (presumably on separate power supplies), at least one will still be running, so we’re ok – we log that the injection was completed and tell the doctor “done”. You can of course question this assumption, but the track record for Compaq’s fault-tolerant computers (used to be Tandem) is pretty good and this assumption is considered reasonable.

The other, harder case is the one that stumped us in part (a). Now we need to detect the problem by using a timer, then report to the nursing station or to a doctor that a technical failure has occurred while an injection was in progress. A buzzer might even sound. Until a human clicks “ok” this box would sit on their display and it would include the phone number and address of the patient. Then a nurse can call, see if things are ok, and log the details in the hospital database. Obviously, the buzzer will need to sound with enough time left before our “deadline” so that the nurse actually has time to do the check. But in practice these deadlines are normally long, like 15 minutes or half an hour. And normally one expects a response from the device within seconds. Thus there is normally ample time to realize that the device is broken. In fact, the normal situation is that if the device was healthy enough to exchange messages with the control system, it probably was also healthy enough to say “something went wrong when I went to do the injection.” That is, the failure would probably turn out to be a mechanical one, not a CPU crash. Thus in most cases the device will actually be able to say “I failed” and the scenario where the nurse needs to check is almost never a false alarm. At any rate, this is how real hospital systems really solve the problem.

We probably need one more mechanism. Ideally we want the system to be “fail safe”, meaning that when a failure does occur, the outcome is still safe for the patient. In our case, this means the nurse needs a way to figure out if the device injected the medication (otherwise the medication might be administered twice, or not at all). To solve this problem, the device itself probably needs a way to indicate its last successful action – even if it fails. Perhaps this would be visible on an LED of some sort. In the limit, perhaps the level of the drug in the reservoir of the device could be checked to see how many doses have been administered from it. But however we solve the problem, a fail-safe device does need a mechanism by which a nurse or a technician can physically look at the device and figure out if the drug was administered or not.

As an aside, these sorts of issues make it especially hard to build medical devices that will be implanted, like pacemakers or cardiac assist pumps. Not only do they need to solve the medical problem, but they need to fail only in ways that are still safe for the patient – a tricky question if the device is in the patient’s chest!

All of this sounds complex, but the bottom line is that to really achieve safety, we simply need a mixture of mechanisms. What real systems actually do is to use technology up to the limits we hit in (a), but then provide a way for people to step into the loop and figure out where things stand.

2. Suppose that you have a set of 3 email servers {s₀, s₁, s₂} and want to send an email in a manner that is guaranteed to get through reliably provided that no more than 1 server is ever down at any point in time. You can assume that a recovering server is solidly up again before any other failure occurs. Thus when an email is sent there may be 2 servers running or all 3 may be running. There would never be less than 2 servers running. Describe a protocol for sending a mail message, a protocol for checking for mail, and a protocol for deleting a mail message which will work correctly as long as these assumptions are satisfied. In designing your protocol you can add additional information to a mail message if necessary, such as the IP address of the sending machine, counters, the time, etc. But tell us what this extra information is, and use a Java-like notation for any code that you actually write down.

We got some very elaborate software solutions to this problem. Oddly, though, most were focused on the details of how to use Java to send a MIME-encoded email message to a server using a protocol like SMTP, IMAP or POP3. The real hope (not many people got this idea, though) had been for solutions using the following general approach.

Each email server should implement operations called int put_mail(string username, string mail), delete_mail(string username, int mailid) and maillist check_mail(string username). Our main challenge is to deal with failures, and as part of this we’ll use an extra interface, commit_mail(int mailid). In what follows we won’t even consider the case where more than 2 servers fail, or where the client fails. Our goal is to tolerate just a single failure. Moreover, we won’t try to guarantee what is called “liveness” – we won’t guarantee that our solution actually runs in any fixed time limit. As mentioned in class there is a famous theorem proving that you can’t actually guarantee liveness in a fault-tolerant system without an accurate way to sense failures, so the approach I take here is about as good as one can do. In practice, of course, the solution should work reasonably well.

To send an email from “who” with body “what”, we’ll just use two stages. First, try to store the email at each server. Under our assumptions, at least two will be running, so we just wait for two replies. To do this we need a separate thread for each put… I don’t have an easy way to write the code for this and hence used a sort of pseudo-notation. What we really want to do here is have each of three threads attempt the put. As the put replies, increment a “success count”. When it reaches two, break out of this code and terminate the third attempt.

More assumptions: I’ll assume that sending a message from the client to the server involves passing it to a protocol like TCP that keeps trying (forever) until the data gets there. On the other hand, we’ll also assume that the client gives up on waiting for replies after some amount of time delta. This is a funny mixture of approaches – trying forever in the TCP protocol down in the OS kernel and yet timing out at the application level where the server and client have a direct dialog underway. For our purposes, though, it seems to do the trick.

void send(string who, string what)

{

int success_count = 0;

foreach(mailserver Î {s₀, s₁, s₂})

{

AS A SEPARATE THREAD

if((ids[mailserver.index] =mailserver. put_mail(who, what)) != -1)

++success_count;

if(success_count > 1)

/* OK to terminate the remaining put_mail */

BREAK AND TERMINATE THE THIRD THREAD

}

This solution works because we assumed that the client stays up, and we assumed that two servers will be operational. The third might not see our put request. Thus, when the send is finished, the email has been stored on two or three servers.

To check for email, ask each server for a list of mail that it knows about. Again, we assume that at least two are up. Same awkward notation used above. Then, discard duplicate items, regonizable by their id’s and return the list. For simplicity assume that a maillist contains the body of the emails. Notice that since one server may have been down when we stored the mail, any mail will be available in at least one server but perhaps not two. The mailserver.checkmail function should only list emails that are committed.

maillist check(string who)

{

maillist list =EMPTYLIST;

int success_count = 0;

foreach(mailserver Î {s₀, s₁, s₂})

{

AS A SEPARATE THREAD

maillist mlist = mailserver.checkmail_mail(who);

if (mlist != ERROR)

++success_count;

if(success_count > 1)

BREAK AND TERMINATE THE THIRD THREAD

}

/* sort list, then delete any duplicates or mails listed as “deleted” by one server and yet shown as in the list by another */

sort_and_fix(list);

return(list);

}

To delete an email, just ask two or more servers to do so. Same idea.

void delete(who, id)

{

int success_count = 0;

foreach(mailserver Î {s₀, s₁, s₂})

{

AS A SEPARATE THREAD

if(mailserver.deletemail_mail(who, id) != SUCCESS)

++success_count;

if(success_count > 1)

BREAK AND TERMINATE THE THIRD THREAD

}

Not shown here, but needed, is the code to deal with server recovery. If a server crashes, you can imagine that while it is down, someone might delete emails. When it recovers, for this reason, a server needs to verify that for each id it knows about, at least one of the other servers knows that id too. In fact this can be ignored in the sense that our “cleanup” routine, sort_and_fix, would notice the delete item if we have a notation to represent it. But that approach is also wasteful of space and requires that a deleted item be “remembered”. Better is to check on recovery: if both other servers are up, and yet neither knows an id, the id was deleted. Obviously, this means that a server can’t recover while one or both of the other servers are crashed. But this is the unavoidable cost of high availability!

Finally, I haven’t shown the code to actually send mail to a server, particularly if you want to be IMAP or SNMP or POP3 compatible. I was astonished to see that some people actually handed in code at that level of detail. It wasn’t needed and isn’t really part of the problem as stated in this homework. We’ll give lots of credit for such a solution (unless you cut and pasted from an open source implementation of one of the protocols), just to honor the awesome hacking ability (assuming this was your own code) of the people who provided these mega-solutions. But if you wrote one of them, keep in mind that it is critical, in distributed computing, to not lose track of the forest for the trees. The problem stated here was focused on how to interact with 3 replicated servers, 1 of which might be down. In some sense, extensive chunks of code copied from the open source implementations of IMAP and so forth just aren’t relevant to the question as it was posed…