CS514 Fall 2000: Homework 1. Due in class on Tuesday, Sept. 12

1. Read the following paper: H. Saltzer, D. P. Reed, and D. Clark, "End-to-end arguments in system design", ACM Transactions on Computing Systems, vol. 2, no. 4, 1984. Note that you can obtain this online through our library, which has a digital library access agreement with ACM (http://www.acm.org). Hard copies of the journal itself are also available in Carpenter Library.

a. Suppose that you are developing a network for connection to mobile users who have wireless receivers. The loss rate on the wireless link turns out to be several orders of magnitude larger than for links within the “land-based” part of the network infrastructure. According to the end-to-end philosophy, should error correction be done on the wireless link, or should all error-correction occur end-to-end?

The paper argues that if loss rates are uniform and low in the network, one should do all error correction between the end-points. But in this case one link is a disproportionate source of loss, so we should do correction close to that link to bring it up to the same statistical behavior as the rest of the network.

b. In class, we saw that routers drop packets to signal to TCP when overload occurs. Recall that routers apply this policy “early”, under the random early detection (RED) approach. In a strict end-to-end world, shouldn’t the router be completely indifferent to the protocol running end-to-end near the application layer? Does RED violate the end-to-end philosophy?

Actually, I would say that RED does violate the end-to-end philosophy. Obviously, the philosophy is expressed in terms of packet loss and recovery techniques, not throwing packets away. But the fact remains that RED is more or less a part of the end-to-end flow control mechanism that has been moved down into the network. I suppose that the counter argument is that we can’t do this equally well between the end-points, but personally, my feeling is that end-to-end yielded to pragmatism in this specific case.

2. Remote Procedure Call (RPC). Give pseudo-code in a Java-based notation to solve the following problem, or explain why the problem cannot be solved. A certain device is controlled by a set of 5 computers, each of which operates a motor (this is common in factories, where large pieces of equipment might have many controllers and motors to operate them). Your job is to program a central controller (a sixth machine) to start all the motors at once.

The procedures available are:

check_status(motor-number), returns {MTR_OK, MTR_FAULT}

(for motor-number in 0..4)

set_motor_status(motor-number, status) returns {MTR_OK, MTR_FAULT}

(for motor-number in 0..4; status in {MTR_ON, MTR_OFF}

Your solution should work as follows: if all the motors are OK, then set all motors to MTR_ON, checking the returned status code for possible new faults. If any motor is faulty at the outset, or if any motor fails when turned on, all motors should be turned off. Assume that in addition to the normal behavior of these procedures, they can throw a timing exception, which means that the RPC has timed out. You may assume that no real software failures occur – the various control programs don’t actually crash. However, with some non-negligible probability the network can lose a packet, programs can execute slowly for lots of reasons. Assume that the motors themselves have a significant failure rate.

Is a problem such as this posed correctly? Can you suggest a better (“practical, and adequate”) specification that avoids the issues raised by this specification?

As stated, the problem definitely can’t be solved. Consider just the very last step: suppose that my code decides to turn on the motors. As things are stated here, the network could drop all my packets to motor A while letting all my packets through to motors B, C, D and E. So there isn’t any point in giving pseudo-code for the solution; it can’t be done.

But if we relax the rules we can use 2-phase commit (I won’t reproduce the code, it looks just like the one we saw in class). The relaxed specification should state:

a) We assume that if a packet is retransmitted enough times, it will get through with arbitrarily high probability. Thus, the network doesn’t suffer partitioning failures.

b) We assume that the control program is running on a different machine than the ones on which the motors run, and that failures can be detected with some high probability.

c) Our goal is that with some specified probability of error p, either all motors are turned on, or no motor is turned on. For example, the factory might specify p=1-10^-9. From this probability we can then compute the appropriate retransmission rates and retry limits, timeouts, etc, to ensure that indeed, the problem can be solved. However, note that our ability to accurately detect failure ultimately limits the probability that the problem can be solved.

In summary, if we weaken the goal to become probabilistic, a 2PC protocol suffices.