CS514 Fall 2000: Homework 4.  Due in class on Thursday October 12

A real problem posed by NASA space scientists.  In the past, NASA used expensive duplicated CPUs and so-called triple-modular-redundant (TMR) memories (a three-way memory that votes on each output) in space vehicles.  This approach was needed because in space, cosmic rays tend to flip bits in a computer memory or CPU with some low frequency: perhaps 3 or 4 times per day.
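To make the voting idea concrete, here is a minimal software sketch of what a TMR memory does in hardware on each read.  The function name `tmr_vote` and the sample values are illustrative assumptions, not part of any NASA design: each output bit is taken from whichever value at least two of the three copies agree on.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three copies of the same word."""
    # A result bit is 1 exactly when at least two of the three copies have a 1.
    return (a & b) | (a & c) | (b & c)


stored = 0b10110010
# Simulate a cosmic ray flipping one bit in the second copy only.
copies = [stored, stored ^ 0b00001000, stored]
assert tmr_vote(*copies) == stored
```

A single flipped bit in any one copy is outvoted.  Two copies corrupted at the same bit position would defeat the vote, which is why this scheme assumes a low flip rate.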

This is much less of a problem for data stored on disks, because of the error correction codes implemented by the disk hardware.

But as computers get faster and cheaper, it is harder and harder to work with slow, old, expensive computing technologies.  Starting five years from now, NASA won’t be able to afford this kind of special hardware and will be forced to build space systems that contain normal computers, like the Linux or NT machines we use at Cornell.  On these, they need to run applications ranging from moderately critical (like trajectory planning) to mundane (like image processing and compression).  Very critical tasks like firing the thrusters would still be run on redundant, TMR-protected hardware, but the hope is to minimize the need for these kinds of special components.

How would you recommend that NASA solve this problem?  Keep in mind that their goal is to work with cheap, off-the-shelf components.  Proposals that they build new kinds of chips or that they rewrite the operating system will be considered, but NASA will certainly favor a clever, cost-effective solution over a very expensive or very complex one.

Goals for the NASA solution include fault-tolerance (it tolerates a low rate of random bit-flipping), correctness (when a bit flips, it won’t hurt the quality of the output), and performance (ideally, things won’t slow down by more than a factor of two or three).  Keep in mind that those bits can be instructions, not just data, and can be in the operating system, not just the application!

Rules of the game: you can put some code in the highly protected part of the machine, but nothing really big.  You do have access to the source code for the operating system and for some of the libraries, but you don’t want to rewrite all of NT as part of your solution!  You certainly can tell, at runtime, which are the code and data segments for the OS and the applications.  In Linux, this corresponds to having unrestricted access to /proc.  Finally, the low-priority tasks that are of interest to us can simply be restarted if something goes wrong.  They don’t fire thrusters or anything; they are more typically programs that extract features from images or convert raw data into processed information.
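As an illustration of telling code from data at runtime, the sketch below parses the per-process memory map that Linux exposes as text, one region per line, in files such as /proc/self/maps.  The names `Region`, `parse_maps_line`, and `read_regions` are hypothetical helpers, and the rule "executable means code, writable means data" is a simplifying assumption.  This only covers a user process's mappings; kernel segments would have to be located some other way.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int   # first address of the mapping
    end: int     # one past the last address
    perms: str   # e.g. "r-xp" or "rw-p"
    path: str    # backing file, or "" for anonymous memory
    kind: str    # "code", "data", or "read-only" (our own labels)


def parse_maps_line(line: str) -> Region:
    """Parse one line in the format: address perms offset dev inode [pathname]."""
    fields = line.split(None, 5)
    start_s, end_s = fields[0].split("-")
    perms = fields[1]
    path = fields[5].strip() if len(fields) > 5 else ""
    # Simplifying assumption: executable regions hold instructions,
    # writable regions hold mutable data.
    if "x" in perms:
        kind = "code"
    elif "w" in perms:
        kind = "data"
    else:
        kind = "read-only"
    return Region(int(start_s, 16), int(end_s, 16), perms, path, kind)


def read_regions(maps_path: str = "/proc/self/maps") -> List[Region]:
    """Read and classify every mapping listed in a maps file (Linux only)."""
    with open(maps_path) as f:
        return [parse_maps_line(line) for line in f if line.strip()]
```

On a Linux machine, `read_regions()` lists the calling process's own regions, and passing the maps path of another process inspects that process instead, given sufficient privileges.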

Please type your answer.  Maximum of 1½ pages of text, but a ½ page figure would be accepted if needed.