CS514 Fall 2000: Homework 4. Due in class on Thursday October 12
 
This is a real problem posed by NASA space scientists.  In the past, NASA
used expensive duplicated CPUs and so-called triple-modular-redundant (TMR)
memories (three-way memories that vote on each output) in space vehicles.
This approach was needed because, in space, cosmic rays flip bits in a
computer's memory or CPU at a low but nonzero rate: perhaps 3 or 4 times
per day.
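
To make the voting idea concrete, here is a minimal software sketch of a three-way memory word. It is only an illustration of majority voting, not a description of the actual NASA hardware; the names `majority_vote`, `flip_bit`, and `TMRWord` are hypothetical, and the write-back ("scrubbing") step on read is an assumption added for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Simulate a cosmic-ray hit by inverting one bit of a word."""
    return word ^ (1 << bit)


class TMRWord:
    """One memory word stored as three independent copies (hypothetical model)."""

    def __init__(self, value: int) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        # Vote on the output, then write the voted value back to all three
        # copies so that a single flipped bit does not linger until a second
        # flip hits the same bit position in another copy.
        voted = majority_vote(*self.copies)
        self.copies = [voted, voted, voted]
        return voted


word = TMRWord(0b10110010)
word.copies[1] = flip_bit(word.copies[1], 3)  # corrupt one of the three copies
print(bin(word.read()))                       # the vote masks the single flip
```

A single flipped bit in any one copy is outvoted by the other two. Two flips at the same bit position in different copies would defeat the vote, which is why the sketch repairs the copies on every read.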
 
This is much less of a
problem for data stored on disks, because of the error correction codes
implemented by the disk hardware.
 
But as computers get faster and cheaper, it is harder and harder to work
with slow, old, expensive computing technologies.  Starting five years from
now, NASA won’t be able to afford this kind of special hardware and will be
forced to build space systems that contain normal computers, like the
Linux or NT machines we use at Cornell.
On these, they need to run applications ranging from moderately critical
(like trajectory planning) to mundane (like image processing and
compression).  Very critical tasks like
firing the thrusters would still be run on redundant, TMR-protected hardware,
but the hope is to minimize the need for these kinds of special components.
 
How would you recommend that NASA solve this problem?  Keep in mind that
their goal is to work with cheap, off-the-shelf components.  Proposals that
they build new kinds of chips or that they rewrite the operating system will
be considered, but NASA will certainly favor a clever, cost-effective solution
over a very expensive or very complex one.
 
Goals for the NASA solution include fault-tolerance (it tolerates a low rate
of random bit-flipping), correctness (when a bit flips, the quality of the
output is not hurt), and performance (ideally, things won’t slow down by more
than a factor of two or three).  Keep in mind that the flipped bits can be
instructions, not just data, and can be in the operating system, not just the
application!
 
Rules of the game: you can put some code in the highly protected part of the
machine, but nothing really big.  You do have access to the source code for
the operating system and for some of the libraries, but you don’t want to
rewrite all of NT as part of your solution!  You can certainly tell, at
runtime, which segments hold code and which hold data, for both the OS and
the applications.  In Linux, this corresponds to having unrestricted access
to /proc.
Finally, the low-priority tasks that are of interest to us can simply be
restarted if something goes wrong.  They
don’t fire thrusters or anything; they are more typically programs that extract
features from images or convert raw data into processed information.
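
As an illustration of what "telling code from data at runtime" can look like for an application on Linux, the sketch below parses text in the format of the per-process map file /proc/<pid>/maps and classifies each mapping by its permission flags. It is a sketch under stated assumptions, not part of the assignment: the names `Segment`, `parse_maps`, `is_code`, and `is_data` are hypothetical, and this file covers only user-space mappings, not the kernel's own code and data.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int   # first address of the mapping
    end: int     # one past the last address
    perms: str   # e.g. "r-xp" (code) or "rw-p" (writable data)
    path: str    # backing file or a label such as "[stack]"; may be empty


def parse_maps(text: str) -> List[Segment]:
    """Parse lines of the form 'start-end perms offset dev inode [path]'."""
    segments = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        segments.append(Segment(int(start_s, 16), int(end_s, 16), fields[1], path))
    return segments


def is_code(seg: Segment) -> bool:
    # Assumption for this sketch: executable and not writable means code.
    return "x" in seg.perms and "w" not in seg.perms


def is_data(seg: Segment) -> bool:
    # Assumption for this sketch: writable and not executable means data.
    return "w" in seg.perms and "x" not in seg.perms


SAMPLE = (
    "00400000-0040b000 r-xp 00000000 08:01 131090 /bin/cat\n"
    "0060b000-0060c000 rw-p 0000b000 08:01 131090 /bin/cat\n"
    "7ffd1000-7ffd3000 rw-p 00000000 00:00 0 [stack]\n"
)

for seg in parse_maps(SAMPLE):
    kind = "code" if is_code(seg) else "data" if is_data(seg) else "other"
    print(f"{seg.start:#x}-{seg.end:#x} {kind} {seg.path}")
```

On a Linux machine the same function could be applied to the contents of /proc/self/maps instead of the sample string. Code segments should not change once loaded, which makes them easier to check for corruption than data segments.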
 
Please type your answer.  The maximum is 1½ pages of text, but a ½-page figure will be accepted if needed.