CS514 Fall 2000: Homework 4. Due in class on Tuesday, Oct. 10
 
NASA wants to send normal computers into space. How should they deal with reliability in an environment where a bit might flip as often as once every 8 hours?
 
There are a number of ways to approach this problem. What I was hoping to see is more or less a form of “triage” in which your answer would recognize that:
•  An unprotected computer is a complex system.
•  The likelihood of errors differs in different subsystems.
 
In fact, we need to keep in mind that one bit flip every 8 hours is really a pretty low error rate. Software bugs are probably occurring this frequently! So in a certain sense, it isn’t actually clear that we need to do anything out of the ordinary.
 
But my suggestion to NASA is that they use a mixture of simple methods to make it more likely that failures will be detected, and then that they restart failed system components. Critical functions for operating the mission and for doing this checking can live in the TMR (triple modular redundancy) part of the hardware, which supposedly won’t experience problems. Examples of checks they could run:
•  They can periodically compute a checksum for any part of the system that shouldn’t change while it is in use, like the code segment of the operating system, and reload that segment from the disk if a problem is found. The disk uses error-correcting codes in hardware and hence should be less prone to problems than the RAM. The longer something sits in memory, the bigger the risk that it will be corrupted, so they should also push things to the disk quickly and not use caching as aggressively as normal computer systems do. NASA should aim for a job mix in which stuff that sits around for a long time checks itself and is able to restart itself now and then if errors are detected, and where other things get into memory, run fast, and get out fast. (A sketch of this kind of check follows the list.)
•  They can run applications twice (or put two copies side by side) and compare the outputs. If the outputs differ, delete them and just try again. (A sketch of this also follows the list.)
•  They could use a fancy scheme like the one employed by Hermann Kopetz in his work on the MARS system in Austria. But this is mostly for real-time applications, and NASA might be more focused on mundane stuff like image processing.
•  They should certainly use self-checking mechanisms like “keep alive” checks, so that if something does experience a problem it has a good chance to notice the problem and restart itself. The system as a whole needs fault-tolerance mechanisms to make restart easy and quick. (A keep-alive sketch follows the list.)
•  They might be able to hack the compiler to generate redundant code and duplicated data. This is a different way to accomplish the basic idea of running applications twice, side by side. Perhaps a bit more complex? (A small example of duplicated data closes out the sketches below.)
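
To make the first check concrete, here is a minimal sketch in C of the checksum idea. It assumes the protected region can be treated as a flat buffer with an ECC-protected master copy on disk, and the simple additive checksum stands in for whatever CRC a real system would use; none of the names here come from an actual flight system.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* Stand-in for a region that shouldn't change while in use (e.g., the
 * OS code segment); here it is just a buffer so the sketch runs. */
static uint8_t segment[1024];
static uint8_t disk_copy[1024];   /* assumed ECC-protected master copy */
static uint32_t expected;         /* checksum computed at load time */

/* Simple additive checksum; a real system would use a CRC. */
static uint32_t checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/* Run periodically: verify the segment, reload it on a mismatch. */
static void checksum_poll(void)
{
    if (checksum(segment, sizeof segment) != expected) {
        printf("corruption detected; reloading segment\n");
        memcpy(segment, disk_copy, sizeof segment);  /* "reload from disk" */
    }
}

int main(void)
{
    memset(segment, 0xAB, sizeof segment);
    memcpy(disk_copy, segment, sizeof disk_copy);
    expected = checksum(segment, sizeof segment);

    segment[17] ^= 0x04;   /* simulate a single bit flip */
    checksum_poll();       /* detects the flip and reloads the segment */
    return 0;
}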
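
The run-it-twice idea can be sketched just as simply. In this hypothetical version, run_job() stands in for a deterministic application, and the output is accepted only when the two copies agree byte for byte:

#include <string.h>
#include <stdio.h>

#define OUT_LEN 256

/* Stand-in for a deterministic application; in practice this would be
 * the real job, run twice against the same input. */
static void run_job(const unsigned char *in, size_t len, unsigned char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = (unsigned char)(in[i] ^ 0x5A);   /* made-up computation */
}

/* Run the job twice; accept the output only if the copies agree. */
static int run_with_comparison(const unsigned char *in, size_t len,
                               unsigned char *accepted)
{
    unsigned char a[OUT_LEN], b[OUT_LEN];

    for (int attempt = 0; attempt < 3; attempt++) {
        run_job(in, len, a);
        run_job(in, len, b);
        if (memcmp(a, b, len) == 0) {    /* copies agree: accept */
            memcpy(accepted, a, len);
            return 0;
        }
        /* copies differ: assume a bit flip hit one run; discard and retry */
    }
    return -1;   /* persistent disagreement: report a fault */
}

int main(void)
{
    unsigned char in[OUT_LEN] = "an image to process";
    unsigned char out[OUT_LEN];

    if (run_with_comparison(in, sizeof in, out) == 0)
        printf("outputs agreed; result accepted\n");
    return 0;
}

Note that comparing two copies only detects faults; with three copies and a vote, the system could mask them as well, at higher cost.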
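
For the keep-alive checks, a minimal sketch might look like the following. The component count, the heartbeat counters, and restart_component() are all assumptions; the supervisor loop is the piece that could live in the TMR-protected part of the hardware.

#include <stdint.h>
#include <stdio.h>

#define NCOMPONENTS 4

static volatile uint32_t heartbeat[NCOMPONENTS];  /* bumped by components */
static uint32_t last_seen[NCOMPONENTS];           /* supervisor's last view */

/* Hypothetical restart hook; a real system would reload and
 * reinitialize the failed component. */
static void restart_component(int id)
{
    printf("component %d stalled; restarting\n", id);
}

/* Called by component `id` from its main loop whenever it makes progress. */
static void keep_alive(int id)
{
    heartbeat[id]++;
}

/* Called periodically by the supervisor: any counter that has stopped
 * advancing marks a component that should be restarted. */
static void supervisor_poll(void)
{
    for (int id = 0; id < NCOMPONENTS; id++) {
        if (heartbeat[id] == last_seen[id])
            restart_component(id);
        last_seen[id] = heartbeat[id];
    }
}

int main(void)
{
    for (int id = 0; id < NCOMPONENTS; id++)
        keep_alive(id);
    supervisor_poll();         /* all counters advanced: no restarts */

    keep_alive(0); keep_alive(1); keep_alive(2);   /* component 3 is stuck */
    supervisor_poll();         /* component 3 stalled: restart it */
    return 0;
}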
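
Finally, the compiler-redundancy idea boils down to duplicated data with a consistency check on every access. Here is a hand-written version of the kind of code such a compiler might emit; storing the second copy complemented, so that the two copies never hold identical bit patterns, is just one assumed encoding:

#include <stdint.h>
#include <stdio.h>

/* Each value is stored twice, once complemented; a single bit flip in
 * either copy makes the pair inconsistent and is caught on the next read. */
typedef struct {
    uint32_t val;
    uint32_t inv;   /* always ~val */
} dup32;

static void dup_store(dup32 *d, uint32_t v)
{
    d->val = v;
    d->inv = ~v;
}

/* Returns 0 and the value if the copies agree, -1 if corruption is seen. */
static int dup_load(const dup32 *d, uint32_t *out)
{
    if (d->val != ~d->inv)
        return -1;          /* bit flip detected */
    *out = d->val;
    return 0;
}

int main(void)
{
    dup32 x;
    uint32_t v;

    dup_store(&x, 42);
    x.val ^= 0x100;         /* simulate a bit flip in one copy */
    if (dup_load(&x, &v) != 0)
        printf("corruption detected; recompute or reload the value\n");
    return 0;
}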
 
Bottom line, though, is that one bit flip per 8 hours is a very low rate, a level of errors that probably occurs in pretty much any computer system! We don’t usually think of computers as unreliable in this sense, even though we do want to protect against such problems. NASA may be more directly confronted with the reality of the situation, but relatively mild solutions may be quite adequate for their needs!