CS514 Fall 2000: Homework 4.  Due in class on Tuesday, Oct. 10

 

NASA wants to send normal computers into space.  How should they deal with reliability in an environment where a bit might flip as often as once every 8 hours?

 

There are a number of ways to approach this problem.  What I was hoping to see is more or less a form of “triage” in which your answer would recognize that:

§         An unprotected computer is a complex system

§         The likelihood of errors differs in different subsystems

 

In fact, we need to keep in mind that one bit flip every 8 hours is really a pretty low rate; with, say, 256 MB of RAM on board, that's one bit in roughly two billion flipping per 8-hour period.  Software bugs probably corrupt state at least this often!  So in a certain sense, it isn't actually clear that we need to do anything out of the ordinary.

 

But my suggestion to NASA is that they use a mixture of simple methods to make it more likely that failures will be detected, and then restart the failed system components.  Critical functions for operating the mission and for doing this checking can live in the TMR (triple modular redundancy) part of the hardware, which supposedly won't experience problems.  Examples of checks they could run:

§         They can periodically compute a checksum over any part of the system that shouldn't change while in use, like the code segment of the operating system, and reload that segment from disk if a problem is found (a minimal sketch appears after this list).  The disk uses error-correcting codes in hardware and hence should be less prone to problems than the RAM.  The longer something sits in memory, the bigger the risk that it will be corrupted, so they should also push things to disk quickly and not cache as aggressively as normal computer systems do.  NASA should aim for a job mix in which anything that sits around for a long time checks itself and can restart itself now and then if corruption is detected, while everything else gets into memory, runs fast, and gets out fast.

§         They can run applications twice (or run two copies side by side) and compare the outputs.  If the outputs differ, discard them and simply try again (see the run-twice sketch after this list).

§         They could use a fancier scheme like the one employed by Hermann Kopetz in his work on the MARS system in Austria.  But that work is aimed mostly at real-time applications, and NASA may be more focused on mundane tasks like image processing.

§         They should certainly use self-checking mechanisms such as “keep alive” checks, so that if something does experience a problem it has a good chance of noticing and restarting itself (see the keep-alive sketch after this list).  The system as a whole needs fault-tolerance mechanisms that make restart easy and quick.

§         They might be able to hack the compiler to generate redundant code and duplicated data.  This is a different way to accomplish the basic idea of running applications twice, side by side, though perhaps a bit more complex (see the duplication sketch after this list).
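
Here is a minimal sketch, in C, of the periodic-checksum idea from the first bullet.  The rotate-and-xor checksum is just one simple choice (a real system might prefer a CRC), and reload_segment() is a hypothetical routine standing in for whatever mechanism refetches a region from the ECC-protected disk.

#include <stddef.h>
#include <stdint.h>

/* Simple checksum over a memory region: rotate left, mix in each byte. */
static uint32_t checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (sum << 1 | sum >> 31) ^ p[i];
    return sum;
}

extern void reload_segment(void *addr, size_t len);   /* hypothetical */

/* Run from a timer, e.g. once a minute, for each region that should
   never change in place, such as the OS code segment. */
void check_region(void *addr, size_t len, uint32_t expected)
{
    if (checksum((const uint8_t *)addr, len) != expected)
        reload_segment(addr, len);    /* corrupted: refresh from disk */
}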
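And a sketch of the run-twice-and-compare bullet, assuming the application can be packaged as a deterministic function that writes its result into a buffer.  process_image() is a hypothetical stand-in for the job; any deterministic computation fits.

#include <stdlib.h>
#include <string.h>
#include <stddef.h>

extern void process_image(const void *in, size_t in_len,
                          void *out, size_t out_len);  /* hypothetical job */

/* Run the job twice into separate buffers and accept the result only if
   both copies agree; otherwise discard the outputs and try again.
   Returns 0 on success, -1 on persistent disagreement. */
int run_checked(const void *in, size_t in_len,
                void *out, size_t out_len, int max_tries)
{
    void *shadow = malloc(out_len);
    if (shadow == NULL)
        return -1;
    for (int t = 0; t < max_tries; t++) {
        process_image(in, in_len, out, out_len);
        process_image(in, in_len, shadow, out_len);
        if (memcmp(out, shadow, out_len) == 0) {
            free(shadow);
            return 0;                 /* outputs agree: accept */
        }
        /* outputs differ: a bit probably flipped somewhere; retry */
    }
    free(shadow);
    return -1;
}

Note that this catches transient corruption but not a deterministic bug, which would produce the same wrong answer twice.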
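The keep-alive bullet might look like the following sketch, in which each task bumps a heartbeat counter as it runs, and a supervisor loop (which would live in the TMR-protected part of the hardware) restarts any task whose counter has stopped advancing.  restart_task() and the fixed task table are illustrative assumptions.

#include <stdint.h>

#define NTASKS 8

struct task_state {
    volatile uint32_t heartbeat;   /* incremented by the task itself */
    uint32_t last_seen;            /* last value the supervisor saw */
};

struct task_state tasks[NTASKS];

extern void restart_task(int id);  /* hypothetical, runs in TMR part */

/* Supervisor loop body, called once per watchdog period: any task whose
   counter has not advanced is presumed stuck or corrupted. */
void watchdog_scan(void)
{
    for (int id = 0; id < NTASKS; id++) {
        if (tasks[id].heartbeat == tasks[id].last_seen)
            restart_task(id);
        tasks[id].last_seen = tasks[id].heartbeat;
    }
}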
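Finally, a hand-written sketch of what compiler-generated data duplication might look like: each critical variable is stored twice (the second copy complemented, so that identical bit flips in both copies still disagree) and the copies are checked on every read.  The dup32 type and its accessors are illustrative only; a real compiler pass would emit such pairs automatically at every protected store and load.

#include <stdint.h>
#include <stdlib.h>

struct dup32 {
    uint32_t a, b;                 /* two copies: b holds the complement */
};

static inline void dup_set(struct dup32 *d, uint32_t v)
{
    d->a = v;
    d->b = ~v;
}

static inline uint32_t dup_get(const struct dup32 *d)
{
    if (d->a != ~d->b)
        abort();                   /* mismatch: let restart machinery kick in */
    return d->a;
}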

 

Bottom line, though, is that one bit flip per 8 hours is a very low error rate, and probably a level of errors that occurs in pretty much any computer system!  We don't normally think of computers as unreliable in this sense, even though we might well want to protect against such problems.  NASA may be more directly confronted with the reality of the situation, but relatively mild solutions may be quite adequate for their needs!