This lecture introduces the concept of scalable and autonomic mechanisms that might support complex large-scale distributed systems.

To motivate the topic, we look at some of the ambitious large-scale technologies people are hoping to build today. The first part of the lecture centers on an example from the US Air Force: a complex, modular, componentized system that would be distributed over huge networks and would link all sorts of information sources in support of tactical decision making, both in command centers and in the field. The purpose is to give students a glimpse of a large-scale system that isn’t at all like the Akamai CDN or an Amazon web datacenter.

Then we start to look at how hard it can be for a system like this to orchestrate a coordinated reaction to faults. Drilling down, we run into problems with fault detection and fault handling even in a trivial two-process scenario!
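
One way to see why even two processes are troublesome is the toy model below. It is only a sketch under stated assumptions, not the lecture's own example: process p pings process q and suspects a failure if no reply arrives within a timeout. The names Peer, observe, and suspect are hypothetical, and the network is simulated rather than real. The point is that p observes exactly the same thing whether q has crashed or is merely slow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    """Hypothetical model of process q, as seen from process p."""
    crashed: bool
    reply_delay: float  # seconds q would take to answer a ping, if alive


def observe(peer: Peer, timeout: float) -> Optional[float]:
    """Return the reply delay p sees, or None if no reply arrives in time."""
    if peer.crashed or peer.reply_delay > timeout:
        return None
    return peer.reply_delay


def suspect(peer: Peer, timeout: float) -> bool:
    """Timeout-based detector: p suspects q when it hears nothing."""
    return observe(peer, timeout) is None


# Three illustrative cases (values chosen arbitrarily for the sketch).
crashed_q = Peer(crashed=True, reply_delay=0.0)
slow_q = Peer(crashed=False, reply_delay=5.0)
healthy_q = Peer(crashed=False, reply_delay=0.1)

for name, q in [("crashed", crashed_q), ("slow", slow_q), ("healthy", healthy_q)]:
    print(name, "suspected" if suspect(q, timeout=1.0) else "trusted")
```

With a one-second timeout, p suspects both the crashed q and the slow q, because in both cases it simply hears nothing. Lengthening the timeout reduces false suspicions but delays the reaction to real crashes, and no finite value removes the ambiguity if message delays are unbounded.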