
Continuing to explore the notion of time, this lecture focuses on the meaning of simultaneity, which turns out to be hard to define in a distributed setting since any concurrent events might potentially have happened simultaneously... but might not have. This lecture is a bit short because the ideas are hard to get used to, so take your time and solicit student suggestions and questions along the way.

For example, suppose that while process p performs actions A and B, process q performs C (and they don't communicate during this period). C might happen before A, at the same time as A, at the same time as B, or after B. We can't tell: all are plausible "simultaneous states".
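One way to make this concrete for students is to enumerate the candidate "simultaneous states" directly. The sketch below is only illustrative: the names `events`, `messages`, `is_consistent` and `consistent_cuts` are made up for this example, and a cut is represented simply as a count of how many events each process has completed. A cut is treated as consistent when it never includes the receipt of a message without also including its send.

```python
from itertools import product

# Events of each process in local order (the example from the text).
events = {"p": ["A", "B"], "q": ["C"]}

# Each message is a (send_event, receive_event) pair.
# Empty here because p and q do not communicate during this period.
messages = []

def position(event):
    """Return (process, 1-based index) of an event in its process."""
    for proc, evs in events.items():
        if event in evs:
            return proc, evs.index(event) + 1
    raise KeyError(event)

def is_consistent(cut):
    """cut maps each process to the number of its events already done.
    The cut is inconsistent if it contains a receive but not its send."""
    for send, recv in messages:
        send_proc, send_idx = position(send)
        recv_proc, recv_idx = position(recv)
        if recv_idx <= cut[recv_proc] and send_idx > cut[send_proc]:
            return False
    return True

def consistent_cuts():
    """Enumerate every cut and keep the consistent ones."""
    procs = sorted(events)
    ranges = [range(len(events[proc]) + 1) for proc in procs]
    cuts = [dict(zip(procs, counts)) for counts in product(*ranges)]
    return [cut for cut in cuts if is_consistent(cut)]

for cut in consistent_cuts():
    print(cut)
```

With no messages, all six cuts are plausible. If we pretend that A sends a message that C receives (append `("A", "C")` to `messages`), the cut in which C has happened but A has not is ruled out, leaving five.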

Chandy and Lamport formalize this notion (consistent snapshots) and show how an algorithm can capture some plausible consistent snapshot. The algorithm won't find a genuinely simultaneous state, since it runs over a period of time. But the state it records is one that might have occurred simultaneously, had the scheduler run a little faster here and a little slower there. And this is quite a useful concept, as we'll see.

First we look at an example or two of how we might use consistent snapshots. For the Cornell course, the bank "audit" application will need this mechanism.

Next we explore the intuition that as a temporal "wave" spreads through a system, we're actually sitting on the frontier of a consistent cut and could use that opportunity to make a checkpoint and start recording channel contents.

We develop this into the CL snapshot algorithm... first in words but then in pictures.
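For students who prefer code to pictures, here is a minimal single-threaded sketch of the marker rules, assuming reliable FIFO channels between every pair of processes and a toy "bank" in which messages are money transfers. The class names (`Process`, `Network`) and the way delivery is driven by hand are assumptions made for illustration, not part of the course materials.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, balance):
        self.balance = balance
        self.recorded_state = None  # local checkpoint, once taken
        self.recording = {}         # source -> messages seen since checkpoint
        self.channel_state = {}     # source -> final recorded channel contents

class Network:
    def __init__(self, balances):
        self.procs = {name: Process(b) for name, b in balances.items()}
        # One FIFO channel per ordered pair of distinct processes.
        self.chan = {(s, d): deque()
                     for s in self.procs for d in self.procs if s != d}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def start_snapshot(self, name):
        """Checkpoint, start recording all incoming channels, send markers."""
        p = self.procs[name]
        p.recorded_state = p.balance
        p.recording = {s: [] for s in self.procs if s != name}
        for d in self.procs:
            if d != name:
                self.chan[(name, d)].append(MARKER)

    def deliver(self, src, dst):
        """Deliver the oldest message on channel src -> dst."""
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self.start_snapshot(dst)  # first marker: checkpoint now
            # The marker closes this channel's recording.
            p.channel_state[src] = p.recording.pop(src)
        else:
            p.balance += msg
            if src in p.recording:        # in flight when we checkpointed
                p.recording[src].append(msg)

    def snapshot_total(self):
        return sum(p.recorded_state
                   + sum(sum(msgs) for msgs in p.channel_state.values())
                   for p in self.procs.values())

net = Network({"a": 100, "b": 50})
net.send("a", "b", 10)     # 10 is now in flight from a to b
net.start_snapshot("b")    # b checkpoints at 50 and sends a marker to a
net.send("b", "a", 5)      # sent after b's checkpoint, behind the marker
net.deliver("a", "b")      # b records the in-flight 10 as channel state
net.deliver("b", "a")      # marker reaches a: a checkpoints at 90
net.deliver("b", "a")      # the 5 arrives after a's checkpoint, not recorded
net.deliver("a", "b")      # a's marker closes channel a -> b at b
print(net.snapshot_total())
```

The recorded checkpoints (90 and 50) were taken at different moments and never coexisted, yet together with the recorded channel contents they account for all 150 units of money in the system.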

Students with a practical bias may worry that FIFO channels are a potentially big cost: will my little 32-byte message have to wait behind your 10Mb message? Explain that just as we use threads for concurrency in our applications, we can also imagine running a snapshot "side by side" with other application-specific traffic. In effect, arriving messages would be cloned: one path would run application code, while the other would put messages into FIFO order (delaying them if needed) and then log them per the algorithm. The snapshot mechanism would thus avoid imposing a big overhead on the underlying application.
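The "cloned path" idea can be sketched as follows. This is a hypothetical design, not something the lecture prescribes: it assumes each sender stamps its messages with a per-sender sequence number starting at 0, and the names `SnapshotTap`, `on_arrival` and `app_handler` are invented for the example.

```python
class SnapshotTap:
    """Side path for snapshots: the application sees each message on
    arrival, while the snapshot log sees messages in per-sender order."""

    def __init__(self, app_handler):
        self.app_handler = app_handler  # callback taking (sender, payload)
        self.next_seq = {}              # sender -> next expected sequence no.
        self.pending = {}               # sender -> {seq: payload} held back
        self.log = []                   # FIFO-ordered (sender, seq, payload)

    def on_arrival(self, sender, seq, payload):
        # Application path: handed over immediately, never delayed.
        self.app_handler(sender, payload)

        # Snapshot path: hold the copy until all earlier ones have arrived.
        held = self.pending.setdefault(sender, {})
        held[seq] = payload
        expected = self.next_seq.get(sender, 0)
        while expected in held:
            self.log.append((sender, expected, held.pop(expected)))
            expected += 1
        self.next_seq[sender] = expected
```

If the small message (sequence number 1) arrives before the large one (sequence number 0), the application handles it at once, while the snapshot log stays empty until the large message shows up and then records both in sending order.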

This leads to a wrap-up in which we talk about application-specific "specializations" of general mechanisms. I find it helpful to discuss the strange idea that the snapshots we produce satisfy a "logical" definition of simultaneity, yet are anything but instantaneous... and that attempting truly instantaneous snapshots might be impractical and can yield inconsistent output! Our audits of the banking project, for example, are never truly instantaneous audits. Yet they are consistent and that is, one hopes, the real need...