Checkpoint Recovery Simulator



We have performed an experimental evaluation of several checkpoint recovery algorithms using a detailed simulation model. The advantage of using a simulator is that the algorithms can be easily compared over different hardware configurations. You will find a summary of our experimental study in our blog and the full study in our paper.

Our simulator models several performance aspects of the checkpoint recovery algorithms. The main equations used in our simulator are shown in the following figure:

The main performance parameters that affect the algorithms are memory and disk bandwidths, memory latency, as well as overheads associated with bit manipulation operations and locking.

Given values for these performance parameters, you can explore how the algorithms would behave under any hardware configuration of interest. In particular, if you plan to use our simulator to model a hardware configuration you own, you can use a set of micro-benchmarks to tune the following simulation parameters:

  1. Memory bandwidth: We measure effective memory bandwidth by repeated memcpy calls using aligned data, each call copying an order of magnitude more data than the size of the L2 cache of the machine.
  2. Memory latency: We measure memory latency using another memcpy benchmark whose memory reference pattern mixes sequential and random access. The intent is to take into account both hardware cache-miss latency and the startup cost of the memcpy implementation.
  3. Lock overhead: We measure lock overhead as the aggregate cost of uncontested calls to pthread_spin_lock, again with a mixture of sequential and random access patterns.
  4. Bit test overhead: The bit test overhead is intended to model the cost of the dirty-bit check that must be added to the simulation loop for the copy-on-update variants of the algorithms. This benchmark measures the incremental cost of naive code to count dirty bits, roughly half of which are set. The code is added to a loop intended to model the memory reference behavior of the update phase of the game simulation loop.
  5. Disk bandwidth: We measure the effective disk bandwidth by performing large sequential writes to a block device allocated to our recovery disk.
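
As a rough illustration of the first and fourth micro-benchmarks above, the following C sketch times repeated memcpy calls over aligned buffers and counts set bits in a dirty bitmap with a deliberately naive loop. It is not the benchmark code used in our study: the function names (measure_memcpy_bandwidth, count_dirty_bits, now_seconds), the 64-byte alignment, and the convention of reporting bytes copied per second are all assumptions made for this example. The buffer size is left to the caller, who would pick it relative to the L2 cache size of the machine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Estimate effective memory bandwidth as bytes copied per second, using
 * `reps` memcpy calls over 64-byte-aligned buffers of `buf_bytes` each.
 * (Assumption: some conventions count read + write traffic instead,
 * which would double the figure.)
 * Returns a negative value on bad arguments, allocation failure, or if
 * the clock did not advance.
 */
double measure_memcpy_bandwidth(size_t buf_bytes, int reps) {
    void *src_mem = NULL;
    void *dst_mem = NULL;
    if (buf_bytes == 0 || reps <= 0) return -1.0;
    if (posix_memalign(&src_mem, 64, buf_bytes) != 0) return -1.0;
    if (posix_memalign(&dst_mem, 64, buf_bytes) != 0) {
        free(src_mem);
        return -1.0;
    }
    unsigned char *src = src_mem;
    unsigned char *dst = dst_mem;

    /* Touch both buffers so page faults are not charged to the copies. */
    memset(src, 0xA5, buf_bytes);
    memset(dst, 0, buf_bytes);

    volatile unsigned char sink = 0;
    double start = now_seconds();
    for (int i = 0; i < reps; i++) {
        src[0] = (unsigned char)i;       /* make each copy distinct */
        memcpy(dst, src, buf_bytes);
        sink ^= dst[buf_bytes - 1];      /* observe the copied data */
    }
    double elapsed = now_seconds() - start;
    (void)sink;

    free(src_mem);
    free(dst_mem);
    if (elapsed <= 0.0) return -1.0;
    return (double)buf_bytes * (double)reps / elapsed;
}

/*
 * Naive dirty-bit count: test the first `nbits` bits of `bitmap` one at
 * a time (least-significant bit of each byte first) and count those set.
 */
size_t count_dirty_bits(const unsigned char *bitmap, size_t nbits) {
    assert(bitmap != NULL);
    size_t dirty = 0;
    for (size_t i = 0; i < nbits; i++) {
        if (bitmap[i / 8] & (1u << (i % 8))) {
            dirty++;
        }
    }
    return dirty;
}
```

For example, calling measure_memcpy_bandwidth with a buffer roughly ten times the L2 cache size and a few dozen repetitions gives one bandwidth estimate. To approximate the bit test overhead, one could time an update loop with and without the count_dirty_bits logic on a bitmap in which about half the bits are set, and take the difference. An optimizing compiler may still transform these loops, so results are best checked at the optimization level you actually intend to model.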


All the software below is provided under the Apache 2.0 license.