SNS Staged Simulator : Motivation

SNS uses staged simulation, a technique for improving the run time performance and scale of discrete event simulators. Typical network simulations are limited in speed and scale due to redundant computations, both within a single simulation run and between successive runs. Staged simulation restructures discrete event simulators to operate in stages that precompute, cache, and reuse partial results to drastically reduce redundant computation within and across simulations.

Experience with applying staged simulation in the SNS simulator shows that staging can improve execution time by an order of magnitude or more and enable the simulation of wireless networks with tens of thousands of nodes.

Why do we need staging?

Network simulations are critical for the design and evaluation of distributed systems and networking protocols. Especially in newly emerging domains such as mobile ad hoc and sensor networks, simulations are crucial to evaluating new systems and protocols across a range of deployment scenarios. While thorough evaluations require accurate, efficient and scalable network simulators, achieving all three of these properties simultaneously is a challenging task. We assert that a significant source of inefficiency in discrete event simulators stems from redundant computation, and that this wasted computation presents a significant limit to simulation speed and scale.

Staging addresses this inefficiency by identify and reducing two different classes of redundancy in traditional discrete-event network simulators.

The first class of redundant computation occurs within a single run of the simulator, and stems from an overly strict notion of accuracy. Traditional network simulators are conservative, that is, they reevaluate complex functions whenever their results may have changed, even though in reality the results may have changed very little, if at all, since the last time they were evaluated. An illustrative example is the computation of a node's one hop neighbor-set in wireless network simulations. This is an expensive and frequently-used primitive in wireless simulations. It computes the positions of all nodes relative to the sending node, calculates the power level of the signal at each receiver, and determines the set of nodes that can capture transmitted packets. In the most general case, nodes may change position, alter sending and receiving power thresholds, or modify antenna models or parameters at any time. A conservative wireless network simulator which has computed the neighbors of a wireless node at time t will still recompute the neighbor-set from scratch at time step t+ε. In reality, however, few wireless simulation scenarios will change dramatically over short time-scales. Nodes move smoothly and relatively slowly, if at all. Other variables, such as antenna model, reception threshold, and transmission power change infrequently. In general, the neighbor-set computed at time t is likely to remain valid for some time, changing slowly and gradually as nodes move into or out of range. Much of the time spent re-computing neighbor-sets from scratch at each time step is therefore effectively wasted.

A second class of redundant computation occurs between multiple runs of a simulator, and stems from performing calculations which were already computed in previous runs. Simulations are often performed in batches of tens or hundreds of simulation runs, with only slightly varying simulation parameters within each batch. While each run will differ in some inputs, such as the random number seed, network traffic pattern, or protocol parameters, often there is still a significant overlap between the work performed in multiple invocations of the simulator. Executing each simulation independently and without the benefit of past computations leads to redundantly computing many identical results.

How does staging work?

Function caching and reuse forms the foundation of the staging approach. Without loss of generality, each event in a discrete-event simulator can be treated as a sequence of function invocations. A staged simulator caches arguments, results and side-effects of function invocations, and later avoids redundant computations by reusing results and reapplying side effects when the same function is invoked with the same arguments. This space-for-time trade-off ensures that each expensive function invocation is computed at most once. While function caching and reuse is effective at eliminating redundant computations, it is not directly suitable for application in realistic wireless simulators because it is overly sensitive to changes in the arguments to function invocations. In typical wireless simulations, many computations depend on continuously varying inputs, such as the current simulator time, and any change in inputs, no matter how small, precludes reuse of previous results. Similarly, real-valued parameters make it difficult to achieve effective cache hit rates. These properties make naive function caching ineffective for a large class of computations in a simulator.

Event-restructuring improves on function caching by modifying the low-level events in a discrete event simulator such that their results are reusable even when a change in inputs would normally preclude reuse. We introduce three comprehensive techniques, called currying, incremental computation and auxiliary results, for decomposing the events in a discrete-event simulator into a series of events with an equivalent effect. The equivalent transformations retain the semantics of the original event and preserve accuracy. The resulting smaller subcomputations, whose dependencies are better isolated, can then be reused more frequently, leading to significant improvements in simulation speed and scale.

Time-shifting takes advantage of the small, fine-grained events created through event-restructuring by reordering them into an equivalent, but more efficient, schedule. Restructured, fine-grained events have fewer inputs and tend to be time-independent. Consequently, restructuring provides the simulator with freedom to judiciously pick when to schedule these events. Time-shifting can speed up simulations through architectural or algorithmic improvements. First, time-shifting can simply reorder the events into a sequence that is better suited to the machine architecture on which the simulator is running. For instance, time-shifting events to access memory sequentially instead of randomly can speed up execution by taking advantage of memory caching and prefetching. And second, time-shifting and batching similar operations can enable a series of piecemeal, small, consecutive events to be computed by a single, more efficient algorithm. For instance, time-shifting sufficiently many node distance calculations may allow a long sequence of such computations to be replaced with an efficient all-pairs shortest paths algorithm.

Overall, staging relies on function caching and reuse to eliminate redundancies, restructuring to break events into smaller computations that can be reused readily, and time-shifting to pick an efficient schedule for function evaluation. In general, staging can be applied to a single simulation run, which we term intra-simulation staging, or across a set of similar simulation runs, which we term inter-simulation staging.