Recovery for Games and Simulations
Massively multiplayer online games (MMOs) are persistent virtual worlds that allow tens of thousands of users to interact in fictional settings. Users typically select a virtual avatar and collaborate with other users to solve puzzles or complete quests. These games are extremely popular, and successful MMOs have millions of subscribers and have generated billions of dollars in revenue. Unlike single player computer games, MMOs must persist across user sessions. Players can leave the game at any time, and they expect their achievements to be reflected in the world when they rejoin. Similarly, it is unacceptable for the game to lose player data in the event of a crash. These demands make it essential for MMOs to ensure that their state is durable. As such, MMO developers have been forced to develop ad-hoc solutions or invest in expensive special purpose hardware to achieve some degree of fault tolerance. Our research is focusing on providing low cost solutions with novel checkpoint recovery algorithms. Our solutions have been targeted towards MMOs but can be extended to eventually provide fault tolerance for data-driven applications in the cloud.
Typical MMOs follow the architecture outlined below:
Clients join the virtual world through a connection server that connects them to a single shard. Shards are independent versions of the virtual world aimed at improving scalability. Shards are not synchronized, and players on one shard cannot interact with players on another.
Most MMOs operate on a tick-based game loop model. Each iteration of the game loop is called a tick. In a tick, the game logic needs to perform several tasks: receive player input, process player actions, calculate and then apply updates to the game state. The updates in the game logic can be divided into two categories: transactional updates and local updates. Transactional updates consist of a small subset of updates, such as item and currency exhange, which require transactional guarantees and are typically implemented using a standard DBMS. In constrast, local updates, which contribute a vast majority of updates, are designed to avoid inconsistent states and hence do not require transactional guarantees. For example, local updates include behaviors such as character movement, which is processed using collision detection algorithms.
We observe that this tick model corresponds to a restricted form of snapshot isolation. In particular, only the snapshot at the end of a tick is necessary to determine the local updates for the next tick. Furthermore, since updates are completely applied during a single tick, the game is in a consistent state at the end of every tick. This provides a natural point of consistency at which to take checkpoints. It is important to note that not only MMOs but also many other cloud applications follow the tick model. That creates a great motivation to develop novel checkpointing techniques for these types of applications.
Providing durability for local updates with uniform latency and low overhead is the major challenge. MMO servers often have to process hundreds of thousands updates per second and execute ticks at a frequency of roughly 10Hz. It is extremely important that the server maintains a constant tick rate and avoid latency spikes that would create "hiccups" in the game. Uneven latency affects gameplay and requires explicit compensation by MMO developers. In addition, the entire checkpointing process must fit into the game simulation loop so that the game loop still runs at 10 ticks per second. In order to achieve that, the checkpointing process must incur low overhead, as checkpoint overhead translates into an effectively lower tick rate on the server.
While revisiting main-memory database recovery techniques in the context of MMOs, we have performed an experimental evaluation of several checkpoint recovery algorithms using a detailed simulation model. As a first result of our investigation, we concluded that existing checkpoint-recovery algorithms developed for main memory DBMSs could be applied to MMO workloads but there was no single algorithm which outperformed all others over a wide range of update rates. Some algorithms require locking which excessively degrade the performance at high update rates while the others use bulk-coping of the game state which introduce unacceptable latency spikes.
Recently, we have proposed a set of new checkpoint-recovery algorithms that dramatically reduce overhead and latency. Our work is to appear at SIGMOD 2011 with both a paper and a system demonstration.
- "An Evaluation of Checkpoint Recovery for Massively Multiplayer Online Games." VLDB, Lyon, France, August 2009.
Tuan Cao, Marcos Vaz Salles, Benjamin Sowell, Yao Yue, Alan Demers, Johannes Gehrke, Walker White,
Fast Checkpoint Recovery Algorithms for Frequently Consistent Applications.
In Proc. of ACM SIGMOD 2011.
Tuan Cao, Benjamin Sowell, Marcos Vaz Salles, Alan Demers, Johannes Gehrke,
BRRL: A Recovery Library for Main-Memory Applications in the Cloud (Demo Paper).
In Proc. of ACM SIGMOD 2011.
Marcos Vaz Salles, Tuan Cao, Benjamin Sowell, Alan Demers, Johannes Gehrke, Christoph Koch, and Walker White,
An Evaluation of Checkpoint Recovery for Massively Multiplayer Online Games.
In Proc. of the 2009 VLDB Conf. on Very Large Databases (VLDB 2009).
This research has been supported by the National Science Foundation under Grant IIS-0725260, by the Air Force Office of Scientific Research, by the iAd Project funded by the Research Council of Norway and by a grant from Microsoft. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.