CS Colloquium
Tuesday, April 1, 2003
4:15 PM
B17 Upson Hall

Aaron Brown
University of California, Berkeley

 

Creating Undo for Operators: A Human-Centric Approach to Recovery 

Nearly every productivity application on the market today offers its users the ability to undo their actions, allowing them to recover from their mistakes and to experiment with unfamiliar features in a forgiving environment.  Surprisingly, this safety net of undo is virtually nonexistent in the domain of systems operations and management, where human error is the largest single contributor to system and service outages.  This talk will address the challenge of creating an undo facility for system operators, introducing an architecture for adding operator-undo functionality as a wrapper around existing network service applications such as electronic mail, online shopping, and auction services.

Our operator-undo architecture leverages proven techniques from the systems and database domains, including non-overwriting storage, redo recovery, proxy-based logging of user requests, and predicate-based consistency management.  It adapts and integrates these techniques to provide a recovery tool that offers a virtual time-travel interface to the system operator.  Using our undo system, an operator can rewind system state to nullify the effects of human error, software misbehavior, or external attack; can retroactively insert arbitrary changes into the system to repair the problem or forestall future occurrences; and can replay the system forward in time, merging the original system timeline with the retroactively inserted repairs and compensating for any inconsistencies that arise as a result.
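
The rewind, repair, and replay cycle described above can be illustrated with a toy sketch. This is not the speaker's implementation: the names UndoableMailStore, Request, checkpoint, rewind, repair, replay, and demo are all hypothetical, the "mail store" is just an in-memory dictionary, and the sketch assumes that rewinding only ever targets a previously taken checkpoint. It models a request log, never-mutated snapshots, and redo-style replay; predicate-based consistency management and compensation for inconsistencies are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, List[str]]  # mailbox owner -> delivered message bodies


@dataclass(frozen=True)
class Request:
    """One logged user request (hypothetical shape, for illustration only)."""
    time: int
    user: str
    body: str


def _copy(state: State) -> State:
    # Copy every mailbox so a stored snapshot is never mutated in place.
    return {user: list(messages) for user, messages in state.items()}


class UndoableMailStore:
    """Toy wrapper around a mail store: log requests, snapshot, rewind, replay."""

    def __init__(self) -> None:
        self.state: State = {}
        self.log: List[Request] = []                          # log of user requests
        self.snapshots: List[Tuple[int, State]] = [(0, {})]   # (time, state) checkpoints
        self.filters: List[Callable[[Request], bool]] = []    # operator repairs

    def deliver(self, req: Request) -> None:
        # Record the request before applying it, so it can be replayed later.
        self.log.append(req)
        self._apply(req)

    def _apply(self, req: Request) -> None:
        # A request takes effect only if every installed repair accepts it.
        if all(accept(req) for accept in self.filters):
            self.state.setdefault(req.user, []).append(req.body)

    def checkpoint(self, time: int) -> None:
        # Assumption: the snapshot reflects all requests with req.time <= time.
        self.snapshots.append((time, _copy(self.state)))

    def rewind(self, time: int) -> int:
        # Restore the latest checkpoint taken at or before `time`.
        usable = [s for s in self.snapshots if s[0] <= time]
        snap_time, snap_state = max(usable, key=lambda s: s[0])
        self.state = _copy(snap_state)
        self.snapshots = usable  # later checkpoints belong to the old timeline
        return snap_time

    def repair(self, accept: Callable[[Request], bool]) -> None:
        # Retroactively insert a change: here, a filter on incoming requests.
        self.filters.append(accept)

    def replay(self, since: int) -> None:
        # Redo logged requests newer than the restored checkpoint.
        for req in self.log:
            if req.time > since:
                self._apply(req)


def demo() -> State:
    store = UndoableMailStore()
    store.deliver(Request(1, "alice", "hello"))
    store.checkpoint(1)
    store.deliver(Request(2, "alice", "SPAM offer"))
    store.deliver(Request(3, "bob", "lunch?"))

    restored = store.rewind(2)                      # back to the checkpoint at time 1
    store.repair(lambda r: "SPAM" not in r.body)    # the operator's retroactive fix
    store.replay(restored)                          # re-apply the logged requests
    return store.state
```

In demo(), the operator rewinds past the unwanted message, installs a filter as the repair, and replays the log; the legitimate message to bob that arrived after the bad one is preserved because it is still in the request log. A real system would also have to reconcile replayed results with what users have already seen, which this sketch ignores.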

A human-centric recovery tool like operator-undo requires an evaluation methodology that captures the influence of its human users on its effectiveness. To that end, the talk will also introduce an evaluation methodology that combines traditional systems benchmarking and human studies techniques, and will describe how we are using this methodology to evaluate the dependability and recovery benefits offered by a prototype implementation of operator-undo for an e-mail store service.