CS Colloquium
Tuesday, April 1, 2003
4:15 PM
B17 Upson Hall
Aaron Brown
University of California, Berkeley
Creating Undo for Operators: A Human-Centric Approach to Recovery
Nearly every productivity application on the market today offers its users the ability
to undo their actions, allowing them to recover from their mistakes and to
experiment with unfamiliar features in a forgiving environment. 
Surprisingly, this safety net of undo is virtually nonexistent in the
domain of systems operations and management, where human error is the largest
single contributor to system and service outages. 
This talk will address the challenge of creating an undo facility for
system operators, introducing an architecture for adding operator-undo
functionality as a wrapper around existing network service applications such as
electronic mail, online shopping, and auction services.
Our operator-undo architecture leverages proven techniques from the systems and
database domains, including non-overwriting storage, redo recovery, proxy-based
logging of user requests, and predicate-based consistency management. 
It adapts and integrates these techniques to provide a recovery tool that
offers a virtual time-travel interface to the system operator. 
Using our undo system, an operator can rewind system state to nullify the
effects of human error, software misbehavior, or external attack; can
retroactively insert arbitrary changes into the system to repair the problem or
forestall future occurrences; and can replay the system forward in time, merging
the original system timeline with the retroactively inserted repairs and
compensating for any inconsistencies that arise as a result.
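The rewind, repair, and replay cycle described above can be illustrated with a toy sketch. It is not the speaker's implementation: the class `UndoableStore`, its methods, and the `accept` predicate are hypothetical names, and a plain dictionary stands in for the service's state. Per-request snapshots stand in for non-overwriting storage, and the request log stands in for proxy-based logging.

```python
from dataclasses import dataclass, field


@dataclass
class UndoableStore:
    """Toy key-value 'service' with a request log and state snapshots.

    All names here are illustrative assumptions, not the talk's design.
    """
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)        # (time, key, value) requests
    snapshots: dict = field(default_factory=dict)  # time -> state before request

    def request(self, time, key, value):
        # Keep a copy of the old state rather than overwriting it in place.
        self.snapshots[time] = dict(self.state)
        self.log.append((time, key, value))
        self.state[key] = value

    def rewind(self, time):
        # Restore the state as it was just before the request made at `time`.
        self.state = dict(self.snapshots[time])

    def repair(self, fix):
        # Apply an operator-supplied change to the rewound state.
        fix(self.state)

    def replay(self, from_time, accept=lambda key, value: True):
        # Re-apply logged requests from `from_time` onward. Requests rejected
        # by `accept` are returned so they can be compensated for separately.
        # Snapshots are not refreshed in this sketch.
        skipped = []
        for time, key, value in self.log:
            if time < from_time:
                continue
            if accept(key, value):
                self.state[key] = value
            else:
                skipped.append((time, key, value))
        return skipped


store = UndoableStore()
store.request(1, "quota", 100)
store.request(2, "quota", 0)        # the mistaken change to be undone
store.request(3, "msg:1", "hello")  # later work that should survive

store.rewind(2)                              # back to just before the mistake
store.repair(lambda s: s.update(quota=200))  # retroactive fix
skipped = store.replay(3)                    # re-apply the later request
```

In this sketch the mistaken request at time 2 is dropped by replaying only from time 3, while the later message delivery is preserved on top of the repaired state.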
A human-centric recovery tool like operator-undo requires an evaluation methodology that captures the influence of its human users on its effectiveness. To that end, the talk will also introduce an evaluation methodology that combines traditional systems benchmarking and human studies techniques, and will describe how we are using this methodology to evaluate the dependability and recovery benefits offered by a prototype implementation of operator-undo for an e-mail store service.